Large files for relations

Started by Thomas Munro, over 2 years ago; 27 messages

#1 Thomas Munro <thomas.munro@gmail.com>
11 attachment(s)

Big PostgreSQL databases use and regularly open/close huge numbers of
file descriptors and directory entries for various anachronistic
reasons, one of which is the 1GB RELSEG_SIZE thing. The segment
management code is trickier than you might think and also still
harbours known bugs.
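
To make that concrete, here is a tiny standalone illustration (not part
of the patch set; the constants and block number are just examples) of
how md.c maps a block number to a 1GB segment file and an offset within
it today, versus a single offset in one large file. With the default
8KB BLCKSZ, a 1TB fork is spread over roughly a thousand "<relfilenode>.N"
files, while a single large file could grow to MaxBlockNumber * BLCKSZ,
roughly 32TB:

    #include <stdint.h>
    #include <stdio.h>

    #define BLCKSZ      8192        /* default PostgreSQL block size */
    #define RELSEG_SIZE 131072      /* blocks per 1GB segment */

    int
    main(void)
    {
        uint32_t    blkno = 500000;         /* an example block number */

        /* Legacy segmented format: which "<relfilenode>.N" file, where in it? */
        uint32_t    segno = blkno / RELSEG_SIZE;
        uint64_t    seg_off = (uint64_t) BLCKSZ * (blkno % RELSEG_SIZE);

        /* Large-file format: one file, one offset. */
        uint64_t    big_off = (uint64_t) BLCKSZ * blkno;

        printf("segmented: segment %u, offset %llu\n",
               segno, (unsigned long long) seg_off);
        printf("large file: offset %llu\n", (unsigned long long) big_off);
        return 0;
    }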

A nearby analysis of yet another obscure segment life cycle bug
reminded me of this patch set to switch to simple large files and
eventually drop all that. I originally meant to develop the attached
sketch-quality code further and try proposing it in the 16 cycle,
while I was down the modernisation rabbit hole[1], but then I got
sidetracked: at some point I believed that the 56 bit relfilenode thing
might be necessary for correctness, but then I found a set of rules
that seem to hold up without that. I figured I might as well post
what I have early in the 17 cycle as a "concept" patch to see which
way the flames blow.

There are various boring details due to Windows, and then a load of
fairly obvious changes, and then a whole can of worms about how we'd
handle the transition for the world's fleet of existing databases.
I'll cut straight to that part. Different choices on aggressiveness
could be made, but here are the straw-man answers I came up with so
far:

1. All new relations would be in large format only. No 16384.N
files, just 16384 that can grow to MaxBlockNumber * BLCKSZ.

2. The existence of a file 16384.1 means that this smgr relation is
in legacy segmented format that came from pg_upgrade (note that we
don't unlink that file once it exists, even when truncating the fork,
until we eventually drop the relation).

3. Forks that were pg_upgrade'd from earlier releases using hard
links or reflinks would implicitly be in large format if they only had
one segment, and otherwise they could stay in the traditional format
for a grace period of N major releases, after which we'd plan to drop
segment support. pg_upgrade's [ref]link mode would therefore be the
only way to get a segmented relation, other than a developer-only
trick for testing/debugging.

4. Every opportunity to convert a multi-segment fork to large format
would be taken: pg_upgrade in copy mode, basebackup, COPY DATABASE,
VACUUM FULL, TRUNCATE, etc. You can see approximately working sketch
versions of all the cases I thought of so far in the attached.

5. The main places that do file-level copying of relations would use
copy_file_range() to do the splicing, so that on file systems that are
smart enough (XFS, ZFS, BTRFS, ...) with qualifying source and
destination, the operation can be very fast, and other degrees of
optimisation are available to the kernel too even for file systems
without block sharing magic (pushing down block range copies to
hardware/network storage, etc). The copy_file_range() stuff could
also be proposed independently (I vaguely recall it was discussed a
few times before); it just really comes into its own when you start
splicing files together, as needed here. It has also been adopted by
FreeBSD with the same interface as Linux, and has an efficient
implementation in bleeding edge ZFS there.
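
To make the splicing idea above concrete, here is a minimal sketch (the
function name is invented for illustration, and this is not code from
the attached patches) of appending one old 1GB segment onto the end of a
growing large file with copy_file_range(). The real code would work in
~1MB chunks so it can check for interrupts, and would fall back to an
ordinary read/write loop if the call fails (e.g. with EXDEV on older
Linux kernels):

    /* Sketch only: assumes Linux 4.5+ or FreeBSD 13+ and a 64 bit off_t. */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int
    append_segment(int dst_fd, int src_fd)
    {
        struct stat src_st;
        struct stat dst_st;
        off_t       src_off = 0;
        off_t       dst_off;

        if (fstat(src_fd, &src_st) < 0 || fstat(dst_fd, &dst_st) < 0)
            return -1;
        dst_off = dst_st.st_size;   /* append at the current end of file */

        while (src_off < src_st.st_size)
        {
            ssize_t     n = copy_file_range(src_fd, &src_off,
                                            dst_fd, &dst_off,
                                            src_st.st_size - src_off, 0);

            if (n < 0)
                return -1;          /* fall back to a read/write loop */
            if (n == 0)
                break;              /* source shorter than expected */
        }
        return src_off == src_st.st_size ? 0 : -1;
    }

On file systems with block sharing (XFS, Btrfs, bleeding edge ZFS), the
kernel can satisfy such a call with reflinks rather than copying data.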

Stepping back, the main ideas are: (1) for some users of large
databases, the conversion would happen painlessly at upgrade time
without them even really noticing, using modern file system facilities
where possible for speed; (2) anyone who wants to defer that, because of
a lack of fast copy_file_range() and a desire to avoid prolonged
downtime by using links or reflinks, can put concatenation off for the
next N releases, giving a total of 5 + N years of option to defer the
work, and in that case there are also many ways to proactively change
to large format before the time comes, with varying degrees of
granularity and disruption. For example, set up a new replica and fail
over, or VACUUM FULL tables one at a time, etc.

There are plenty of things left to do in this patch set: pg_rewind
doesn't understand optional segmentation yet, there are probably more
things like that, and I expect there are some ssize_t vs pgoff_t
confusions I missed that could bite a 32 bit system. But you can see
the basics working on a typical system.

I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front? I think the main collateral damage would be weird old external
tools, like some old version of Windows tar I occasionally see
mentioned, that sort of thing, but that'd just be another case of
"well don't use that then", I guess? What else might we need to think
about, outside PostgreSQL?

What other problems might occur inside PostgreSQL? Clearly we'd need
to figure out a decent strategy to automate testing of all of the
relevant transitions. We could test the splicing code paths with an
optional test suite that you might enable along with a small segment
size (as we're already testing on CI and probably the buildfarm after
the last round of segmentation bugs). To test the messy Windows off_t
API stuff convincingly, we'd need actual > 4GB files, I think? Maybe
doable cheaply with file system hole punching tricks.
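
For what it's worth, a > 4GB test file doesn't have to cost 4GB of I/O
or disk on file systems that support sparse files. Something like this
sketch (path and sizes invented for illustration; needs a 64 bit off_t,
e.g. -D_FILE_OFFSET_BITS=64 on 32 bit systems) leaves a hole and then
exercises an offset that doesn't fit in 32 bits:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        const off_t big = ((off_t) 1 << 32) + 8192; /* just past 4GB */
        char        block[8192];
        int         fd;

        fd = open("pg_large_file_test", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return 1;
        if (ftruncate(fd, big) < 0)     /* leaves a hole; writes no data */
            return 1;
        memset(block, 0x5a, sizeof(block));
        if (pwrite(fd, block, sizeof(block), big - 8192) != (ssize_t) sizeof(block))
            return 1;                   /* exercises a > 4GB file offset */
        close(fd);
        return 0;
    }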

Speaking of file system holes, this patch set doesn't touch buffile.c.
That code wants to use segments for two extra purposes: (1) parallel
create index merges workers' output using segmentation tricks as if
there were holes in the file; this could perhaps be replaced with
large files that make use of actual OS-level holes, but I didn't feel
like additionally claiming that all computers have sparse files --
perhaps another approach is needed anyway; (2) buffile.c deliberately
spreads large buffiles across multiple temporary tablespaces
using segments, supposedly for space management reasons. So although
it initially looks like a nice safe little place to start using large
files, we'd need an answer to those design choices first.

/me dons flameproof suit and goes back to working on LLVM problems for a while

[1]: https://wiki.postgresql.org/wiki/AllComputers
[2]: https://en.wikipedia.org/wiki/Comparison_of_file_systems

Attachments:

0001-Assert-that-pgoff_t-is-wide-enough.patch (text/x-patch)
From b4b6f27af1d196f9d6b3b8d5991216666cf2900f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 24 Apr 2023 18:04:43 +1200
Subject: [PATCH 01/11] Assert that pgoff_t is wide enough.

On Windows, we know it's wide enough because we define it directly ourselves.
On Unix, we use off_t, which may only be 32 bits wide on some systems,
depending on compiler switches or macros.  Make absolutely certain that we are
not confused on this point with an assertion, or we'd corrupt large files.
---
 src/backend/storage/file/fd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 277a28fc13..053588a302 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -102,6 +102,9 @@
 #include "utils/resowner_private.h"
 #include "utils/varlena.h"
 
+StaticAssertDecl(sizeof(pgoff_t) >= 8,
+				 "pgoff_t not big enough to support large files");
+
 /* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
 #if defined(HAVE_SYNC_FILE_RANGE)
 #define PG_FLUSH_DATA_WORKS 1
-- 
2.40.1

0002-Use-pgoff_t-in-system-call-replacements-on-Windows.patch (text/x-patch)
From 6154e35d35515a7536524b79cb7ccd6a39d41afe Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 5 Mar 2023 11:24:51 +1300
Subject: [PATCH 02/11] Use pgoff_t in system call replacements on Windows.

All modern Unix systems have 64 bit off_t, but Windows does not.  Use
our pgoff_t type in our POSIX-style replacement functions (lseek(),
ftruncate(), pread(), pwrite() etc etc).  Also in closely related
functions like pg_pwrite_zeros().
---
 configure                       |  6 +++
 configure.ac                    |  1 +
 src/common/file_utils.c         |  4 +-
 src/include/common/file_utils.h |  4 +-
 src/include/port.h              |  2 +-
 src/include/port/pg_iovec.h     |  4 +-
 src/include/port/win32_port.h   | 23 ++++++++++--
 src/port/meson.build            |  1 +
 src/port/preadv.c               |  2 +-
 src/port/pwritev.c              |  2 +-
 src/port/win32ftruncate.c       | 65 +++++++++++++++++++++++++++++++++
 src/port/win32pread.c           |  3 +-
 src/port/win32pwrite.c          |  3 +-
 src/tools/msvc/Mkvcbuild.pm     |  1 +
 14 files changed, 106 insertions(+), 15 deletions(-)
 create mode 100644 src/port/win32ftruncate.c

diff --git a/configure b/configure
index 15daccc87f..47ba18491c 100755
--- a/configure
+++ b/configure
@@ -16537,6 +16537,12 @@ esac
  ;;
 esac
 
+  case " $LIBOBJS " in
+  *" win32ftruncate.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS win32ftruncate.$ac_objext"
+ ;;
+esac
+
   case " $LIBOBJS " in
   *" win32getrusage.$ac_objext "* ) ;;
   *) LIBOBJS="$LIBOBJS win32getrusage.$ac_objext"
diff --git a/configure.ac b/configure.ac
index 97f5be6c73..2b3b1b4dca 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1905,6 +1905,7 @@ if test "$PORTNAME" = "win32"; then
   AC_LIBOBJ(win32env)
   AC_LIBOBJ(win32error)
   AC_LIBOBJ(win32fdatasync)
+  AC_LIBOBJ(win32ftruncate)
   AC_LIBOBJ(win32getrusage)
   AC_LIBOBJ(win32link)
   AC_LIBOBJ(win32ntdll)
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 74833c4acb..7a63434bc4 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -469,7 +469,7 @@ get_dirent_type(const char *path,
  * error is returned, it is unspecified how much has been written.
  */
 ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 	struct iovec iov_copy[PG_IOV_MAX];
 	ssize_t		sum = 0;
@@ -538,7 +538,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * is returned with errno set.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size, off_t offset)
+pg_pwrite_zeros(int fd, size_t size, pgoff_t offset)
 {
 	static const PGIOAlignedBlock zbuffer = {{0}};	/* worth BLCKSZ */
 	void	   *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data;
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index b7efa1226d..534277b12d 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -42,8 +42,8 @@ extern PGFileType get_dirent_type(const char *path,
 extern ssize_t pg_pwritev_with_retry(int fd,
 									 const struct iovec *iov,
 									 int iovcnt,
-									 off_t offset);
+									 pgoff_t offset);
 
-extern ssize_t pg_pwrite_zeros(int fd, size_t size, off_t offset);
+extern ssize_t pg_pwrite_zeros(int fd, size_t size, pgoff_t offset);
 
 #endif							/* FILE_UTILS_H */
diff --git a/src/include/port.h b/src/include/port.h
index a88d403483..f7707a390e 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -368,7 +368,7 @@ extern FILE *pgwin32_popen(const char *command, const char *type);
  * When necessary, these routines are provided by files in src/port/.
  */
 
-/* Type to use with fseeko/ftello */
+/* Type to use with lseek/ftruncate/pread/fseeko/ftello */
 #ifndef WIN32					/* WIN32 is handled in port/win32_port.h */
 #define pgoff_t off_t
 #endif
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index 689799c425..c762fab662 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -43,13 +43,13 @@ struct iovec
 #if HAVE_DECL_PREADV
 #define pg_preadv preadv
 #else
-extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset);
 #endif
 
 #if HAVE_DECL_PWRITEV
 #define pg_pwritev pwritev
 #else
-extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset);
 #endif
 
 #endif							/* PG_IOVEC_H */
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 58965e0dfd..c757687386 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -76,11 +76,19 @@
 #undef fstat
 #undef stat
 
+/* and likewise for lseek hack */
+#define lseek microsoft_native_lseek
+#include <io.h>
+#undef lseek
+
+/* and also ftruncate, as defined by MinGW headers with 32 bit offset */
+#define ftruncate mingw_native_ftruncate
+#include <unistd.h>
+#undef ftruncate
+
 /* Must be here to avoid conflicting with prototype in windows.h */
 #define mkdir(a,b)	mkdir(a)
 
-#define ftruncate(a,b)	chsize(a,b)
-
 /* Windows doesn't have fsync() as such, use _commit() */
 #define fsync(fd) _commit(fd)
 
@@ -219,6 +227,7 @@ extern int	_pgfseeko64(FILE *stream, pgoff_t offset, int origin);
 extern pgoff_t _pgftello64(FILE *stream);
 #define fseeko(stream, offset, origin) _pgfseeko64(stream, offset, origin)
 #define ftello(stream) _pgftello64(stream)
+#define lseek(fd, offset, origin) _lseeki64((fd), (offset), (origin))
 #else
 #ifndef fseeko
 #define fseeko(stream, offset, origin) fseeko64(stream, offset, origin)
@@ -226,7 +235,13 @@ extern pgoff_t _pgftello64(FILE *stream);
 #ifndef ftello
 #define ftello(stream) ftello64(stream)
 #endif
+#ifndef lseek
+#define lseek(fd, offset, origin) _lseeki64((fd), (offset), (origin))
 #endif
+#endif
+
+/* 64 bit ftruncate is in win32ftruncate.c */
+extern int ftruncate(int fd, pgoff_t length);
 
 /*
  *	Win32 also doesn't have symlinks, but we can emulate them with
@@ -586,9 +601,9 @@ typedef unsigned short mode_t;
 #endif
 
 /* in port/win32pread.c */
-extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, pgoff_t offset);
 
 /* in port/win32pwrite.c */
-extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, pgoff_t offset);
 
 #endif							/* PG_WIN32_PORT_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index 24416b9bfc..54ce59806a 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -35,6 +35,7 @@ if host_system == 'windows'
     'win32error.c',
     'win32fdatasync.c',
     'win32fseek.c',
+    'win32ftruncate.c',
     'win32getrusage.c',
     'win32link.c',
     'win32ntdll.c',
diff --git a/src/port/preadv.c b/src/port/preadv.c
index e762283e67..6e5e92234f 100644
--- a/src/port/preadv.c
+++ b/src/port/preadv.c
@@ -19,7 +19,7 @@
 #include "port/pg_iovec.h"
 
 ssize_t
-pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 	ssize_t		sum = 0;
 	ssize_t		part;
diff --git a/src/port/pwritev.c b/src/port/pwritev.c
index 519de45037..c430f99806 100644
--- a/src/port/pwritev.c
+++ b/src/port/pwritev.c
@@ -19,7 +19,7 @@
 #include "port/pg_iovec.h"
 
 ssize_t
-pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 	ssize_t		sum = 0;
 	ssize_t		part;
diff --git a/src/port/win32ftruncate.c b/src/port/win32ftruncate.c
new file mode 100644
index 0000000000..5e6d4f3e92
--- /dev/null
+++ b/src/port/win32ftruncate.c
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * win32ftruncate.c
+ *	   Win32 ftruncate() replacement
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * src/port/win32ftruncate.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
+#include "postgres.h"
+#endif
+
+int
+ftruncate(int fd, pgoff_t length)
+{
+	HANDLE		handle;
+	pgoff_t		save_position;
+
+	/*
+	 * We can't use chsize() because it works with 32 bit off_t.  We can't use
+	 * _chsize_s() because it isn't available in MinGW.  So we have to use
+	 * SetEndOfFile(), but that works with the current position.  So we save
+	 * and restore it.
+	 */
+
+	handle = (HANDLE) _get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE)
+	{
+		errno = EBADF;
+		return -1;
+	}
+
+	save_position = lseek(fd, 0, SEEK_CUR);
+	if (save_position < 0)
+		return -1;
+
+	if (lseek(fd, length, SEEK_SET) < 0)
+	{
+		int			save_errno = errno;
+		lseek(fd, save_position, SEEK_SET);
+		errno = save_errno;
+		return -1;
+	}
+
+	if (!SetEndOfFile(handle))
+	{
+		int			save_errno;
+
+		_dosmaperr(GetLastError());
+		save_errno = errno;
+		lseek(fd, save_position, SEEK_SET);
+		errno = save_errno;
+		return -1;
+	}
+	lseek(fd, save_position, SEEK_SET);
+
+	return 0;
+}
diff --git a/src/port/win32pread.c b/src/port/win32pread.c
index 905cf9f42b..6e6366faaa 100644
--- a/src/port/win32pread.c
+++ b/src/port/win32pread.c
@@ -17,7 +17,7 @@
 #include <windows.h>
 
 ssize_t
-pg_pread(int fd, void *buf, size_t size, off_t offset)
+pg_pread(int fd, void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -32,6 +32,7 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
 
 	/* Note that this changes the file position, despite not using it. */
 	overlapped.Offset = offset;
+	overlapped.OffsetHigh = offset >> 32;
 	if (!ReadFile(handle, buf, size, &result, &overlapped))
 	{
 		if (GetLastError() == ERROR_HANDLE_EOF)
diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c
index 5dd10821cf..90dd93dbc5 100644
--- a/src/port/win32pwrite.c
+++ b/src/port/win32pwrite.c
@@ -17,7 +17,7 @@
 #include <windows.h>
 
 ssize_t
-pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
+pg_pwrite(int fd, const void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -32,6 +32,7 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
 
 	/* Note that this changes the file position, despite not using it. */
 	overlapped.Offset = offset;
+	overlapped.OffsetHigh = offset >> 32;
 	if (!WriteFile(handle, buf, size, &result, &overlapped))
 	{
 		_dosmaperr(GetLastError());
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 958206f315..4b96c2bb44 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -113,6 +113,7 @@ sub mkvcbuild
 	  win32env.c win32error.c
 	  win32fdatasync.c
 	  win32fseek.c
+	  win32ftruncate.c
 	  win32getrusage.c
 	  win32gettimeofday.c
 	  win32link.c
-- 
2.40.1

0003-Support-large-files-on-Windows-in-our-VFD-API.patch (text/x-patch)
From 2782d8c1b5c6ff266488536c49cb3a4d4a7b4da6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 5 Mar 2023 11:27:16 +1300
Subject: [PATCH 03/11] Support large files on Windows in our VFD API.

All fd.c interfaces that take off_t now need to use pgoff_t instead,
because we can't use Windows' 32 bit off_t.
---
 src/backend/storage/file/fd.c | 30 +++++++++++++++---------------
 src/include/storage/fd.h      | 20 ++++++++++----------
 2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 053588a302..f5e194a797 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -204,7 +204,7 @@ typedef struct vfd
 	File		nextFree;		/* link to next free VFD, if in freelist */
 	File		lruMoreRecently;	/* doubly linked recency-of-use list */
 	File		lruLessRecently;
-	off_t		fileSize;		/* current size of file (0 if not temporary) */
+	pgoff_t		fileSize;		/* current size of file (0 if not temporary) */
 	char	   *fileName;		/* name of file, or NULL for unused VFD */
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
 	int			fileFlags;		/* open(2) flags for (re)opening the file */
@@ -463,7 +463,7 @@ pg_fdatasync(int fd)
  * offset of 0 with nbytes 0 means that the entire file should be flushed
  */
 void
-pg_flush_data(int fd, off_t offset, off_t nbytes)
+pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes)
 {
 	/*
 	 * Right now file flushing is primarily used to avoid making later
@@ -636,7 +636,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
  * Truncate a file to a given length by name.
  */
 int
-pg_truncate(const char *path, off_t length)
+pg_truncate(const char *path, pgoff_t length)
 {
 #ifdef WIN32
 	int			save_errno;
@@ -1439,7 +1439,7 @@ FileAccess(File file)
  * Called whenever a temporary file is deleted to report its size.
  */
 static void
-ReportTemporaryFileUsage(const char *path, off_t size)
+ReportTemporaryFileUsage(const char *path, pgoff_t size)
 {
 	pgstat_report_tempfile(size);
 
@@ -1989,7 +1989,7 @@ FileClose(File file)
  * to read into.
  */
 int
-FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
 	int			returnCode;
@@ -2017,7 +2017,7 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
 }
 
 void
-FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
+FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info)
 {
 	int			returnCode;
 
@@ -2043,7 +2043,7 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 int
-FileRead(File file, void *buffer, size_t amount, off_t offset,
+FileRead(File file, void *buffer, size_t amount, pgoff_t offset,
 		 uint32 wait_event_info)
 {
 	int			returnCode;
@@ -2099,7 +2099,7 @@ retry:
 }
 
 int
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	int			returnCode;
@@ -2128,7 +2128,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset + amount;
+		pgoff_t		past_write = offset + amount;
 
 		if (past_write > vfdP->fileSize)
 		{
@@ -2160,7 +2160,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + amount;
+			pgoff_t		past_write = offset + amount;
 
 			if (past_write > vfdP->fileSize)
 			{
@@ -2224,7 +2224,7 @@ FileSync(File file, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	int			returnCode;
 	ssize_t		written;
@@ -2269,7 +2269,7 @@ FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 #ifdef HAVE_POSIX_FALLOCATE
 	int			returnCode;
@@ -2305,7 +2305,7 @@ FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	return FileZero(file, offset, amount, wait_event_info);
 }
 
-off_t
+pgoff_t
 FileSize(File file)
 {
 	Assert(FileIsValid(file));
@@ -2316,14 +2316,14 @@ FileSize(File file)
 	if (FileIsNotOpen(file))
 	{
 		if (FileAccess(file) < 0)
-			return (off_t) -1;
+			return (pgoff_t) -1;
 	}
 
 	return lseek(VfdCache[file].fd, 0, SEEK_END);
 }
 
 int
-FileTruncate(File file, off_t offset, uint32 wait_event_info)
+FileTruncate(File file, pgoff_t offset, uint32 wait_event_info)
 {
 	int			returnCode;
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 6791a406fc..a4528428ff 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -110,16 +110,16 @@ extern File PathNameOpenFile(const char *fileName, int fileFlags);
 extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
-extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
-extern int	FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
+extern int	FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern int	FileRead(File file, void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info);
+extern int	FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
-extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
+extern int	FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern int	FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
 
-extern off_t FileSize(File file);
-extern int	FileTruncate(File file, off_t offset, uint32 wait_event_info);
-extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
+extern pgoff_t FileSize(File file);
+extern int	FileTruncate(File file, pgoff_t offset, uint32 wait_event_info);
+extern void FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info);
 extern char *FilePathName(File file);
 extern int	FileGetRawDesc(File file);
 extern int	FileGetRawFlags(File file);
@@ -186,8 +186,8 @@ extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
-extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
-extern int	pg_truncate(const char *path, off_t length);
+extern void pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes);
+extern int	pg_truncate(const char *path, pgoff_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int elevel);
-- 
2.40.1

0004-Use-pgoff_t-instead-of-off_t-in-more-places.patch (text/x-patch)
From ed3a5558a03afaabb7c4c206c053c288c104cb02 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 5 Mar 2023 12:36:55 +1300
Subject: [PATCH 04/11] Use pgoff_t instead of off_t in more places.

XXX  Incomplete
---
 src/backend/access/heap/rewriteheap.c | 2 +-
 src/backend/backup/basebackup.c       | 7 ++++---
 src/backend/storage/file/copydir.c    | 4 ++--
 src/bin/pg_basebackup/receivelog.c    | 2 +-
 src/bin/pg_rewind/file_ops.c          | 4 ++--
 src/bin/pg_rewind/file_ops.h          | 4 ++--
 src/bin/pg_rewind/filemap.c           | 2 ++
 src/bin/pg_rewind/libpq_source.c      | 6 +++---
 src/bin/pg_rewind/local_source.c      | 8 ++++----
 src/bin/pg_rewind/pg_rewind.c         | 2 +-
 src/bin/pg_rewind/rewind_source.h     | 2 +-
 src/include/access/heapam_xlog.h      | 2 +-
 12 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 424958912c..5e5b00d25a 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -194,7 +194,7 @@ typedef struct RewriteMappingFile
 {
 	TransactionId xid;			/* xid that might need to see the row */
 	int			vfd;			/* fd of mappings file */
-	off_t		off;			/* how far have we written yet */
+	pgoff_t		off;			/* how far have we written yet */
 	dclist_head mappings;		/* list of in-memory mappings */
 	char		path[MAXPGPATH];	/* path, for error messages */
 } RewriteMappingFile;
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 5baea7535b..2dcc04fef2 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -95,7 +95,8 @@ static void perform_base_backup(basebackup_options *opt, bbsink *sink);
 static void parse_basebackup_options(List *options, basebackup_options *opt);
 static int	compareWalFileNames(const ListCell *a, const ListCell *b);
 static bool is_checksummed_file(const char *fullpath, const char *filename);
-static int	basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
+static int	basebackup_read_file(int fd, char *buf, size_t nbytes,
+								 pgoff_t offset,
 								 const char *filename, bool partial_read_ok);
 
 /* Was the backup currently in-progress initiated in recovery mode? */
@@ -1488,7 +1489,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	bool		block_retry = false;
 	uint16		checksum;
 	int			checksum_failures = 0;
-	off_t		cnt;
+	pgoff_t		cnt;
 	int			i;
 	pgoff_t		len = 0;
 	char	   *page;
@@ -1827,7 +1828,7 @@ convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
  * Returns the number of bytes read.
  */
 static int
-basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
+basebackup_read_file(int fd, char *buf, size_t nbytes, pgoff_t offset,
 					 const char *filename, bool partial_read_ok)
 {
 	int			rc;
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index e04bc3941a..82f77536b4 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -120,8 +120,8 @@ copy_file(const char *fromfile, const char *tofile)
 	int			srcfd;
 	int			dstfd;
 	int			nbytes;
-	off_t		offset;
-	off_t		flush_offset;
+	pgoff_t		offset;
+	pgoff_t		flush_offset;
 
 	/* Size of copy buffer (read and write requests) */
 #define COPY_BUF_SIZE (8 * BLCKSZ)
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 504d82bef6..e69ad912a2 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -192,7 +192,7 @@ static bool
 close_walfile(StreamCtl *stream, XLogRecPtr pos)
 {
 	char	   *fn;
-	off_t		currpos;
+	pgoff_t		currpos;
 	int			r;
 	char		walfile_name[MAXPGPATH];
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 25996b4da4..3e96b8b0a8 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -85,7 +85,7 @@ close_target_file(void)
 }
 
 void
-write_target_range(char *buf, off_t begin, size_t size)
+write_target_range(char *buf, pgoff_t begin, size_t size)
 {
 	size_t		writeleft;
 	char	   *p;
@@ -203,7 +203,7 @@ remove_target_file(const char *path, bool missing_ok)
 }
 
 void
-truncate_target_file(const char *path, off_t newsize)
+truncate_target_file(const char *path, pgoff_t newsize)
 {
 	char		dstpath[MAXPGPATH];
 	int			fd;
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 427cf8e0b5..41a41cb6cb 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -13,10 +13,10 @@
 #include "filemap.h"
 
 extern void open_target_file(const char *path, bool trunc);
-extern void write_target_range(char *buf, off_t begin, size_t size);
+extern void write_target_range(char *buf, pgoff_t begin, size_t size);
 extern void close_target_file(void);
 extern void remove_target_file(const char *path, bool missing_ok);
-extern void truncate_target_file(const char *path, off_t newsize);
+extern void truncate_target_file(const char *path, pgoff_t newsize);
 extern void create_target(file_entry_t *entry);
 extern void remove_target(file_entry_t *entry);
 extern void sync_target_dir(void);
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index bd5c598e20..a5855ccaa9 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -296,6 +296,8 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
 	BlockNumber blkno_inseg;
 	int			segno;
 
+	/* XXX We need to know if it is segmented! */
+
 	segno = blkno / RELSEG_SIZE;
 	blkno_inseg = blkno % RELSEG_SIZE;
 
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 5f486b2a61..d4832ccb76 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -30,7 +30,7 @@
 typedef struct
 {
 	const char *path;			/* path relative to data directory root */
-	off_t		offset;
+	pgoff_t		offset;
 	size_t		length;
 } fetch_range_request;
 
@@ -65,7 +65,7 @@ static void libpq_traverse_files(rewind_source *source,
 								 process_file_callback_t callback);
 static void libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len);
 static void libpq_queue_fetch_range(rewind_source *source, const char *path,
-									off_t off, size_t len);
+									pgoff_t off, size_t len);
 static void libpq_finish_fetch(rewind_source *source);
 static char *libpq_fetch_file(rewind_source *source, const char *path,
 							  size_t *filesize);
@@ -343,7 +343,7 @@ libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len)
  * Queue up a request to fetch a piece of a file from remote system.
  */
 static void
-libpq_queue_fetch_range(rewind_source *source, const char *path, off_t off,
+libpq_queue_fetch_range(rewind_source *source, const char *path, pgoff_t off,
 						size_t len)
 {
 	libpq_source *src = (libpq_source *) source;
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 4e2a1376c6..fb84309c12 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -32,7 +32,7 @@ static char *local_fetch_file(rewind_source *source, const char *path,
 static void local_queue_fetch_file(rewind_source *source, const char *path,
 								   size_t len);
 static void local_queue_fetch_range(rewind_source *source, const char *path,
-									off_t off, size_t len);
+									pgoff_t off, size_t len);
 static void local_finish_fetch(rewind_source *source);
 static void local_destroy(rewind_source *source);
 
@@ -125,15 +125,15 @@ local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
  * Copy a file from source to target, starting at 'off', for 'len' bytes.
  */
 static void
-local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
+local_queue_fetch_range(rewind_source *source, const char *path, pgoff_t off,
 						size_t len)
 {
 	const char *datadir = ((local_source *) source)->datadir;
 	PGIOAlignedBlock buf;
 	char		srcpath[MAXPGPATH];
 	int			srcfd;
-	off_t		begin = off;
-	off_t		end = off + len;
+	pgoff_t		begin = off;
+	pgoff_t		end = off + len;
 
 	snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);
 
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index f7f3b8227f..500842e169 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -566,7 +566,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 		{
 			datapagemap_iterator_t *iter;
 			BlockNumber blkno;
-			off_t		offset;
+			pgoff_t		offset;
 
 			iter = datapagemap_iterate(&entry->target_pages_to_overwrite);
 			while (datapagemap_next(iter, &blkno))
diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h
index 69ad0e495f..e17526ce86 100644
--- a/src/bin/pg_rewind/rewind_source.h
+++ b/src/bin/pg_rewind/rewind_source.h
@@ -45,7 +45,7 @@ typedef struct rewind_source
 	 * queue and execute all requests.
 	 */
 	void		(*queue_fetch_range) (struct rewind_source *, const char *path,
-									  off_t offset, size_t len);
+									  pgoff_t offset, size_t len);
 
 	/*
 	 * Like queue_fetch_range(), but requests replacing the whole local file
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a038450787..d82cd027f4 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -396,7 +396,7 @@ typedef struct xl_heap_rewrite_mapping
 	TransactionId mapped_xid;	/* xid that might need to see the row */
 	Oid			mapped_db;		/* DbOid or InvalidOid for shared rels */
 	Oid			mapped_rel;		/* Oid of the mapped relation */
-	off_t		offset;			/* How far have we written so far */
+	pgoff_t		offset;			/* How far have we written so far */
 	uint32		num_mappings;	/* Number of in-memory mappings */
 	XLogRecPtr	start_lsn;		/* Insert LSN at begin of rewrite */
 } xl_heap_rewrite_mapping;
-- 
2.40.1

0005-Use-large-files-for-relation-storage.patch (text/x-patch)
From d22479403d02944e6c2569897816137f8582c6f1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 5 Mar 2023 11:51:15 +1300
Subject: [PATCH 05/11] Use large files for relation storage.

Traditionally we broke files up into 1GB segments (configurable) to
support older OSes before the industry transition to "large files" in
the mid 90s.  These days, the only remaining consideration on living
operating systems is that Windows still has 32 bit types in a few
interfaces, but we deal with that by being careful to use pgoff_t
everywhere instead of off_t.

Having many segment files creates extra work for the kernel, which must
manage many more descriptors, and extra work for PostgreSQL, which must
close and reopen them to stay under per-process descriptor limits.

With this patch, all new relations will be non-segmented.  The only way
to have a segmented relation is to inherit it via pg_upgrade.  For some
number of releases, legacy segmented relations will be supported, and
can be upgraded to non-segmented format by any operation that rewrites
the relation, creating a new relfilenode (VACUUM FULL, etc).
---
 src/backend/storage/smgr/md.c | 227 +++++++++++++++++++++++++++-------
 src/include/storage/smgr.h    |   1 +
 2 files changed, 181 insertions(+), 47 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e982a8dd7f..005a7a15bf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -42,6 +42,14 @@
 #include "utils/memutils.h"
 
 /*
+ *  The magnetic disk storage manager assumes that the operating system
+ *  supports "large files".  Historically, this wasn't the case, so there is
+ *  support for "segmented" files that were upgraded from earlier releases.
+ *  A future release may eventually drop support for those.  See
+ *  md_fork_is_segmented() for details.
+ *
+ *  The following paragraphs describe the historical behavior.
+ *
  *	The magnetic disk storage manager keeps track of open file
  *	descriptors in its own descriptor pool.  This is done to make it
  *	easier to support relations that are larger than the operating
@@ -119,6 +127,9 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
 /* don't try to open a segment, if not already open */
 #define EXTENSION_DONT_OPEN			(1 << 5)
 
+#define MD_FORK_SEGMENTED_UNKNOWN	'u'
+#define MD_FORK_SEGMENTED_FALSE		'f'
+#define MD_FORK_SEGMENTED_TRUE		't'
 
 /* local routines */
 static void mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum,
@@ -139,8 +150,11 @@ static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forknum,
 							  BlockNumber segno, int oflags);
 static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
 							 BlockNumber blkno, bool skipFsync, int behavior);
+static pgoff_t getseekpos(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum);
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
+static bool md_fork_is_segmented(SMgrRelation reln, ForkNumber forknum);
 
 static inline int
 _mdfd_open_flags(void)
@@ -459,7 +473,7 @@ void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
 
@@ -486,10 +500,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 						InvalidBlockNumber)));
 
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
-
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
-
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	seekpos = getseekpos(reln, forknum, blocknum);
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
@@ -511,7 +522,8 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	if (!skipFsync && !SmgrIsTemp(reln))
 		register_dirty_segment(reln, forknum, v);
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	if (md_fork_is_segmented(reln, forknum))
+		Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 }
 
 /*
@@ -549,20 +561,30 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 
 	while (remblocks > 0)
 	{
-		BlockNumber	segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
-		off_t		seekpos = (off_t) BLCKSZ * segstartblock;
+		BlockNumber	segstartblock;
+		pgoff_t		seekpos;
 		int			numblocks;
 
-		if (segstartblock + remblocks > RELSEG_SIZE)
-			numblocks = RELSEG_SIZE - segstartblock;
+		if (md_fork_is_segmented(reln, forknum))
+		{
+			segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+			seekpos = (pgoff_t) BLCKSZ * segstartblock;
+			if (segstartblock + remblocks > RELSEG_SIZE)
+				numblocks = RELSEG_SIZE - segstartblock;
+			else
+				numblocks = remblocks;
+			Assert(segstartblock < RELSEG_SIZE);
+			Assert(segstartblock + numblocks <= RELSEG_SIZE);
+		}
 		else
+		{
+			segstartblock = curblocknum;
+			seekpos = (pgoff_t) BLCKSZ * segstartblock;
 			numblocks = remblocks;
+		}
 
 		v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
 
-		Assert(segstartblock < RELSEG_SIZE);
-		Assert(segstartblock + numblocks <= RELSEG_SIZE);
-
 		/*
 		 * If available and useful, use posix_fallocate() (via FileAllocate())
 		 * to extend the relation. That's often more efficient than using
@@ -579,7 +601,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			int			ret;
 
 			ret = FileFallocate(v->mdfd_vfd,
-								seekpos, (off_t) BLCKSZ * numblocks,
+								seekpos, (pgoff_t) BLCKSZ * numblocks,
 								WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret != 0)
 			{
@@ -602,7 +624,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			 * zeroed buffer for the whole length of the extension.
 			 */
 			ret = FileZero(v->mdfd_vfd,
-						   seekpos, (off_t) BLCKSZ * numblocks,
+						   seekpos, (pgoff_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret < 0)
 				ereport(ERROR,
@@ -615,7 +637,8 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		if (!skipFsync && !SmgrIsTemp(reln))
 			register_dirty_segment(reln, forknum, v);
 
-		Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+		if (md_fork_is_segmented(reln, forknum))
+			Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 
 		remblocks -= numblocks;
 		curblocknum += numblocks;
@@ -644,7 +667,6 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 		return &reln->md_seg_fds[forknum][0];
 
 	path = relpath(reln->smgr_rlocator, forknum);
-
 	fd = PathNameOpenFile(path, _mdfd_open_flags());
 
 	if (fd < 0)
@@ -667,7 +689,8 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
 
-	Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
+	if (md_fork_is_segmented(reln, forknum))
+		Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
 
 	return mdfd;
 }
@@ -680,7 +703,10 @@ mdopen(SMgrRelation reln)
 {
 	/* mark it not open */
 	for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		reln->md_segmented[forknum] = MD_FORK_SEGMENTED_UNKNOWN;
 		reln->md_num_open_segs[forknum] = 0;
+	}
 }
 
 /*
@@ -713,7 +739,7 @@ bool
 mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
 #ifdef USE_PREFETCH
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	MdfdVec    *v;
 
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
@@ -723,9 +749,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 	if (v == NULL)
 		return false;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
-
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	seekpos = getseekpos(reln, forknum, blocknum);
 
 	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
 #endif							/* USE_PREFETCH */
@@ -752,10 +776,8 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 	while (nblocks > 0)
 	{
 		BlockNumber nflush = nblocks;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
-		int			segnum_start,
-					segnum_end;
 
 		v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
 						 EXTENSION_DONT_OPEN);
@@ -770,20 +792,26 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		if (!v)
 			return;
 
-		/* compute offset inside the current segment */
-		segnum_start = blocknum / RELSEG_SIZE;
+		if (md_fork_is_segmented(reln, forknum))
+		{
+			int			segnum_start,
+						segnum_end;
+
+			/* compute offset inside the current segment */
+			segnum_start = blocknum / RELSEG_SIZE;
 
-		/* compute number of desired writes within the current segment */
-		segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE;
-		if (segnum_start != segnum_end)
-			nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE));
+			/* compute number of desired writes within the current segment */
+			segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE;
+			if (segnum_start != segnum_end)
+				nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(nflush >= 1);
-		Assert(nflush <= nblocks);
+			Assert(nflush >= 1);
+			Assert(nflush <= nblocks);
+		}
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = getseekpos(reln, forknum, blocknum);
 
-		FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
+		FileWriteback(v->mdfd_vfd, seekpos, (pgoff_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
 
 		nblocks -= nflush;
 		blocknum += nflush;
@@ -797,7 +825,7 @@ void
 mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	   void *buffer)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
 
@@ -814,9 +842,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
-
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	seekpos = getseekpos(reln, forknum, blocknum);
 
 	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
 
@@ -866,7 +892,7 @@ void
 mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		const void *buffer, bool skipFsync)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
 
@@ -888,9 +914,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
-
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	seekpos = getseekpos(reln, forknum, blocknum);
 
 	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
 
@@ -962,6 +986,13 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	for (;;)
 	{
 		nblocks = _mdnblocks(reln, forknum, v);
+
+		if (!md_fork_is_segmented(reln, forknum))
+		{
+			Assert(segno == 0);
+			return nblocks;
+		}
+
 		if (nblocks > ((BlockNumber) RELSEG_SIZE))
 			elog(FATAL, "segment too big");
 		if (nblocks < ((BlockNumber) RELSEG_SIZE))
@@ -1013,6 +1044,25 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	if (nblocks == curnblk)
 		return;					/* no work */
 
+	if (!md_fork_is_segmented(reln, forknum))
+	{
+		MdfdVec    *v;
+
+		Assert(reln->md_num_open_segs[forknum] == 1);
+		v = &reln->md_seg_fds[forknum][0];
+
+		if (FileTruncate(v->mdfd_vfd, (pgoff_t) nblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\" to %u blocks: %m",
+							FilePathName(v->mdfd_vfd),
+							nblocks)));
+		if (!SmgrIsTemp(reln))
+			register_dirty_segment(reln, forknum, v);
+
+		return;
+	}
+
 	/*
 	 * Truncate segments, starting at the last one. Starting at the end makes
 	 * managing the memory for the fd array easier, should there be errors.
@@ -1058,7 +1108,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 			 */
 			BlockNumber lastsegblocks = nblocks - priorblocks;
 
-			if (FileTruncate(v->mdfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+			if (FileTruncate(v->mdfd_vfd, (pgoff_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not truncate file \"%s\" to %u blocks: %m",
@@ -1396,7 +1446,10 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 		   (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL |
 			EXTENSION_DONT_OPEN));
 
-	targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
+	if (md_fork_is_segmented(reln, forknum))
+		targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
+	else
+		targetseg = 0;
 
 	/* if an existing and opened segment, we're done */
 	if (targetseg < reln->md_num_open_segs[forknum])
@@ -1433,7 +1486,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 		Assert(nextsegno == v->mdfd_segno + 1);
 
-		if (nblocks > ((BlockNumber) RELSEG_SIZE))
+		if (md_fork_is_segmented(reln, forknum) &&
+			nblocks > ((BlockNumber) RELSEG_SIZE))
 			elog(FATAL, "segment too big");
 
 		if ((behavior & EXTENSION_CREATE) ||
@@ -1493,6 +1547,9 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 							blkno, nblocks)));
 		}
 
+		if (!md_fork_is_segmented(reln, forknum))
+			break;
+
 		v = _mdfd_openseg(reln, forknum, nextsegno, flags);
 
 		if (v == NULL)
@@ -1511,13 +1568,22 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 	return v;
 }
 
+static pgoff_t
+getseekpos(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	if (md_fork_is_segmented(reln, forknum))
+		return (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+	return (pgoff_t) BLCKSZ * blocknum;
+}
+
 /*
  * Get number of blocks present in a single disk file
  */
 static BlockNumber
 _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
-	off_t		len;
+	pgoff_t		len;
 
 	len = FileSize(seg->mdfd_vfd);
 	if (len < 0)
@@ -1618,3 +1684,70 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
 	 */
 	return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
 }
+
+/*
+ * Is this fork in legacy segmented format, inherited from an earlier release
+ * via pg_upgrade?
+ */
+bool
+md_fork_is_segmented(SMgrRelation reln, ForkNumber forknum)
+{
+	char		path_probe[MAXPGPATH];
+	char	   *path;
+
+	Assert(forknum >= 0 && forknum <= MAX_FORKNUM);
+
+	/* Fast return if we have the answer cached. */
+	if (reln->md_segmented[forknum] == MD_FORK_SEGMENTED_FALSE)
+		return false;
+	if (reln->md_segmented[forknum] == MD_FORK_SEGMENTED_TRUE)
+		return true;
+
+	Assert(reln->md_segmented[forknum] == MD_FORK_SEGMENTED_UNKNOWN);
+
+	/*
+	 * All backends must agree, using only clues from the file system, and the
+	 * answer must not change for as long as this relation exists.  The
+	 * correctness of this strategy depends on the following properties:
+	 *
+	 * 1.  When segmented forks are truncated, their higher numbered segments
+	 *	   are truncated to size zero, but they still exist.  That is, higher
+	 *	   segments won't be unlinked for as long as the relation exists.
+	 *
+	 * 2.  We don't create new segmented relations, so the only way they can
+	 *	   exist is if we inherited them via pg_upgrade from an earlier
+	 *	   release.
+	 *
+	 * 3.  Relations that never had more than one segment and were pg_upgraded
+	 *	   are indistinguishable from newly created (non-segmented) relations.
+	 *
+	 * 4.  If the relfilenode is recycled for a later relation, all backends
+	 *	   will close all segments first before potentially reopening the next
+	 *	   generation, either via the sinval or ProcSignalBarrier cache
+	 *	   invalidation system.
+	 *
+	 * Therefore, it is safe for every backend to determine whether the fork is
+	 * segmented by checking the existence of a ".1" file.
+	 */
+	path = relpath(reln->smgr_rlocator, forknum);
+	snprintf(path_probe, sizeof(path_probe), "%s.1", path);
+	if (access(path_probe, F_OK) == 0)
+	{
+		pfree(path);
+		reln->md_segmented[forknum] = MD_FORK_SEGMENTED_TRUE;
+		return true;
+	}
+	else if (errno == ENOENT)
+	{
+		pfree(path);
+		reln->md_segmented[forknum] = MD_FORK_SEGMENTED_FALSE;
+		return false;
+	}
+	pfree(path);
+
+	ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not read access in file \"%s\": %m",
+					path_probe)));
+	pg_unreachable();
+}
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..e352a035be 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -65,6 +65,7 @@ typedef struct SMgrRelationData
 	 * for md.c; per-fork arrays of the number of open segments
 	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
 	 */
+	char		md_segmented[MAX_FORKNUM + 1];
 	int			md_num_open_segs[MAX_FORKNUM + 1];
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
-- 
2.40.1

0006-Detect-copy_file_range-function.patch (text/x-patch)
From d1ffce7141cd34eff9d0d3f65f5e18f472b6d813 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 30 Apr 2023 10:38:46 +1200
Subject: [PATCH 06/11] Detect copy_file_range() function.

---
 configure                  | 2 +-
 configure.ac               | 1 +
 meson.build                | 1 +
 src/include/pg_config.h.in | 3 +++
 src/tools/msvc/Solution.pm | 1 +
 5 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/configure b/configure
index 47ba18491c..7d351b9614 100755
--- a/configure
+++ b/configure
@@ -15700,7 +15700,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols copyfile getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
+for ac_func in backtrace_symbols copyfile copy_file_range getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2b3b1b4dca..ddb82e9433 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1794,6 +1794,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
 	backtrace_symbols
 	copyfile
+	copy_file_range
 	getifaddrs
 	getpeerucred
 	inet_pton
diff --git a/meson.build b/meson.build
index 096044628c..c06e4f9290 100644
--- a/meson.build
+++ b/meson.build
@@ -2404,6 +2404,7 @@ func_checks = [
   ['backtrace_symbols', {'dependencies': [execinfo_dep]}],
   ['clock_gettime', {'dependencies': [rt_dep, posix4_dep], 'define': false}],
   ['copyfile'],
+  ['copy_file_range'],
   # gcc/clang's sanitizer helper library provides dlopen but not dlsym, thus
   # when enabling asan the dlopen check doesn't notice that -ldl is actually
   # required. Just checking for dlsym() ought to suffice.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 6d572c3820..0b26836f68 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -85,6 +85,9 @@
 /* Define to 1 if you have the <copyfile.h> header file. */
 #undef HAVE_COPYFILE_H
 
+/* Define to 1 if you have the `copy_file_range' function. */
+#undef HAVE_COPY_FILE_RANGE
+
 /* Define to 1 if you have the <crtdefs.h> header file. */
 #undef HAVE_CRTDEFS_H
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index ef10cda576..671d958af7 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -230,6 +230,7 @@ sub GenerateFiles
 		HAVE_COMPUTED_GOTO         => undef,
 		HAVE_COPYFILE              => undef,
 		HAVE_COPYFILE_H            => undef,
+		HAVE_COPY_FILE_RANGE       => undef,
 		HAVE_CRTDEFS_H             => undef,
 		HAVE_CRYPTO_LOCK           => undef,
 		HAVE_DECL_FDATASYNC        => 0,
-- 
2.40.1

Attachment: 0007-Use-copy_file_range-to-implement-copy_file.patch (text/x-patch)
From d89cbae1851627be4e146efedc92ba9d0a67ad6a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 30 Apr 2023 11:10:08 +1200
Subject: [PATCH 07/11] Use copy_file_range() to implement copy_file().

If copy_file_range() is available, use it to implement copy_file(), so
that the operating system has opportunities for efficient copying,
block cloning and pushdown.  This affects the commands CREATE DATABASE
STRATEGY=FILE_COPY and ALTER DATABASE SET TABLESPACE, which perform bulk
file copies.

On older Linux systems, copy_file_range() might fail with EXDEV, so we
look out for that and fall back to the traditional read/write loop.

XXX Should we also let the user opt out?
---
 doc/src/sgml/monitoring.sgml            |  4 ++
 src/backend/storage/file/copydir.c      | 94 +++++++++++++++++++------
 src/backend/utils/activity/wait_event.c |  3 +
 src/include/utils/wait_event.h          |  1 +
 4 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 99f7f95c39..2161b32b17 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1317,6 +1317,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for a write to update the <filename>pg_control</filename>
        file.</entry>
      </row>
+     <row>
+      <entry><literal>CopyFileRange</literal></entry>
+      <entry>Waiting for a range to be copied during a file copy operation.</entry>
+     </row>
      <row>
       <entry><literal>CopyFileRead</literal></entry>
       <entry>Waiting for a read during a file copy operation.</entry>
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 82f77536b4..497d357d8c 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -126,6 +126,14 @@ copy_file(const char *fromfile, const char *tofile)
 	/* Size of copy buffer (read and write requests) */
 #define COPY_BUF_SIZE (8 * BLCKSZ)
 
+	/*
+	 * Size of ranges when using copy_file_range().  We could in theory just
+	 * use the whole file size, but we want to check for interrupts
+	 * periodically while copying.  We don't want to make it too small though,
+	 * to give the operating system the chance to clone large extents.
+	 */
+#define COPY_FILE_RANGE_CHUNK_SIZE (1024 * 1024)
+
 	/*
 	 * Size of data flush requests.  It seems beneficial on most platforms to
 	 * do this every 1MB or so.  But macOS, at least with early releases of
@@ -138,8 +146,13 @@ copy_file(const char *fromfile, const char *tofile)
 #define FLUSH_DISTANCE (1024 * 1024)
 #endif
 
+#ifdef HAVE_COPY_FILE_RANGE
+	/* Don't allocate the buffer unless we have to fall back to read/write. */
+	buffer = NULL;
+#else
 	/* Use palloc to ensure we get a maxaligned buffer */
 	buffer = palloc(COPY_BUF_SIZE);
+#endif
 
 	/*
 	 * Open the files
@@ -176,27 +189,67 @@ copy_file(const char *fromfile, const char *tofile)
 			flush_offset = offset;
 		}
 
-		pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_READ);
-		nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
-		pgstat_report_wait_end();
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m", fromfile)));
-		if (nbytes == 0)
-			break;
-		errno = 0;
-		pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
-		if ((int) write(dstfd, buffer, nbytes) != nbytes)
+		nbytes = 0;			/* silence compiler */
+
+#ifdef HAVE_COPY_FILE_RANGE
+		if (buffer == NULL)
+		{
+			pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_RANGE);
+			nbytes = copy_file_range(srcfd, NULL, dstfd, NULL,
+									 COPY_FILE_RANGE_CHUNK_SIZE, 0);
+			pgstat_report_wait_end();
+
+			if (nbytes < 0)
+			{
+				if (errno == EXDEV)
+				{
+					/*
+					 * Linux < 5.3 fails like this for cross-filesystem copies.
+					 * Allocate the buffer to fall back to read/write mode.
+					 */
+					buffer = palloc(COPY_BUF_SIZE);
+				}
+				else
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not copy to file \"%s\": %m", tofile)));
+			}
+		}
+#endif
+
+		if (buffer)
 		{
-			/* if write didn't set errno, assume problem is no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tofile)));
+			pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_READ);
+			nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
+			pgstat_report_wait_end();
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read file \"%s\": %m", fromfile)));
+
+			if (nbytes > 0)
+			{
+				errno = 0;
+				pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+				if ((int) write(dstfd, buffer, nbytes) != nbytes)
+				{
+					/*
+					 * If write didn't set errno, assume problem is no disk
+					 * space.
+					 */
+					if (errno == 0)
+						errno = ENOSPC;
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not write to file \"%s\": %m", tofile)));
+				}
+				pgstat_report_wait_end();
+			}
 		}
-		pgstat_report_wait_end();
+
+		if (nbytes == 0)
+			break;
 	}
 
 	if (offset > flush_offset)
@@ -212,5 +265,6 @@ copy_file(const char *fromfile, const char *tofile)
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", fromfile)));
 
-	pfree(buffer);
+	if (buffer)
+		pfree(buffer);
 }
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 7940d64639..9c3cd088c0 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -567,6 +567,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
 			event_name = "ControlFileWriteUpdate";
 			break;
+		case WAIT_EVENT_COPY_FILE_RANGE:
+			event_name = "CopyFileRange";
+			break;
 		case WAIT_EVENT_COPY_FILE_READ:
 			event_name = "CopyFileRead";
 			break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 518d3b0a1f..517de1544b 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -172,6 +172,7 @@ typedef enum
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
 	WAIT_EVENT_CONTROL_FILE_WRITE,
 	WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+	WAIT_EVENT_COPY_FILE_RANGE,
 	WAIT_EVENT_COPY_FILE_READ,
 	WAIT_EVENT_COPY_FILE_WRITE,
 	WAIT_EVENT_DATA_FILE_EXTEND,
-- 
2.40.1

Attachment: 0008-Teach-copy_file-to-concatenate-segmented-files.patch (text/x-patch)
From f83a0a9f80614e18b780e7636e5c2e567b2f701e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 30 Apr 2023 15:36:20 +1200
Subject: [PATCH 08/11] Teach copy_file() to concatenate segmented files.

This means that relations are automatically converted to large file
format during CREATE DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ...
SET TABLESPACE operations.
---
 src/backend/storage/file/copydir.c | 43 +++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 497d357d8c..0b472f1ac2 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -71,7 +71,19 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 				copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
+		{
+			const char *s;
+
+			/*
+			 * Skip legacy segment files ending in ".N".  copy_file() will deal
+			 * with those.
+			 */
+			s = strrchr(fromfile, '.');
+			if (s && strspn(s + 1, "0123456789") == strlen(s + 1))
+				continue;
+
 			copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -117,6 +129,7 @@ void
 copy_file(const char *fromfile, const char *tofile)
 {
 	char	   *buffer;
+	int			segno;
 	int			srcfd;
 	int			dstfd;
 	int			nbytes;
@@ -154,6 +167,8 @@ copy_file(const char *fromfile, const char *tofile)
 	buffer = palloc(COPY_BUF_SIZE);
 #endif
 
+	segno = 0;
+
 	/*
 	 * Open the files
 	 */
@@ -248,8 +263,34 @@ copy_file(const char *fromfile, const char *tofile)
 			}
 		}
 
+		/*
+		 * If we ran out of source data on the expected boundary of a legacy
+		 * relation file segment, try opening the next segment.
+		 */
 		if (nbytes == 0)
-			break;
+		{
+			char		nextpath[MAXPGPATH];
+			int			nextfd;
+
+			if (offset % (RELSEG_SIZE * BLCKSZ) != 0)
+				break;
+
+			snprintf(nextpath, sizeof(nextpath), "%s.%d", fromfile, ++segno);
+			nextfd = OpenTransientFile(nextpath, O_RDONLY | PG_BINARY);
+			if (nextfd < 0)
+			{
+				if (errno == ENOENT)
+					break;
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not open file \"%s\": %m", nextpath)));
+			}
+			if (CloseTransientFile(srcfd) != 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m", fromfile)));
+			srcfd = nextfd;
+		}
 	}
 
 	if (offset > flush_offset)
-- 
2.40.1

Attachment: 0009-Use-copy_file_range-in-pg_upgrade.patch (text/x-patch)
From b435220922d7cd916f1b7acce313c8174738991c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 30 Apr 2023 14:45:45 +1200
Subject: [PATCH 09/11] Use copy_file_range() in pg_upgrade.

This gives the kernel the opportunity to copy or clone efficiently.
We watch out for EXDEV and fall back to read/write for old Linux
kernels.

XXX Should we also let the user opt out?
---
 src/bin/pg_upgrade/file.c | 65 ++++++++++++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 14 deletions(-)

diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index d173602882..836b2bbbd0 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,6 +9,7 @@
 
 #include "postgres_fe.h"
 
+#include <limits.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #ifdef HAVE_COPYFILE_H
@@ -98,32 +99,68 @@ copyFile(const char *src, const char *dst,
 	/* copy in fairly large chunks for best efficiency */
 #define COPY_BUF_SIZE (50 * BLCKSZ)
 
+#ifdef HAVE_COPY_FILE_RANGE
+	buffer = NULL;
+#else
 	buffer = (char *) pg_malloc(COPY_BUF_SIZE);
+#endif
 
 	/* perform data copying i.e read src source, write to destination */
 	while (true)
 	{
-		ssize_t		nbytes = read(src_fd, buffer, COPY_BUF_SIZE);
+		ssize_t		nbytes = 0;
 
-		if (nbytes < 0)
-			pg_fatal("error while copying relation \"%s.%s\": could not read file \"%s\": %s",
-					 schemaName, relName, src, strerror(errno));
+#ifdef HAVE_COPY_FILE_RANGE
+		if (buffer == NULL)
+		{
+			nbytes = copy_file_range(src_fd, NULL, dest_fd, NULL, SSIZE_MAX, 0);
+			if (nbytes < 0)
+			{
+				if (errno == EXDEV)
+				{
+					/* Linux < 5.3 might fail.  Fall back to read/write. */
+					buffer = (char *) pg_malloc(COPY_BUF_SIZE);
+				}
+				else
+				{
+					pg_fatal("error while copying relation \"%s.%s\": could not read file \"%s\": %s",
+							 schemaName, relName, src, strerror(errno));

-		if (nbytes == 0)
-			break;
+				}
+			}
+		}
+#endif
 
-		errno = 0;
-		if (write(dest_fd, buffer, nbytes) != nbytes)
+		if (buffer)
 		{
-			/* if write didn't set errno, assume problem is no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-			pg_fatal("error while copying relation \"%s.%s\": could not write file \"%s\": %s",
-					 schemaName, relName, dst, strerror(errno));
+			nbytes = read(src_fd, buffer, COPY_BUF_SIZE);
+
+			if (nbytes < 0)
+				pg_fatal("error while copying relation \"%s.%s\": could not read file \"%s\": %s",
+						 schemaName, relName, src, strerror(errno));
+			if (nbytes > 0)
+			{
+				errno = 0;
+				if (write(dest_fd, buffer, nbytes) != nbytes)
+				{
+					/*
+					 * If write didn't set errno, assume problem is no disk
+					 * space.
+					 */
+					if (errno == 0)
+						errno = ENOSPC;
+					pg_fatal("error while copying relation \"%s.%s\": could not write file \"%s\": %s",
+							 schemaName, relName, dst, strerror(errno));
+				}
+			}
 		}
+
+		if (nbytes == 0)
+			break;
 	}
 
-	pg_free(buffer);
+	if (buffer)
+		pg_free(buffer);
 	close(src_fd);
 	close(dest_fd);
 
-- 
2.40.1

Attachment: 0010-Teach-pg_upgrade-to-concatenate-segmented-files.patch (text/x-patch)
From 8683941485516e594174f8cb04d437962e4698f8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 30 Apr 2023 16:05:46 +1200
Subject: [PATCH 10/11] Teach pg_upgrade to concatenate segmented files.

When using copy mode, segmented relation forks are automatically
concatenated into modern large format.

When using hard link or clone mode, segment files continue to exist in
the destination cluster.

We lose the ability to use the Windows CopyFile() optimization, because
it doesn't support concatenation.  XXX Could be restored as a way of
copying segment 0.

XXX Allow user to opt out of concatenation for copy mode too?
---
 src/bin/pg_upgrade/file.c          | 40 ++++++++++++++++++++----------
 src/bin/pg_upgrade/relfilenumber.c |  4 +++
 2 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 836b2bbbd0..b4e991f95d 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -82,10 +82,11 @@ void
 copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName)
 {
-#ifndef WIN32
 	int			src_fd;
 	int			dest_fd;
 	char	   *buffer;
+	pgoff_t		total_bytes = 0;
+	int			segno = 0;
 
 	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
 		pg_fatal("error while copying relation \"%s.%s\": could not open file \"%s\": %s",
@@ -155,25 +156,38 @@ copyFile(const char *src, const char *dst,
 			}
 		}
 
+		total_bytes += nbytes;
+
 		if (nbytes == 0)
-			break;
+		{
+			char next_path[MAXPGPATH];
+			int next_fd;
+
+			/* If not at a segment boundary size, this must be the end. */
+			if (total_bytes % (RELSEG_SIZE * BLCKSZ) != 0)
+				break;
+
+			/* Is there another segment? */
+			snprintf(next_path, sizeof(next_path), "%s.%d", src, ++segno);
+			next_fd = open(next_path, O_RDONLY | PG_BINARY, 0);
+			if (next_fd < 0)
+			{
+				if (errno == ENOENT)
+					break;
+				pg_fatal("error while copying relation \"%s.%s\": could not open file \"%s\": %s",
+						 schemaName, relName, next_path, strerror(errno));
+			}
+
+			/* Yes.  Start copying from that one. */
+			close(src_fd);
+			src_fd = next_fd;
+		}
 	}
 
 	if (buffer)
 		pg_free(buffer);
 	close(src_fd);
 	close(dest_fd);
-
-#else							/* WIN32 */
-
-	if (CopyFile(src, dst, true) == 0)
-	{
-		_dosmaperr(GetLastError());
-		pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s",
-				 schemaName, relName, src, dst, strerror(errno));
-	}
-
-#endif							/* WIN32 */
 }
 
 
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 34bc9c1504..ea2abfb00f 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -185,6 +185,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 	 */
 	for (segno = 0;; segno++)
 	{
+		/* Copy mode knows how to find higher numbered segments itself. */
+		if (user_opts.transfer_mode == TRANSFER_MODE_COPY && segno > 0)
+			break;
+
 		if (segno == 0)
 			extent_suffix[0] = '\0';
 		else
-- 
2.40.1

Attachment: 0011-Teach-basebackup-to-concatenate-segmented-files.patch (text/x-patch)
From fc3316b064486d5c15009fc98771a0686914609a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 2 May 2023 11:15:10 +1200
Subject: [PATCH 11/11] Teach basebackup to concatenate segmented files.

Since basebackups have to read and write all relations, they have an
opportunity to convert to large file format on the fly.  Take it.

XXX There may be some bugs hiding in here when sizeof(ssize_t) <
sizeof(pgoff_t)?
---
 src/backend/backup/basebackup.c | 92 +++++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 21 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 2dcc04fef2..e2534895eb 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1339,6 +1339,17 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 			continue;			/* don't recurse into pg_wal */
 		}
 
+		/*
+		 * Skip relation segment files because sendFile() will find them when
+		 * called for the initial segment.
+		 */
+		if (isDbDir)
+		{
+			const char *s = strrchr(de->d_name, '.');
+			if (s && strspn(s + 1, "0123456789") == strlen(s + 1))
+				continue;
+		}
+
 		/* Allow symbolic links in pg_tblspc only */
 		if (strcmp(path, "./pg_tblspc") == 0 && S_ISLNK(statbuf.st_mode))
 		{
@@ -1476,6 +1487,10 @@ is_checksummed_file(const char *fullpath, const char *filename)
  * If dboid is anything other than InvalidOid then any checksum failures
  * detected will get reported to the cumulative stats system.
  *
+ * If the file is multi-segmented, the segments are concatenated and sent as
+ * one file.  On return, statbuf->st_size contains the complete size of the
+ * single sent file.
+ *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
  */
@@ -1495,10 +1510,34 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	char	   *page;
 	PageHeader	phdr;
 	int			segmentno = 0;
-	char	   *segmentpath;
+	int			nsegments = 1;
 	bool		verify_checksum = false;
 	pg_checksum_context checksum_ctx;
 
+	/*
+	 * This function is only called for the head segment of segmented files,
+	 * but we want to concatenate it on the fly into a large file.  If we
+	 * have reached a segment boundary, we'll try to open the next segment.
+	 * We count the segments and sum their sizes into statbuf->st_size.
+	 */
+	while (statbuf->st_size == (pgoff_t) nsegments * RELSEG_SIZE * BLCKSZ)
+	{
+		char nextpath[MAXPGPATH];
+		struct stat nextstat;
+
+		snprintf(nextpath, sizeof(nextpath), "%s.%d", readfilename, nsegments);
+		if (lstat(nextpath, &nextstat) < 0)
+		{
+			if (errno == ENOENT)
+				break;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", nextpath)));
+		}
+		++nsegments;								/* count segment */
+		statbuf->st_size += nextstat.st_size;		/* sum size */
+	}
+
 	if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
 		elog(ERROR, "could not initialize checksum of file \"%s\"",
 			 readfilename);
@@ -1527,23 +1566,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 		filename = last_dir_separator(readfilename) + 1;
 
 		if (is_checksummed_file(readfilename, filename))
-		{
 			verify_checksum = true;
-
-			/*
-			 * Cut off at the segment boundary (".") to get the segment number
-			 * in order to mix it into the checksum.
-			 */
-			segmentpath = strstr(filename, ".");
-			if (segmentpath != NULL)
-			{
-				segmentno = atoi(segmentpath + 1);
-				if (segmentno == 0)
-					ereport(ERROR,
-							(errmsg("invalid segment number %d in file \"%s\"",
-									segmentno, filename)));
-			}
-		}
 	}
 
 	/*
@@ -1554,7 +1577,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	 */
 	while (len < statbuf->st_size)
 	{
-		size_t		remaining = statbuf->st_size - len;
+		pgoff_t		remaining = statbuf->st_size - len;
 
 		/* Try to read some more data. */
 		cnt = basebackup_read_file(fd, sink->bbs_buffer,
@@ -1676,10 +1699,37 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 		/*
 		 * If we hit end-of-file, a concurrent truncation must have occurred.
 		 * That's not an error condition, because WAL replay will fix things
-		 * up.
+		 * up.  It might also mean that we need to move to the next input
+		 * segment.
 		 */
 		if (cnt == 0)
+		{
+			/* Are we at the end of a segment?  Try to open the next one. */
+			if (len == ((pgoff_t) segmentno + 1) * RELSEG_SIZE * BLCKSZ)
+			{
+				char		nextpath[MAXPGPATH];
+				int			nextfd;
+
+				/* Build the next segment's path and try to open it. */
+				snprintf(nextpath, sizeof(nextpath), "%s.%d",
+						 readfilename, segmentno + 1);
+				nextfd = OpenTransientFile(nextpath, O_RDONLY | PG_BINARY);
+				if (nextfd < 0)
+				{
+					if (errno == ENOENT)
+						break;
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not open file \"%s\": %m", nextpath)));
+				}
+
+				close(fd);
+				fd = nextfd;
+				++segmentno;
+				continue;
+			}
+
+			/* Otherwise we're at the end of input data. */
 			break;
+		}
 
 		/* Archive the data we just read. */
 		bbsink_archive_contents(sink, cnt);
@@ -1695,8 +1745,8 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	/* If the file was truncated while we were sending it, pad it with zeros */
 	while (len < statbuf->st_size)
 	{
-		size_t		remaining = statbuf->st_size - len;
-		size_t		nbytes = Min(sink->bbs_buffer_length, remaining);
+		pgoff_t		remaining = statbuf->st_size - len;
+		pgoff_t		nbytes = Min(sink->bbs_buffer_length, remaining);
 
 		MemSet(sink->bbs_buffer, 0, nbytes);
 		if (pg_checksum_update(&checksum_ctx,
-- 
2.40.1

#2Pavel Stehule
pavel.stehule@gmail.com
In reply to: Thomas Munro (#1)
Re: Large files for relations

Hi

I like this patch - it can save some system resources - I am not sure how
much, because bigger tables usually use partitioning.

Important note - this feature breaks sharing files on the backup side - so
before disabling 1GB sized files, this issue should be solved.

Regards

Pavel

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Pavel Stehule (#2)
Re: Large files for relations

On Tue, May 2, 2023 at 3:28 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:

I like this patch - it can save some system resources - I am not sure how much, because bigger tables usually use partitioning.

Yeah, if you only use partitions of < 1GB it won't make a difference.
Larger partitions are not uncommon, though.

Important note - this feature breaks sharing files on the backup side - so before disabling 1GB sized files, this issue should be solved.

Hmm, right, so there is a backup granularity continuum with "whole
database cluster" at one end, "only files whose size, mtime [or
optionally also checksum] changed since last backup" in the middle,
and "only blocks that changed since LSN of last backup" at the other
end. Getting closer to the right end of that continuum can make
backups require less reading, less network transfer, less writing
and/or less storage space depending on details. But this proposal
moves the middle thing further to the left by changing the granularity
from 1GB to whole relation, which can be gargantuan with this patch.
Ultimately we need to be all the way at the right on that continuum,
and there are clearly several people working on that goal.

I'm not involved in any of those projects, but it's fun to think about
an alien technology that produces complete standalone backups like
rsync --link-dest (as opposed to "full" backups followed by a chain of
"incremental" backups that depend on it so you need to retain them
carefully) while still sharing disk blocks with older backups, and
doing so with block granularity. TL;DW something something WAL
something something copy_file_range().

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#3)
Re: Large files for relations

On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:

rsync --link-dest

I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup). Something
like --copy-range-dest. That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.
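
To make that a bit more concrete, here is a rough user-space sketch of
just the copy_file_range() part of such a mode (not rsync itself; the
chunk size, the same-offset-only comparison and the file arguments are
purely illustrative assumptions):

    /*
     * Hypothetical sketch: build a new backup of SRC that shares
     * unchanged chunks with REF (the previous backup) via
     * copy_file_range(), and writes only the changed chunks.
     */
    #define _GNU_SOURCE
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)     /* compare and share 1MB at a time */

    int
    main(int argc, char **argv)
    {
        int         src, ref, dst;
        char       *a = malloc(CHUNK);
        char       *b = malloc(CHUNK);
        off_t       off = 0;

        if (argc != 4 || !a || !b)
            return 1;
        src = open(argv[1], O_RDONLY);  /* live data file */
        ref = open(argv[2], O_RDONLY);  /* previous backup */
        dst = open(argv[3], O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (src < 0 || ref < 0 || dst < 0)
            return 1;

        for (;;)
        {
            ssize_t     n = pread(src, a, CHUNK, off);

            if (n < 0)
                return 1;
            if (n == 0)
                break;              /* end of source */

            if (pread(ref, b, n, off) == n && memcmp(a, b, n) == 0)
            {
                /* Unchanged chunk: share/clone it from the reference. */
                off_t       in = off, out = off;

                if (copy_file_range(ref, &in, dst, &out, n, 0) != n)
                    return 1;       /* real code would loop or fall back */
            }
            else
            {
                /* Changed chunk (or reference too short): write new data. */
                if (pwrite(dst, a, n, off) != n)
                    return 1;
            }
            off += n;
        }
        return 0;
    }

On a file system with block sharing the "unchanged" branch costs no new
space; everywhere else the kernel can still copy more efficiently than a
read/write loop.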

#5Corey Huinker
corey.huinker@gmail.com
In reply to: Thomas Munro (#4)
Re: Large files for relations

On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

rsync --link-dest

I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup). Something
like --copy-range-dest. That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.

I understand the need to reduce open file handles, despite the
possibilities enabled by using large numbers of small file sizes.
Snowflake, for instance, sees everything in 1MB chunks, which makes
massively parallel sequential scans (Snowflake's _only_ query plan)
possible, though I don't know if they accomplish that via separate files,
or via segments within a large file.

I am curious whether a move like this to create a generational change in
file file format shouldn't be more ambitious, perhaps altering the block
format to insert a block format version number, whether that be at every
block, or every megabyte, or some other interval, and whether we store it
in-file or in a separate file to accompany the first non-segmented. Having
such versioning information would allow blocks of different formats to
co-exist in the same table, which could be critical to future changes such
as 64 bit XIDs, etc.

#6Stephen Frost
sfrost@snowman.net
In reply to: Corey Huinker (#5)
Re: Large files for relations

Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:

On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

rsync --link-dest

... rsync isn't really a safe tool to use for PG backups by itself
unless you're using it with archiving and with start/stop backup and
with checksums enabled.

I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup). Something
like --copy-range-dest. That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.

There's also really good reasons to have multiple full backups and not
just a single full backup and then lots and lots of incrementals which
basically boils down to "are you really sure that one copy of that one
really important file won't every disappear from your backup
repository..?"

That said, pgbackrest does now have block-level incremental backups
(where we define our own block size ...) and there's reasons we decided
against going down the LSN-based approach (not the least of which is
that the LSN isn't always updated...), but long story short, moving to
larger than 1G files should be something that pgbackrest will be able
to handle without as much impact as there would have been previously in
terms of incremental backups. There is a loss in the ability to use
mtime to scan just the parts of the relation that changed and that's
unfortunate but I wouldn't see it as really a game changer (and yes,
there's certainly an argument for not trusting mtime, though I don't
think we've yet had a report where there was an mtime issue that our
mtime-validity checking didn't catch and force pgbackrest into
checksum-based revalidation automatically which resulted in an invalid
backup... of course, not enough people test their backups...).

I understand the need to reduce open file handles, despite the
possibilities enabled by using large numbers of small file sizes.

I'm also generally in favor of reducing the number of open file handles
that we have to deal with. Addressing the concerns raised nearby about
weird corner-cases of non-1G length ABCDEF.1 files existing while
ABCDEF.2, and more, files exist is certainly another good argument in
favor of getting rid of segments.

I am curious whether a move like this to create a generational change in
file format shouldn't be more ambitious, perhaps altering the block
format to insert a block format version number, whether that be at every
block, or every megabyte, or some other interval, and whether we store it
in-file or in a separate file to accompany the first non-segmented. Having
such versioning information would allow blocks of different formats to
co-exist in the same table, which could be critical to future changes such
as 64 bit XIDs, etc.

To the extent you're interested in this, there are patches posted which
are already trying to move us in a direction that would allow for
different page formats that add in space for other features such as
64bit XIDs, better checksums, and TDE tags to be supported.

https://commitfest.postgresql.org/43/3986/

Currently those patches are expecting it to be declared at initdb time,
but the way they're currently written that's more of a soft requirement
as you can tell on a per-page basis what features are enabled for that
page. Might make sense to support it in that form first anyway though,
before going down the more ambitious route of allowing different pages
to have different sets of features enabled for them concurrently.

When it comes to 'a separate file', we do have forks already and those
serve a very valuable but distinct use-case where you can get
information from the much smaller fork (be it the FSM or the VM or some
future thing) while something like 64bit XIDs or a stronger checksum is
something you'd really need on every page. I have serious doubts about
a proposal where we'd store information needed on every page read in
some far away block that's still in the same file such as using
something every 1MB as that would turn every block access into two..

Thanks,

Stephen

#7Jim Mlodgenski
jimmy76@gmail.com
In reply to: Thomas Munro (#1)
Re: Large files for relations

On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?

There is some trouble in the ambiguity of what we mean by "modern" and
"large files". There are still a large number of users of ext4 where the
max file size is 16TB. Switching to a single large file per relation would
effectively cut the max table size in half for those users. How would a
user with say a 20TB table running on ext4 be impacted by this change?

#8Thomas Munro
thomas.munro@gmail.com
In reply to: Jim Mlodgenski (#7)
Re: Large files for relations

On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:

On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?

There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4 where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with say a 20TB table running on ext4 be impacted by this change?

Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode like DB2, Oracle etc, in which case I think you can have multiple
16TB segment files and get past that ext4 limit). It's frustrating
because 16TB is still really, really big and you probably should be
using partitions, or more partitions, to avoid all kinds of other
scalability problems at that size. But however hypothetical the
scenario might be, it should work, and this is certainly a plausible
argument against the "aggressive" plan described above with the hard
cut-off where we get to drop the segmented mode.

Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
concatenate with the above patches, so you'd have to use link or
reflink mode (you'd probably want to use that anyway unless due to
sheer volume of data to copy otherwise, since ext4 is also not capable
of block-range sharing), but then you'd be out of luck after N future
major releases, according to that plan where we start deleting the
code, so you'd need to organise some smaller partitions before that
time comes. Or pg_upgrade to a target on xfs etc. I wonder if a
future version of extN will increase its max file size.

A less aggressive version of the plan would be that we just keep the
segment code for the foreseeable future with no planned cut off, and
we make all of those "piggy back" transformations that I showed in the
patch set optional. For example, I had it so that CLUSTER would
quietly convert your relation to large format, if it was still in
segmented format (might as well if you're writing all the data out
anyway, right?), but perhaps that could depend on a GUC. Likewise for
base backup. Etc. Then someone concerned about hitting the 16TB
limit on ext4 could opt out. Or something like that. It seems funny
though, that's exactly the user who should want this feature (they
have 16,000 relation segment files).

#9Dagfinn Ilmari Mannsåker
ilmari@ilmari.org
In reply to: Thomas Munro (#8)
Re: Large files for relations

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:

On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?

There is some trouble in the ambiguity of what we mean by "modern" and
"large files". There are still a large number of users of ext4 where
the max file size is 16TB. Switching to a single large file per
relation would effectively cut the max table size in half for those
users. How would a user with say a 20TB table running on ext4 be
impacted by this change?

[…]

A less aggressive version of the plan would be that we just keep the
segment code for the foreseeable future with no planned cut off, and
we make all of those "piggy back" transformations that I showed in the
patch set optional. For example, I had it so that CLUSTER would
quietly convert your relation to large format, if it was still in
segmented format (might as well if you're writing all the data out
anyway, right?), but perhaps that could depend on a GUC. Likewise for
base backup. Etc. Then someone concerned about hitting the 16TB
limit on ext4 could opt out. Or something like that. It seems funny
though, that's exactly the user who should want this feature (they
have 16,000 relation segment files).

If we're going to have to keep the segment code for the foreseeable
future anyway, could we not get most of the benefit by increasing the
segment size to something like 1TB? The vast majority of tables would
fit in one file, and there would be less risk of hitting filesystem
limits.

- ilmari

#10Jim Mlodgenski
jimmy76@gmail.com
In reply to: Thomas Munro (#8)
Re: Large files for relations

On Thu, May 11, 2023 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:

On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com>

wrote:

I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?

There is some trouble in the ambiguity of what we mean by "modern" and

"large files". There are still a large number of users of ext4 where the
max file size is 16TB. Switching to a single large file per relation would
effectively cut the max table size in half for those users. How would a
user with say a 20TB table running on ext4 be impacted by this change?

Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode like DB2, Oracle etc, in which case I think you can have multiple
16TB segment files and get past that ext4 limit). It's frustrating
because 16TB is still really, really big and you probably should be
using partitions, or more partitions, to avoid all kinds of other
scalability problems at that size. But however hypothetical the
scenario might be, it should work,

Agreed, it is frustrating, but it is not hypothetical. I have seen a number
of users with single tables larger than 16TB who don't use partitioning
because of the limitations we have today. The most common reason is needing
multiple unique constraints on the table that don't include the partition
key, something like a user_id and an email. There are workarounds for those
cases, but usually it's easier to deal with a single large table than to
deal with the sharp edges those workarounds introduce.

#11Stephen Frost
sfrost@snowman.net
In reply to: Dagfinn Ilmari Mannsåker (#9)
Re: Large files for relations

Greetings,

* Dagfinn Ilmari Mannsåker (ilmari@ilmari.org) wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:

On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?

There is some trouble in the ambiguity of what we mean by "modern" and
"large files". There are still a large number of users of ext4 where
the max file size is 16TB. Switching to a single large file per
relation would effectively cut the max table size in half for those
users. How would a user with say a 20TB table running on ext4 be
impacted by this change?

[…]

A less aggressive version of the plan would be that we just keep the
segment code for the foreseeable future with no planned cut off, and
we make all of those "piggy back" transformations that I showed in the
patch set optional. For example, I had it so that CLUSTER would
quietly convert your relation to large format, if it was still in
segmented format (might as well if you're writing all the data out
anyway, right?), but perhaps that could depend on a GUC. Likewise for
base backup. Etc. Then someone concerned about hitting the 16TB
limit on ext4 could opt out. Or something like that. It seems funny
though, that's exactly the user who should want this feature (they
have 16,000 relation segment files).

If we're going to have to keep the segment code for the foreseeable
future anyway, could we not get most of the benefit by increasing the
segment size to something like 1TB? The vast majority of tables would
fit in one file, and there would be less risk of hitting filesystem
limits.

While I tend to agree that 1GB is too small, 1TB seems like it's
possibly going to end up on the too big side of things, or at least,
if we aren't getting rid of the segment code then it's possibly throwing
away the benefits we have from the smaller segments without really
giving us all that much. Going from 1G to 10G would reduce the number
of open file descriptors by quite a lot without having much of a net
change on other things. 50G or 100G would reduce the FD handles further
but starts to make us lose out a bit more on some of the nice parts of
having multiple segments.
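
For a rough sense of scale, taking the 20TB table mentioned upthread,
the per-fork file counts work out to roughly:

    1G segments:   20,480 files
    10G segments:   2,048 files
    50G segments:     410 files
    100G segments:    205 files
    1T segments:       20 files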

Just some thoughts.

Thanks,

Stephen

#12MARK CALLAGHAN
mdcallag@gmail.com
In reply to: Thomas Munro (#8)
Re: Large files for relations

Repeating what was mentioned on Twitter, because I had some experience with
the topic. With fewer files per table there will be more contention on the
per-inode mutex (which might now be the per-inode rwsem). I haven't read
filesystem source in a long time. Back in the day, and perhaps today, it
was locked for the duration of a write to storage (locked within the
kernel) and was briefly locked while setting up a read.

The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes
faster (yes disks, I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't
locked for the duration of a write

I have a vague memory that filesystems have improved in this regard.
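
For anyone who hasn't used it, workaround (2) from user space is roughly
this minimal sketch, assuming Linux (the path and sizes are made up; the
point is that buffer, offset and length must all be suitably aligned):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        void       *buf;
        int         fd;

        /* O_DIRECT requires aligned buffers, offsets and lengths. */
        if (posix_memalign(&buf, 4096, 8192) != 0)
            return 1;
        memset(buf, 0, 8192);

        fd = open("/xfs-mount/datafile", O_RDWR | O_CREAT | O_DIRECT, 0600);
        if (fd < 0)
            return 1;

        /* Aligned direct write: bypasses the page cache entirely. */
        if (pwrite(fd, buf, 8192, 0) != 8192)
            return 1;
        return close(fd);
    }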

On Thu, May 11, 2023 at 4:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:

On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com>

wrote:

I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?

There is some trouble in the ambiguity of what we mean by "modern" and

"large files". There are still a large number of users of ext4 where the
max file size is 16TB. Switching to a single large file per relation would
effectively cut the max table size in half for those users. How would a
user with say a 20TB table running on ext4 be impacted by this change?

Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode like DB2, Oracle etc, in which case I think you can have multiple
16TB segment files and get past that ext4 limit). It's frustrating
because 16TB is still really, really big and you probably should be
using partitions, or more partitions, to avoid all kinds of other
scalability problems at that size. But however hypothetical the
scenario might be, it should work, and this is certainly a plausible
argument against the "aggressive" plan described above with the hard
cut-off where we get to drop the segmented mode.

Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
concatenate with the above patches, so you'd have to use link or
reflink mode (you'd probably want to use that anyway unless due to
sheer volume of data to copy otherwise, since ext4 is also not capable
of block-range sharing), but then you'd be out of luck after N future
major releases, according to that plan where we start deleting the
code, so you'd need to organise some smaller partitions before that
time comes. Or pg_upgrade to a target on xfs etc. I wonder if a
future version of extN will increase its max file size.

A less aggressive version of the plan would be that we just keep the
segment code for the foreseeable future with no planned cut off, and
we make all of those "piggy back" transformations that I showed in the
patch set optional. For example, I had it so that CLUSTER would
quietly convert your relation to large format, if it was still in
segmented format (might as well if you're writing all the data out
anyway, right?), but perhaps that could depend on a GUC. Likewise for
base backup. Etc. Then someone concerned about hitting the 16TB
limit on ext4 could opt out. Or something like that. It seems funny
though, that's exactly the user who should want this feature (they
have 16,000 relation segment files).

--
Mark Callaghan
mdcallag@gmail.com

#13Thomas Munro
thomas.munro@gmail.com
In reply to: MARK CALLAGHAN (#12)
Re: Large files for relations

On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:

Repeating what was mentioned on Twitter, because I had some experience with the topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be the per-inode rwsem). I haven't read filesystem source in a long time. Back in the day, and perhaps today, it was locked for the duration of a write to storage (locked within the kernel) and was briefly locked while setting up a read.

The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write

I have a vague memory that filesystems have improved in this regard.

(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Right, 80s file systems like UFS (and I suspect ext and ext2, which
were probably based on similar ideas and ran on non-SMP machines?)
used coarse grained locking including vnodes/inodes level. Then over
time various OSes and file systems have improved concurrency. Brief
digression, as someone who got started on IRIX in the 90 and still
thinks those were probably the coolest computers: At SGI, first they
replaced SysV UFS with EFS (E for extent-based allocation) and
invented O_DIRECT to skip the buffer pool, and then blew the doors off
everything with XFS, which maximised I/O concurrency and possibly (I
guess, it's not open source so who knows?) involved a revamped VFS to
lower stuff like inode locks, motivated by monster IRIX boxes with up
to 1024 CPUs and huge storage arrays. In the Linux ext3 era, I
remember hearing lots of reports of various kinds of large systems
going faster just by switching to XFS and there is lots of writing
about that. ext4 certainly changed enormously. One reason back in
those days (mid 2000s?) was the old
fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file
thing, and another was the lack of write concurrency especially for
direct I/O, and probably lots more things. But that's all ancient
history...

As for ext4, we've detected and debugged clues about the gradual
weakening of locking over time on this list: we know that concurrent
read/write to the same page of a file was previously atomic, but when
we switched to pread/pwrite for most data (ie not making use of the
current file position), it ceased to be (a concurrent reader can see a
mash-up of old and new data with visible cache line-ish stripes in it,
so there isn't even a write-lock for the page); then we noticed that
in later kernels even read/write ceased to be atomic (implicating a
change in file size/file position interlocking, I guess). I also
vaguely recall reading on here a long time ago that lseek()
performance was dramatically improved with weaker inode interlocking,
perhaps even in response to this very program's pathological SEEK_END
call frequency (something I hope to fix, but I digress). So I think
it's possible that the effect you mentioned is gone?

I can think of a few differences compared to those other RDBMSs.
There the discussion was about one-file-per-relation vs
one-big-file-for-everything, whereas we're talking about
one-file-per-relation vs many-files-per-relation (which doesn't change
the point much, just making clear that I'm not proposing a 42PB file
to hold everything, so you can still partition to get different
files). We also usually call fsync in series in our checkpointer
(after first getting the writebacks started with sync_file_range()
some time sooner). Currently our code believes that it is not safe to
call fdatasync() for files whose size might have changed. There is no
basis for that in POSIX or in any system that I currently know of
(though I haven't looked into it seriously), but I believe there was a
historical file system that at some point in history interpreted
"non-essential meta data" (the stuff POSIX allows it not to flush to
disk) to include "the size of the file" (whereas POSIX really just
meant that you don't have to synchronise the mtime and similar), which
is probably why PostgreSQL has some code that calls fsync() on newly
created empty WAL segments to "make sure the indirect blocks are down
on disk" before allowing itself to use only fdatasync() later to
overwrite it with data. The point being that, for the most important
kind of interactive/user facing I/O latency, namely WAL flushes, we
already use fdatasync(). It's possible that we could use it to flush
relation data too (ie the relation files in question here, usually
synchronised by the checkpointer) according to POSIX but it doesn't
immediately seem like something that should be at all hot and it's
background work. But perhaps I lack imagination.
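
For what it's worth, the pattern I mean is roughly this minimal sketch,
assuming Linux (the path and sizes are made up for illustration):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char        block[8192];
        int         fd = open("/tmp/relation-sketch", O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return 1;
        memset(block, 0, sizeof(block));

        /* "Checkpoint write": dirty some pages in the kernel's cache. */
        if (pwrite(fd, block, sizeof(block), 0) != (ssize_t) sizeof(block))
            return 1;

        /* Hint: start write-back of that range now, without waiting. */
        (void) sync_file_range(fd, 0, sizeof(block), SYNC_FILE_RANGE_WRITE);

        /* ... much later, when completing the checkpoint ... */

        /*
         * Durability: fdatasync() need not flush non-essential metadata
         * such as mtime, so it can be cheaper than fsync() while still
         * covering the file contents.
         */
        if (fdatasync(fd) != 0)
            return 1;
        return close(fd);
    }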

Thanks, thought-provoking stuff.

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#13)
Re: Large files for relations

On Sat, May 13, 2023 at 11:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:

use XFS and O_DIRECT

As for direct I/O, we're only just getting started on that. We
currently can't produce more than one concurrent WAL write, and then
for relation data, we just got very basic direct I/O support but we
haven't yet got the asynchronous machinery to drive it properly (work
in progress, more soon). I was just now trying to find out what the
state of parallel direct writes is in ext4, and it looks like it's
finally happening:

https://www.phoronix.com/news/Linux-6.3-EXT4

#15MARK CALLAGHAN
mdcallag@gmail.com
In reply to: Thomas Munro (#13)
Re: Large files for relations

On Fri, May 12, 2023 at 4:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:

Repeating what was mentioned on Twitter, because I had some experience

with the topic. With fewer files per table there will be more contention on
the per-inode mutex (which might now be the per-inode rwsem). I haven't
read filesystem source in a long time. Back in the day, and perhaps today,
it was locked for the duration of a write to storage (locked within the
kernel) and was briefly locked while setting up a read.

The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes

faster (yes disks, I encountered this prior to 2010)

2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't

locked for the duration of a write

I have a vague memory that filesystems have improved in this regard.

(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Yes, although when the decision was made it was probably ext-3 -> XFS. We
suffered from fsync a file == fsync the filesystem
because MySQL binlogs use buffered IO and are appended on write. Switching
from ext-? to XFS was an easy perf win
so I don't have much experience with ext-? over the past decade.

Right, 80s file systems like UFS (and I suspect ext and ext2, which

Late 80s is when I last hacked on Unix filesystem code, excluding browsing XFS
and ext source. Unix was easy back then -- one big kernel lock covered
everything.

some time sooner). Currently our code believes that it is not safe to
call fdatasync() for files whose size might have changed. There is no

Long ago we added code for InnoDB to avoid fsync/fdatasync in some cases
when O_DIRECT was used. While great for performance
we also forgot to make sure they were still done when files were extended.
Eventually we fixed that.

Thanks for all of the details.

--
Mark Callaghan
mdcallag@gmail.com

#16Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#11)
Re: Large files for relations

On Fri, May 12, 2023 at 9:53 AM Stephen Frost <sfrost@snowman.net> wrote:

While I tend to agree that 1GB is too small, 1TB seems like it's
possibly going to end up on the too big side of things, or at least,
if we aren't getting rid of the segment code then it's possibly throwing
away the benefits we have from the smaller segments without really
giving us all that much. Going from 1G to 10G would reduce the number
of open file descriptors by quite a lot without having much of a net
change on other things. 50G or 100G would reduce the FD handles further
but starts to make us lose out a bit more on some of the nice parts of
having multiple segments.

This is my view as well, more or less. I don't really like our current
handling of relation segments; we know it has bugs, and making it
non-buggy feels difficult. And there are performance issues as well --
file descriptor consumption, for sure, but also probably that crossing
a file boundary likely breaks the operating system's ability to do
readahead to some degree. However, I think we're going to find that
moving to a system where we have just one file per relation fork and
that file can be arbitrarily large is not fantastic, either. Jim's
point about running into filesystem limits is a good one (hi Jim, long
time no see!) and the problem he points out with ext4 is almost
certainly not the only one. It doesn't just have to be filesystems,
either. It could be a limitation of an archiving tool (tar, zip, cpio)
or a file copy utility or whatever as well. A quick Google search
suggests that most such things have been updated to use 64-bit sizes,
but my point is that the set of things that can potentially cause
problems is broader than just the filesystem. Furthermore, even when
there's no hard limit at play, a smaller file size can occasionally be
*convenient*, as in Pavel's example of using hard links to share
storage between backups. From that point of view, a 16GB or 64GB or
256GB file size limit seems more convenient than no limit and more
convenient than a large limit like 1TB.

However, the bugs are the flies in the ointment (ahem). If we just
make the segment size bigger but don't get rid of segments altogether,
then we still have to fix the bugs that can occur when you do have
multiple segments. I think part of Thomas's motivation is to dodge
that whole category of problems. If we gradually deprecate
multi-segment mode in favor of single-file-per-relation-fork, then the
fact that the segment handling code has bugs becomes progressively
less relevant. While that does make some sense, I'm not sure I really
agree with the approach. The problem is that we're trading problems
that we at least theoretically can fix somehow by hitting our code
with a big enough hammer for an unknown set of problems that stem from
limitations of software we don't control, maybe don't even know about.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#16)
Re: Large files for relations

Thanks all for the feedback. It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode. And the
automatic transition schemes I showed don't make much sense without
that goal.

What I'm hearing is that something simple like this might be more acceptable:

* initdb --rel-segsize (cf --wal-segsize), default unchanged
* pg_upgrade would convert if source and target don't match

I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.

I would probably leave the experimental copy_file_range() ideas out too,
for separate discussion in a separate proposal.

#18Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Thomas Munro (#17)
Re: Large files for relations

On 24.05.23 02:34, Thomas Munro wrote:

Thanks all for the feedback. It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode. And the
automatic transition schemes I showed don't make much sense without
that goal.

What I'm hearing is that something simple like this might be more acceptable:

* initdb --rel-segsize (cf --wal-segsize), default unchanged

makes sense

* pg_upgrade would convert if source and target don't match

This would be good, but it could also be an optional or later feature.

Maybe that should be a different mode, like
--copy-and-adjust-as-necessary, so that users would have to opt into
what would presumably be slower than plain --copy, rather than being
surprised by it, if they unwittingly used incompatible initdb options.

I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.

Those changes from off_t to pgoff_t? Yes, it would be good to do
without those. Apart from the practical problems that have been brought
up, this was a major annoyance with the proposed patch set IMO.

I would probably leave the experimental copy_on_write() ideas out too,
for separate discussion in a separate proposal.

right

#19Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#18)
Re: Large files for relations

On Wed, May 24, 2023 at 2:18 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

What I'm hearing is that something simple like this might be more acceptable:

* initdb --rel-segsize (cf --wal-segsize), default unchanged

makes sense

+1.

* pg_upgrade would convert if source and target don't match

This would be good, but it could also be an optional or later feature.

+1. I think that would be nice to have, but not absolutely required.

IMHO it's best not to overcomplicate these projects. Not everything
needs to be part of the initial commit. If the initial commit happens
2 months from now and then stuff like this gets added over the next 8,
that's strictly better than trying to land the whole patch set next
March.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Stephen Frost
sfrost@snowman.net
In reply to: Peter Eisentraut (#18)
Re: Large files for relations

Greetings,

* Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote:

On 24.05.23 02:34, Thomas Munro wrote:

Thanks all for the feedback. It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode. And the
automatic transition schemes I showed don't make much sense without
that goal.

What I'm hearing is that something simple like this might be more acceptable:

* initdb --rel-segsize (cf --wal-segsize), default unchanged

makes sense

Agreed, this seems alright in general. Having more initdb-time options
to help with certain use-cases, rather than having things be fixed at
compile time, is generally a good direction to be going in, imv.

* pg_upgrade would convert if source and target don't match

This would be good, but it could also be an optional or later feature.

Agreed.

Maybe that should be a different mode, like --copy-and-adjust-as-necessary,
so that users would have to opt into what would presumably be slower than
plain --copy, rather than being surprised by it, if they unwittingly used
incompatible initdb options.

I'm curious as to why it would be slower than a regular copy..?

I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.

Those changes from off_t to pgoff_t? Yes, it would be good to do without
those. Apart from the practical problems that have been brought up, this was
a major annoyance with the proposed patch set IMO.

I would probably leave the experimental copy_on_write() ideas out too,
for separate discussion in a separate proposal.

right

You mean copy_file_range() here, right?

Shouldn't we just add support for that today into pg_upgrade,
independently of this? Seems like a worthwhile improvement even without
the benefit it would provide to changing segment sizes.
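
For reference, the plain syscall loop is roughly this (a standalone
sketch against the Linux/FreeBSD copy_file_range() interface, not code
from the patch set, and with error handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy one file, letting the kernel reflink/offload where it can. */
int
main(int argc, char **argv)
{
	int			srcfd;
	int			dstfd;
	struct stat st;
	off_t		remaining;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
		return 1;
	}
	srcfd = open(argv[1], O_RDONLY);
	if (srcfd < 0 || fstat(srcfd, &st) < 0)
	{
		perror(argv[1]);
		return 1;
	}
	dstfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 0777);
	if (dstfd < 0)
	{
		perror(argv[2]);
		return 1;
	}
	remaining = st.st_size;
	while (remaining > 0)
	{
		ssize_t		n = copy_file_range(srcfd, NULL, dstfd, NULL,
										(size_t) remaining, 0);

		if (n <= 0)
		{
			if (n < 0)
				perror("copy_file_range");
			break;
		}
		remaining -= n;
	}
	close(srcfd);
	close(dstfd);
	return remaining == 0 ? 0 : 1;
}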

Thanks,

Stephen

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Stephen Frost (#20)
1 attachment(s)
Re: Large files for relations

On Thu, May 25, 2023 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote:

* Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote:

On 24.05.23 02:34, Thomas Munro wrote:

* pg_upgrade would convert if source and target don't match

This would be good, but it could also be an optional or later feature.

Agreed.

OK. I do have a patch for that, but I'll put that (+ copy_file_range)
aside for now so we can talk about the basic feature. Without that,
pg_upgrade just rejects mismatching clusters as it always did, no
change required.

I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.

Those changes from off_t to pgoff_t? Yes, it would be good to do without
those. Apart from the practical problems that have been brought up, this was
a major annoyance with the proposed patch set IMO.

+1, it was not nice.

Alright, since I had some time to kill in an airport, here is a
starter patch for initdb --rel-segsize. Some random thoughts:

Another potential option name would be --segsize, if we think we're
going to use this for temp files too eventually.

Maybe it's not so beautiful to have that global variable
rel_segment_size (which replaces REL_SEGSIZE everywhere). Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that. That could be a nice place to compute the
"shift" value up front, instead of computing it each time in
blockno_to_segno(), but that's probably not worth bothering with (?).
BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
about the only place where someone could say that this change makes
things worse for people not interested in the new feature, so I was
careful to get rid of / and % operations with no-longer-constant RHS.
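
To illustrate the shift/mask arithmetic outside the tree (a standalone
sketch; only the blockno_to_segno/blockno_within_segment names match the
patch, the rest is made up for illustration):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cached values, set once when the segment size is known. */
static uint64_t seg_size_blocks;	/* must be a power of two */
static int	seg_shift;			/* log2(seg_size_blocks) */

static void
set_segment_size(uint64_t nblocks)
{
	assert(nblocks > 0 && (nblocks & (nblocks - 1)) == 0);
	seg_size_blocks = nblocks;
	seg_shift = __builtin_ctzll(nblocks);	/* trailing zeros == log2 here */
}

/* Shift instead of "/", mask instead of "%", now that the RHS isn't constant. */
static uint32_t
blockno_to_segno(uint32_t blockno)
{
	return (uint32_t) ((uint64_t) blockno >> seg_shift);
}

static uint32_t
blockno_within_segment(uint32_t blockno)
{
	return (uint32_t) (blockno & (seg_size_blocks - 1));
}

int
main(void)
{
	set_segment_size(131072);	/* 1GB of 8kB blocks */
	printf("block 300000 -> segment %u, offset %u\n",
		   (unsigned) blockno_to_segno(300000),
		   (unsigned) blockno_within_segment(300000));
	return 0;
}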

I had to promote segment size to int64 (global variable, field in
control file), because otherwise it couldn't represent
--rel-segsize=32TB (it'd be too big by one). Other ideas would be to
store the shift value instead of the size, or store the max block
number, eg subtract one, or use InvalidBlockNumber to mean "no limit"
(with more branches to test for it). The only problem I ran into with
the larger type was that 'SHOW segment_size' now needs a custom show
function because we don't have int64 GUCs.
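
To spell out the off-by-one: with the default 8kB blocks, 32TB is exactly
2^32 blocks, one more than a uint32 can hold (trivial standalone check):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t	blcksz = 8192;	/* default BLCKSZ */
	uint64_t	bytes = UINT64_C(32) << 40;	/* 32TB */
	uint64_t	blocks = bytes / blcksz;	/* 4294967296 == 2^32 */

	printf("blocks = %llu, UINT32_MAX = %llu\n",
		   (unsigned long long) blocks,
		   (unsigned long long) UINT32_MAX);
	return 0;
}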

A C type confusion problem that I noticed: some code uses BlockNumber
and some code uses int for segment numbers. It's not really a
reachable problem for practical reasons (you'd need over 2 billion
directories and VFDs to reach it), but it's wrong to use int if
segment size can be set as low as BLCKSZ (one file per block); you
could have more segments than an int can represent. We could go for
uint32, BlockNumber or create SegmentNumber (which I think I've
proposed before, and lost track of...). We can address that
separately (perhaps by finding my old patch...)
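
Roughly what I had in mind for the dedicated type, by analogy with the
way BlockNumber is declared in storage/block.h (hypothetical header-style
sketch only, not the lost patch):

/* Hypothetical segment-number type, mirroring BlockNumber. */
typedef uint32 SegmentNumber;		/* uint32 as defined in c.h */

#define InvalidSegmentNumber	((SegmentNumber) 0xFFFFFFFF)
#define MaxSegmentNumber		((SegmentNumber) 0xFFFFFFFE)

#define SegmentNumberIsValid(segno) \
	((SegmentNumber) (segno) != InvalidSegmentNumber)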

Attachments:

0001-Allow-relation-segment-size-to-be-set-by-initdb.patch (text/x-patch)
From c6809aafd147d0ac286ab73c2d8fbe571c698550 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 26 May 2023 01:41:11 +1200
Subject: [PATCH 1/2] Allow relation segment size to be set by initdb.

Previously, relation segment size was a rarely modified compile time
option.  Make it an initdb option, so that users with very large tables
can avoid using so many files and file descriptors.

The initdb option --rel-segsize is modeled on the existing --wal-segsize
option.

The data type used to store the size is int64, not BlockNumber, because
it seems reasonable to want to be able to say --rel-segsize=32TB (=
don't use segments at all), but that would overflow uint32.

The default behavior is unchanged: 1GB segments.  On Windows, we can't
go above 2GB for now (we'd have to make a lot of changes due to
Windows' small off_t).

Discussion: https://postgr.es/m/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com

diff --git a/configure b/configure
index 1b415142d1..a3dee3ea74 100755
--- a/configure
+++ b/configure
@@ -841,8 +841,6 @@ enable_coverage
 enable_dtrace
 enable_tap_tests
 with_blocksize
-with_segsize
-with_segsize_blocks
 with_wal_blocksize
 with_CC
 with_llvm
@@ -1551,9 +1549,6 @@ Optional Packages:
   --with-pgport=PORTNUM   set default port number [5432]
   --with-blocksize=BLOCKSIZE
                           set table block size in kB [8]
-  --with-segsize=SEGSIZE  set table segment size in GB [1]
-  --with-segsize-blocks=SEGSIZE_BLOCKS
-                          set table segment size in blocks [0]
   --with-wal-blocksize=BLOCKSIZE
                           set WAL block size in kB [8]
   --with-CC=CMD           set compiler (deprecated)
@@ -3731,85 +3726,6 @@ cat >>confdefs.h <<_ACEOF
 _ACEOF
 
 
-#
-# Relation segment size
-#
-
-
-
-# Check whether --with-segsize was given.
-if test "${with_segsize+set}" = set; then :
-  withval=$with_segsize;
-  case $withval in
-    yes)
-      as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
-      ;;
-    no)
-      as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
-      ;;
-    *)
-      segsize=$withval
-      ;;
-  esac
-
-else
-  segsize=1
-fi
-
-
-
-
-
-# Check whether --with-segsize-blocks was given.
-if test "${with_segsize_blocks+set}" = set; then :
-  withval=$with_segsize_blocks;
-  case $withval in
-    yes)
-      as_fn_error $? "argument required for --with-segsize-blocks option" "$LINENO" 5
-      ;;
-    no)
-      as_fn_error $? "argument required for --with-segsize-blocks option" "$LINENO" 5
-      ;;
-    *)
-      segsize_blocks=$withval
-      ;;
-  esac
-
-else
-  segsize_blocks=0
-fi
-
-
-
-# If --with-segsize-blocks is non-zero, it is used, --with-segsize
-# otherwise. segsize-blocks is only really useful for developers wanting to
-# test segment related code. Warn if both are used.
-if test $segsize_blocks -ne 0 -a $segsize -ne 1; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: both --with-segsize and --with-segsize-blocks specified, --with-segsize-blocks wins" >&5
-$as_echo "$as_me: WARNING: both --with-segsize and --with-segsize-blocks specified, --with-segsize-blocks wins" >&2;}
-fi
-
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for segment size" >&5
-$as_echo_n "checking for segment size... " >&6; }
-if test $segsize_blocks -eq 0; then
-  # this expression is set up to avoid unnecessary integer overflow
-  # blocksize is already guaranteed to be a factor of 1024
-  RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
-  test $? -eq 0 || exit 1
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: ${segsize}GB" >&5
-$as_echo "${segsize}GB" >&6; }
-else
-  RELSEG_SIZE=$segsize_blocks
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: ${RELSEG_SIZE} blocks" >&5
-$as_echo "${RELSEG_SIZE} blocks" >&6; }
-fi
-
-
-cat >>confdefs.h <<_ACEOF
-#define RELSEG_SIZE ${RELSEG_SIZE}
-_ACEOF
-
-
 #
 # WAL block size
 #
@@ -15548,13 +15464,6 @@ _ACEOF
 
 
 
-# If we don't have largefile support, can't handle segment size >= 2GB.
-if test "$ac_cv_sizeof_off_t" -lt 8; then
-  if expr $RELSEG_SIZE '*' $blocksize '>=' 2 '*' 1024 '*' 1024; then
-    as_fn_error $? "Large file support is not enabled. Segment size cannot be larger than 1GB." "$LINENO" 5
-  fi
-fi
-
 # The cast to long int works around a bug in the HP C Compiler
 # version HP92453-01 B.11.11.23709.GP, which incorrectly rejects
 # declarations like `int a3[[(sizeof (unsigned char)) >= 0]];'.
diff --git a/configure.ac b/configure.ac
index 09558ada0f..1c3c7cad4f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -282,54 +282,6 @@ AC_DEFINE_UNQUOTED([BLCKSZ], ${BLCKSZ}, [
  Changing BLCKSZ requires an initdb.
 ])
 
-#
-# Relation segment size
-#
-PGAC_ARG_REQ(with, segsize, [SEGSIZE], [set table segment size in GB [1]],
-             [segsize=$withval],
-             [segsize=1])
-PGAC_ARG_REQ(with, segsize-blocks, [SEGSIZE_BLOCKS], [set table segment size in blocks [0]],
-             [segsize_blocks=$withval],
-             [segsize_blocks=0])
-
-# If --with-segsize-blocks is non-zero, it is used, --with-segsize
-# otherwise. segsize-blocks is only really useful for developers wanting to
-# test segment related code. Warn if both are used.
-if test $segsize_blocks -ne 0 -a $segsize -ne 1; then
-  AC_MSG_WARN([both --with-segsize and --with-segsize-blocks specified, --with-segsize-blocks wins])
-fi
-
-AC_MSG_CHECKING([for segment size])
-if test $segsize_blocks -eq 0; then
-  # this expression is set up to avoid unnecessary integer overflow
-  # blocksize is already guaranteed to be a factor of 1024
-  RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
-  test $? -eq 0 || exit 1
-  AC_MSG_RESULT([${segsize}GB])
-else
-  RELSEG_SIZE=$segsize_blocks
-  AC_MSG_RESULT([${RELSEG_SIZE} blocks])
-fi
-
-AC_DEFINE_UNQUOTED([RELSEG_SIZE], ${RELSEG_SIZE}, [
- RELSEG_SIZE is the maximum number of blocks allowed in one disk file.
- Thus, the maximum size of a single file is RELSEG_SIZE * BLCKSZ;
- relations bigger than that are divided into multiple files.
-
- RELSEG_SIZE * BLCKSZ must be less than your OS' limit on file size.
- This is often 2 GB or 4GB in a 32-bit operating system, unless you
- have large file support enabled.  By default, we make the limit 1 GB
- to avoid any possible integer-overflow problems within the OS.
- A limit smaller than necessary only means we divide a large
- relation into more chunks than necessary, so it seems best to err
- in the direction of a small limit.
-
- A power-of-2 value is recommended to save a few cycles in md.c,
- but is not absolutely required.
-
- Changing RELSEG_SIZE requires an initdb.
-])
-
 #
 # WAL block size
 #
@@ -1757,13 +1709,6 @@ fi
 dnl Check for largefile support (must be after AC_SYS_LARGEFILE)
 AC_CHECK_SIZEOF([off_t])
 
-# If we don't have largefile support, can't handle segment size >= 2GB.
-if test "$ac_cv_sizeof_off_t" -lt 8; then
-  if expr $RELSEG_SIZE '*' $blocksize '>=' 2 '*' 1024 '*' 1024; then
-    AC_MSG_ERROR([Large file support is not enabled. Segment size cannot be larger than 1GB.])
-  fi
-fi
-
 AC_CHECK_SIZEOF([bool], [],
 [#ifdef HAVE_STDBOOL_H
 #include <stdbool.h>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5da74b3c40..d739577982 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10877,10 +10877,9 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       <listitem>
        <para>
         Reports the number of blocks (pages) that can be stored within a file
-        segment.  It is determined by the value of <literal>RELSEG_SIZE</literal>
-        when building the server.  The maximum size of a segment file in bytes
-        is equal to <varname>segment_size</varname> multiplied by
-        <varname>block_size</varname>; by default this is 1GB.
+        segment.  It is changeable with the <literal>--rel-segsize</literal> option
+        when a cluster is initialized with <application>initdb</application>.
+        By default this is 1GB.
        </para>
       </listitem>
      </varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 87945b4b62..18c4bfdaf8 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -457,6 +457,30 @@ PostgreSQL documentation
        </para>
       </listitem>
      </varlistentry>
+
+     <varlistentry id="app-initdb-option-rel-segsize">
+      <term><option>--rel-segsize=<replaceable>size</replaceable></option></term>
+      <listitem>
+       <para>
+        Set the maximum size of relation segment files.  The size must have a suffix
+        <literal>kB</literal>, <literal>MB</literal>, <literal>GB</literal> or
+        <literal>TB</literal>.  The default size is 1GB, which was chosen to
+        support large relations on operating systems without large file support.
+        This option can only be set during initialization, and cannot be
+        changed later.
+       </para>
+
+       <para>
+        Setting this to a value higher than the default reduces the
+        number of file descriptors that must be managed while accessing very large
+        tables.  Note that values higher than the file system can support may
+        result in errors while trying to extend a table (for example Linux ext4
+        limits files to 16TB), and values above 2GB are not supported on
+        operating systems without a large <literal>off_t</literal> data type
+        (notably Windows).
+       </para>
+      </listitem>
+     </varlistentry>
     </variablelist>
    </para>
 
diff --git a/meson.build b/meson.build
index 16b2e86646..e8c6e16e7a 100644
--- a/meson.build
+++ b/meson.build
@@ -430,16 +430,6 @@ cdata.set('USE_ASSERT_CHECKING', get_option('cassert') ? 1 : false)
 
 blocksize = get_option('blocksize').to_int() * 1024
 
-if get_option('segsize_blocks') != 0
-  if get_option('segsize') != 1
-    warning('both segsize and segsize_blocks specified, segsize_blocks wins')
-  endif
-
-  segsize = get_option('segsize_blocks')
-else
-  segsize = (get_option('segsize') * 1024 * 1024 * 1024) / blocksize
-endif
-
 cdata.set('BLCKSZ', blocksize, description:
 '''Size of a disk block --- this also limits the size of a tuple. You can set
    it bigger if you need bigger tuples (although TOAST should reduce the need
@@ -450,7 +440,6 @@ cdata.set('BLCKSZ', blocksize, description:
    Changing BLCKSZ requires an initdb.''')
 
 cdata.set('XLOG_BLCKSZ', get_option('wal_blocksize').to_int() * 1024)
-cdata.set('RELSEG_SIZE', segsize)
 cdata.set('DEF_PGPORT', get_option('pgport'))
 cdata.set_quoted('DEF_PGPORT_STR', get_option('pgport').to_string())
 cdata.set_quoted('PG_KRB_SRVNAM', get_option('krb_srvnam'))
@@ -3302,9 +3291,6 @@ if meson.version().version_compare('>=0.57')
     {
       'data block size': '@0@ kB'.format(cdata.get('BLCKSZ') / 1024),
       'WAL block size': '@0@ kB'.format(cdata.get('XLOG_BLCKSZ') / 1024),
-      'segment size': get_option('segsize_blocks') != 0 ?
-        '@0@ blocks'.format(cdata.get('RELSEG_SIZE')) :
-        '@0@ GB'.format(get_option('segsize')),
     },
     section: 'Data layout',
   )
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b2430f617c..f441a9051d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3901,7 +3901,7 @@ WriteControlFile(void)
 	ControlFile->floatFormat = FLOATFORMAT_VALUE;
 
 	ControlFile->blcksz = BLCKSZ;
-	ControlFile->relseg_size = RELSEG_SIZE;
+	ControlFile->relseg_size = rel_segment_size;
 	ControlFile->xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile->xlog_seg_size = wal_segment_size;
 
@@ -4071,13 +4071,6 @@ ReadControlFile(void)
 						   " but the server was compiled with BLCKSZ %d.",
 						   ControlFile->blcksz, BLCKSZ),
 				 errhint("It looks like you need to recompile or initdb.")));
-	if (ControlFile->relseg_size != RELSEG_SIZE)
-		ereport(FATAL,
-				(errmsg("database files are incompatible with server"),
-				 errdetail("The database cluster was initialized with RELSEG_SIZE %d,"
-						   " but the server was compiled with RELSEG_SIZE %d.",
-						   ControlFile->relseg_size, RELSEG_SIZE),
-				 errhint("It looks like you need to recompile or initdb.")));
 	if (ControlFile->xlog_blcksz != XLOG_BLCKSZ)
 		ereport(FATAL,
 				(errmsg("database files are incompatible with server"),
@@ -4158,6 +4151,8 @@ ReadControlFile(void)
 
 	CalculateCheckpointSegments();
 
+	rel_segment_size = ControlFile->relseg_size;
+
 	/* Make the initdb settings visible as GUC variables, too */
 	SetConfigOption("data_checksums", DataChecksumsEnabled() ? "yes" : "no",
 					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..d684ce192d 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -40,6 +40,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/reinit.h"
+#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
@@ -1594,7 +1595,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 				 */
 				if (!PageIsNew(page) && PageGetLSN(page) < sink->bbs_state->startptr)
 				{
-					checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
+					checksum = pg_checksum_page((char *) page, blkno + segmentno * rel_segment_size);
 					phdr = (PageHeader) page;
 					if (phdr->pd_checksum != checksum)
 					{
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 49e956b2c5..a90c4281c5 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -221,7 +221,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:R:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -279,6 +279,9 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
+			case 'R':
+				rel_segment_size = strtoi64(optarg, NULL, 0);
+				break;
 			case 'X':
 				{
 					int			WalSegSz = strtoul(optarg, NULL, 0);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 41ab64100e..7eaf0dc481 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -55,9 +55,9 @@
 #include "utils/resowner.h"
 
 /*
- * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
- * The reason is that we'd like large BufFiles to be spread across multiple
- * tablespaces when available.
+ * We break BufFiles into gigabyte-sized segments, regardless of
+ * rel_segment_size.  The reason is that we'd like large BufFiles to be spread
+ * across multiple tablespaces when available.
  */
 #define MAX_PHYSICAL_FILESIZE	0x40000000
 #define BUFFILE_SEG_SIZE		(MAX_PHYSICAL_FILESIZE / BLCKSZ)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 65bb22541c..47801548d4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -32,6 +32,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
+#include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/md.h"
@@ -45,15 +46,15 @@
  * The magnetic disk storage manager keeps track of open file
  * descriptors in its own descriptor pool.  This is done to make it
  * easier to support relations that are larger than the operating
- * system's file size limit (often 2GBytes).  In order to do that,
- * we break relations up into "segment" files that are each shorter than
- * the OS file size limit.  The segment size is set by the RELSEG_SIZE
- * configuration constant in pg_config.h.
+ * system's file size limit (historically 2GB, sometimes much larger but still
+ * smaller than the maximum possible relation size).  In order to do that, we
+ * break relations up into "segment" files of a user-specified size chosen at
+ * initdb time and accessed as rel_segment_size.
  *
  * On disk, a relation must consist of consecutively numbered segment
  * files in the pattern
- *	-- Zero or more full segments of exactly RELSEG_SIZE blocks each
- *	-- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
+ *	-- Zero or more full segments of exactly rel_segment_size blocks each
+ *	-- Exactly one partial segment of size 0 <= size < rel_segment_size blocks
  *	-- Optionally, any number of inactive segments of size 0 blocks.
  * The full and partial segments are collectively the "active" segments.
  * Inactive segments are those that once contained data but are currently
@@ -110,7 +111,7 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
 #define EXTENSION_CREATE_RECOVERY	(1 << 3)
 /*
  * Allow opening segments which are preceded by segments smaller than
- * RELSEG_SIZE, e.g. inactive segments (see above). Note that this breaks
+ * rel_segment_size, e.g. inactive segments (see above). Note that this breaks
  * mdnblocks() and related functionality henceforth - which currently is ok,
  * because this is only required in the checkpointer which never uses
  * mdnblocks().
@@ -142,6 +143,31 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
+/* Given a block number, which segment is it in? */
+static inline uint32
+blockno_to_segno(BlockNumber blockno)
+{
+	/* Because it's a power of two, we can use a shift instead of "/". */
+	Assert(pg_popcount64(rel_segment_size) == 1);
+	return (uint64) blockno >> pg_leftmost_one_pos64(rel_segment_size);
+}
+
+/* Given a block number, which block is that within its segment? */
+static inline BlockNumber
+blockno_within_segment(BlockNumber blockno)
+{
+	/* Because it's a power of two, we can use a mask instead of "%". */
+	Assert(pg_popcount64(rel_segment_size) == 1);
+	return blockno & (rel_segment_size - 1);
+}
+
+/* Given a block number, convert it to byte offset within a segment. */
+static inline off_t
+blockno_to_seekpos(BlockNumber blockno)
+{
+	return blockno_within_segment(blockno) * (off_t) BLCKSZ;
+}
+
 static inline int
 _mdfd_open_flags(void)
 {
@@ -487,9 +513,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = blockno_to_seekpos(blocknum);
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (off_t) BLCKSZ * rel_segment_size);
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
@@ -511,7 +537,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	if (!skipFsync && !SmgrIsTemp(reln))
 		register_dirty_segment(reln, forknum, v);
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(reln, forknum, v) <= rel_segment_size);
 }
 
 /*
@@ -549,19 +575,19 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 
 	while (remblocks > 0)
 	{
-		BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
-		off_t		seekpos = (off_t) BLCKSZ * segstartblock;
+		BlockNumber segstartblock = blockno_within_segment(blocknum);
+		off_t		seekpos = blockno_to_seekpos(blocknum);
 		int			numblocks;
 
-		if (segstartblock + remblocks > RELSEG_SIZE)
-			numblocks = RELSEG_SIZE - segstartblock;
+		if (segstartblock + remblocks > rel_segment_size)
+			numblocks = rel_segment_size - segstartblock;
 		else
 			numblocks = remblocks;
 
 		v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
 
-		Assert(segstartblock < RELSEG_SIZE);
-		Assert(segstartblock + numblocks <= RELSEG_SIZE);
+		Assert(segstartblock < rel_segment_size);
+		Assert(segstartblock + numblocks <= rel_segment_size);
 
 		/*
 		 * If available and useful, use posix_fallocate() (via FileAllocate())
@@ -615,7 +641,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		if (!skipFsync && !SmgrIsTemp(reln))
 			register_dirty_segment(reln, forknum, v);
 
-		Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+		Assert(_mdnblocks(reln, forknum, v) <= rel_segment_size);
 
 		remblocks -= numblocks;
 		curblocknum += numblocks;
@@ -667,7 +693,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
 
-	Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(reln, forknum, mdfd) <= rel_segment_size);
 
 	return mdfd;
 }
@@ -723,9 +749,9 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 	if (v == NULL)
 		return false;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = blockno_to_seekpos(blocknum);
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (off_t) BLCKSZ * rel_segment_size);
 
 	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
 #endif							/* USE_PREFETCH */
@@ -757,9 +783,9 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = blockno_to_seekpos(blocknum);
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (off_t) BLCKSZ * rel_segment_size);
 
 	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
 
@@ -831,9 +857,9 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = blockno_to_seekpos(blocknum);
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (off_t) BLCKSZ * rel_segment_size);
 
 	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
 
@@ -904,17 +930,17 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			return;
 
 		/* compute offset inside the current segment */
-		segnum_start = blocknum / RELSEG_SIZE;
+		segnum_start = blockno_to_segno(blocknum);
 
 		/* compute number of desired writes within the current segment */
-		segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE;
+		segnum_end = blockno_to_segno(blocknum + nblocks - 1);
 		if (segnum_start != segnum_end)
-			nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE));
+			nflush = rel_segment_size - blockno_within_segment(blocknum);
 
 		Assert(nflush >= 1);
 		Assert(nflush <= nblocks);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = blockno_to_seekpos(blocknum);
 
 		FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
 
@@ -945,8 +971,8 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 
 	/*
 	 * Start from the last open segments, to avoid redundant seeks.  We have
-	 * previously verified that these segments are exactly RELSEG_SIZE long,
-	 * and it's useless to recheck that each time.
+	 * previously verified that these segments are exactly rel_segment_size
+	 * long, and it's useless to recheck that each time.
 	 *
 	 * NOTE: this assumption could only be wrong if another backend has
 	 * truncated the relation.  We rely on higher code levels to handle that
@@ -962,13 +988,13 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	for (;;)
 	{
 		nblocks = _mdnblocks(reln, forknum, v);
-		if (nblocks > ((BlockNumber) RELSEG_SIZE))
+		if (nblocks > rel_segment_size)
 			elog(FATAL, "segment too big");
-		if (nblocks < ((BlockNumber) RELSEG_SIZE))
-			return (segno * ((BlockNumber) RELSEG_SIZE)) + nblocks;
+		if (nblocks < rel_segment_size)
+			return (segno * rel_segment_size) + nblocks;
 
 		/*
-		 * If segment is exactly RELSEG_SIZE, advance to next one.
+		 * If segment is exactly rel_segment_size, advance to next one.
 		 */
 		segno++;
 
@@ -981,7 +1007,7 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 		 */
 		v = _mdfd_openseg(reln, forknum, segno, 0);
 		if (v == NULL)
-			return segno * ((BlockNumber) RELSEG_SIZE);
+			return segno * rel_segment_size;
 	}
 }
 
@@ -1022,7 +1048,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	{
 		MdfdVec    *v;
 
-		priorblocks = (curopensegs - 1) * RELSEG_SIZE;
+		priorblocks = (curopensegs - 1) * rel_segment_size;
 
 		v = &reln->md_seg_fds[forknum][curopensegs - 1];
 
@@ -1047,13 +1073,13 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 			FileClose(v->mdfd_vfd);
 			_fdvec_resize(reln, forknum, curopensegs - 1);
 		}
-		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
+		else if (priorblocks + rel_segment_size > nblocks)
 		{
 			/*
 			 * This is the last segment we want to keep. Truncate the file to
 			 * the right length. NOTE: if nblocks is exactly a multiple K of
-			 * RELSEG_SIZE, we will truncate the K+1st segment to 0 length but
-			 * keep it. This adheres to the invariant given in the header
+			 * rel_setment_size, we will truncate the K+1st segment to 0 length
+			 * but keep it. This adheres to the invariant given in the header
 			 * comments.
 			 */
 			BlockNumber lastsegblocks = nblocks - priorblocks;
@@ -1369,7 +1395,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	v->mdfd_vfd = fd;
 	v->mdfd_segno = segno;
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(reln, forknum, v) <= rel_segment_size);
 
 	/* all done */
 	return v;
@@ -1396,7 +1422,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 		   (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL |
 			EXTENSION_DONT_OPEN));
 
-	targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
+	targetseg = blockno_to_segno(blkno);
 
 	/* if an existing and opened segment, we're done */
 	if (targetseg < reln->md_num_open_segs[forknum])
@@ -1433,7 +1459,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 		Assert(nextsegno == v->mdfd_segno + 1);
 
-		if (nblocks > ((BlockNumber) RELSEG_SIZE))
+		if (nblocks > rel_segment_size)
 			elog(FATAL, "segment too big");
 
 		if ((behavior & EXTENSION_CREATE) ||
@@ -1448,31 +1474,31 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 * ahead and create the segments so we can finish out the replay.
 			 *
 			 * We have to maintain the invariant that segments before the last
-			 * active segment are of size RELSEG_SIZE; therefore, if
+			 * active segment are of size rel_segment_size; therefore, if
 			 * extending, pad them out with zeroes if needed.  (This only
 			 * matters if in recovery, or if the caller is extending the
 			 * relation discontiguously, but that can happen in hash indexes.)
 			 */
-			if (nblocks < ((BlockNumber) RELSEG_SIZE))
+			if (nblocks < rel_segment_size)
 			{
 				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
 													 MCXT_ALLOC_ZERO);
 
 				mdextend(reln, forknum,
-						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
+						 nextsegno * rel_segment_size - 1,
 						 zerobuf, skipFsync);
 				pfree(zerobuf);
 			}
 			flags = O_CREAT;
 		}
 		else if (!(behavior & EXTENSION_DONT_CHECK_SIZE) &&
-				 nblocks < ((BlockNumber) RELSEG_SIZE))
+				 nblocks < rel_segment_size)
 		{
 			/*
 			 * When not extending (or explicitly including truncated
 			 * segments), only open the next segment if the current one is
-			 * exactly RELSEG_SIZE.  If not (this branch), either return NULL
-			 * or fail.
+			 * exactly rel_segment_size.  If not (this branch), either return
+			 * NULL or fail.
 			 */
 			if (behavior & EXTENSION_RETURN_NULL)
 			{
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f76c4605db..5cada9f130 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -24,10 +24,18 @@
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
+#include "utils/guc_tables.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
 
 
+/*
+ * The number of blocks that should be in a segment file.  Has a wider type
+ * than BlockNumber, so that it can represent the case where the whole
+ * relation fits in one file.
+ */
+int64		rel_segment_size;
+
 /*
  * This struct of function pointers defines the API between smgr.c and
  * any individual storage manager module.  Note that smgr subfunctions are
@@ -764,3 +772,9 @@ ProcessBarrierSmgrRelease(void)
 	smgrreleaseall();
 	return true;
 }
+
+const char *
+show_segment_size(void)
+{
+	return ShowGUCInt64WithUnits(rel_segment_size, GUC_UNIT_BLOCKS);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a9033b7a54..c9d6f732f8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -5273,6 +5273,22 @@ GetConfigOptionByName(const char *name, const char **varname, bool missing_ok)
 	return ShowGUCOption(record, true);
 }
 
+/*
+ * Show unit-based values with appropriate unit, as ShowGUCOption() would.
+ * This can be used by custom show hooks.
+ */
+char *
+ShowGUCInt64WithUnits(int64 value, int flags)
+{
+	int64		number;
+	const char *unit;
+	char		buffer[256];
+
+	convert_int_from_base_unit(value, flags & GUC_UNIT, &number, &unit);
+	snprintf(buffer, sizeof(buffer), INT64_FORMAT "%s", number, unit);
+	return pstrdup(buffer);
+}
+
 /*
  * ShowGUCOption: get string value of variable
  *
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 68aecad66f..3794b9dc15 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -586,10 +586,10 @@ static int	max_function_args;
 static int	max_index_keys;
 static int	max_identifier_length;
 static int	block_size;
-static int	segment_size;
 static int	shared_memory_size_mb;
 static int	shared_memory_size_in_huge_pages;
 static int	wal_block_size;
+static int	phony_segment_size;
 static bool data_checksums;
 static bool integer_datetimes;
 
@@ -3125,15 +3125,19 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * We used a phony GUC with a custome show function, because we don't
+	 * support GUCs with a wide enough type.
+	 */
 	{
 		{"segment_size", PGC_INTERNAL, PRESET_OPTIONS,
 			gettext_noop("Shows the number of pages per disk file."),
 			NULL,
 			GUC_UNIT_BLOCKS | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
 		},
-		&segment_size,
-		RELSEG_SIZE, RELSEG_SIZE, RELSEG_SIZE,
-		NULL, NULL, NULL
+		&phony_segment_size,
+		0, 0, 0,
+		NULL, NULL, show_segment_size
 	},
 
 	{
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 09a5c98cc0..ecb6950c35 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -80,6 +80,7 @@
 #include "getopt_long.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 
 
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
@@ -169,6 +170,8 @@ static bool data_checksums = false;
 static char *xlog_dir = NULL;
 static char *str_wal_segment_size_mb = NULL;
 static int	wal_segment_size_mb;
+static char *str_rel_segment_size = NULL;
+static int64 rel_segment_size;
 
 
 /* internal vars */
@@ -1535,9 +1538,10 @@ bootstrap_template1(void)
 	unsetenv("PGCLIENTENCODING");
 
 	snprintf(cmd, sizeof(cmd),
-			 "\"%s\" --boot -X %d %s %s %s %s",
+			 "\"%s\" --boot -X %d -R " INT64_FORMAT " %s %s %s %s",
 			 backend_exec,
 			 wal_segment_size_mb * (1024 * 1024),
+			 rel_segment_size,
 			 data_checksums ? "-k" : "",
 			 boot_options, extra_options,
 			 debug ? "-d 5" : "");
@@ -2481,6 +2485,7 @@ usage(const char *progname)
 	printf(_("  -W, --pwprompt            prompt for a password for the new superuser\n"));
 	printf(_("  -X, --waldir=WALDIR       location for the write-ahead log directory\n"));
 	printf(_("      --wal-segsize=SIZE    size of WAL segments, in megabytes\n"));
+	printf(_("      --rel-segsize=SIZE    size of relation segments\n"));
 	printf(_("\nLess commonly used options:\n"));
 	printf(_("  -c, --set NAME=VALUE      override default setting for server parameter\n"));
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
@@ -3129,6 +3134,7 @@ main(int argc, char *argv[])
 		{"locale-provider", required_argument, NULL, 15},
 		{"icu-locale", required_argument, NULL, 16},
 		{"icu-rules", required_argument, NULL, 17},
+		{"rel-segsize", required_argument, NULL, 18},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3309,6 +3315,9 @@ main(int argc, char *argv[])
 			case 17:
 				icu_rules = pg_strdup(optarg);
 				break;
+			case 18:
+				str_rel_segment_size = pg_strdup(optarg);
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3389,6 +3398,43 @@ main(int argc, char *argv[])
 			pg_fatal("argument of --wal-segsize must be a power of 2 between 1 and 1024");
 	}
 
+	/* set rel segment size */
+	if (str_rel_segment_size == NULL)
+	{
+		rel_segment_size = (1024 * 1024 * 1024) / BLCKSZ;
+	}
+	else
+	{
+		int64		bytes;
+		char	   *endptr;
+
+		bytes = strtol(str_rel_segment_size, &endptr, 10);
+		if (endptr == str_rel_segment_size)
+			pg_fatal("argument of --rel-segsize must begin with a number");
+		if (bytes == 0)
+			pg_fatal("argument of --rel-segsize must be greater than zero");
+
+		if (strcmp(endptr, "kB") == 0)
+			bytes *= 1024;
+		else if (strcmp(endptr, "MB") == 0)
+			bytes *= 1024 * 1024;
+		else if (strcmp(endptr, "GB") == 0)
+			bytes *= 1024 * 1024 * 1024;
+		else if (strcmp(endptr, "TB") == 0)
+			bytes *= UINT64CONST(1024) * 1024 * 1024 * 1024;
+		else
+			pg_fatal("argument of --rel-segsize must end with kB, MB, GB or TB");
+
+		if (bytes % BLCKSZ != 0)
+			pg_fatal("argument of --rel-segsize must be a multiple of BLCKSZ");
+		if (pg_popcount64(bytes) != 1)
+			pg_fatal("argument of --rel-segsize must be a power of two");
+		if (sizeof(off_t) < 8 && bytes > (INT64CONST(1) << 31))
+			pg_fatal("argument of --rel-segsize is too large for this platform's off_t");
+
+		rel_segment_size = bytes / BLCKSZ;
+	}
+
 	get_restricted_token();
 
 	setup_pgdata();
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 19eb67e485..8685f03bf2 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -231,7 +231,7 @@ scan_file(const char *fn, int segmentno)
 		if (PageIsNew(buf.data))
 			continue;
 
-		csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
+		csum = pg_checksum_page(buf.data, blockno + segmentno * ControlFile->relseg_size);
 		if (mode == PG_MODE_CHECK)
 		{
 			if (csum != header->pd_checksum)
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index c390ec51ce..ff5aaf43ff 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -305,7 +305,7 @@ main(int argc, char *argv[])
 	/* we don't print floatFormat since can't say much useful about it */
 	printf(_("Database block size:                  %u\n"),
 		   ControlFile->blcksz);
-	printf(_("Blocks per segment of large relation: %u\n"),
+	printf(_("Blocks per segment of large relation: " INT64_FORMAT "\n"),
 		   ControlFile->relseg_size);
 	printf(_("WAL block size:                       %u\n"),
 		   ControlFile->xlog_blcksz);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e7ef2b8bd0..2dcd886371 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -683,7 +683,7 @@ GuessControlValues(void)
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
 	ControlFile.floatFormat = FLOATFORMAT_VALUE;
 	ControlFile.blcksz = BLCKSZ;
-	ControlFile.relseg_size = RELSEG_SIZE;
+	ControlFile.relseg_size = (1024 * 1024 * 1024) / BLCKSZ;
 	ControlFile.xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile.xlog_seg_size = DEFAULT_XLOG_SEG_SIZE;
 	ControlFile.nameDataLen = NAMEDATALEN;
@@ -751,7 +751,7 @@ PrintControlValues(bool guessed)
 	/* we don't print floatFormat since can't say much useful about it */
 	printf(_("Database block size:                  %u\n"),
 		   ControlFile.blcksz);
-	printf(_("Blocks per segment of large relation: %u\n"),
+	printf(_("Blocks per segment of large relation: " INT64_FORMAT "\n"),
 		   ControlFile.relseg_size);
 	printf(_("WAL block size:                       %u\n"),
 		   ControlFile.xlog_blcksz);
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index bd5c598e20..693ee195ed 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -296,8 +296,8 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
 	BlockNumber blkno_inseg;
 	int			segno;
 
-	segno = blkno / RELSEG_SIZE;
-	blkno_inseg = blkno % RELSEG_SIZE;
+	segno = blkno / rel_segment_size;
+	blkno_inseg = blkno % rel_segment_size;
 
 	path = datasegpath(rlocator, forknum, segno);
 	entry = lookup_filehash_entry(path);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index f7f3b8227f..f3db47ca04 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -61,6 +61,7 @@ static ControlFileData ControlFile_source_after;
 
 const char *progname;
 int			WalSegSz;
+int64		rel_segment_size;
 
 /* Configuration options */
 char	   *datadir_target = NULL;
@@ -1028,6 +1029,8 @@ digestControlFile(ControlFileData *ControlFile, const char *content,
 						  WalSegSz),
 				 WalSegSz);
 
+	rel_segment_size = ControlFile->relseg_size;
+
 	/* Additional checks on control file */
 	checkControlFile(ControlFile);
 }
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index ef8bdc1fbb..04e84f393b 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -24,6 +24,7 @@ extern bool showprogress;
 extern bool dry_run;
 extern bool do_sync;
 extern int	WalSegSz;
+extern int64 rel_segment_size;
 
 /* Target history */
 extern TimeLineHistoryEntry *targetHistory;
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 34bc9c1504..f0760ed522 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -180,7 +180,7 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 
 	/*
 	 * Now copy/link any related segments as well. Remember, PG breaks large
-	 * files into 1GB segments, the first segment has no extension, subsequent
+	 * files into segments, the first segment has no extension, subsequent
 	 * segments are named relfilenumber.1, relfilenumber.2, relfilenumber.3.
 	 */
 	for (segno = 0;; segno++)
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..6a66494c2e 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -203,7 +203,7 @@ typedef struct ControlFileData
 	 * compatible with the backend executable.
 	 */
 	uint32		blcksz;			/* data block size for this DB */
-	uint32		relseg_size;	/* blocks per segment of large relation */
+	int64		relseg_size;	/* blocks per segment of large relation */
 
 	uint32		xlog_blcksz;	/* block size within WAL files */
 	uint32		xlog_seg_size;	/* size of each WAL segment */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 6d572c3820..8ec9cc9b9f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -659,19 +659,6 @@
    your system. */
 #undef PTHREAD_CREATE_JOINABLE
 
-/* RELSEG_SIZE is the maximum number of blocks allowed in one disk file. Thus,
-   the maximum size of a single file is RELSEG_SIZE * BLCKSZ; relations bigger
-   than that are divided into multiple files. RELSEG_SIZE * BLCKSZ must be
-   less than your OS' limit on file size. This is often 2 GB or 4GB in a
-   32-bit operating system, unless you have large file support enabled. By
-   default, we make the limit 1 GB to avoid any possible integer-overflow
-   problems within the OS. A limit smaller than necessary only means we divide
-   a large relation into more chunks than necessary, so it seems best to err
-   in the direction of a small limit. A power-of-2 value is recommended to
-   save a few cycles in md.c, but is not absolutely required. Changing
-   RELSEG_SIZE requires an initdb. */
-#undef RELSEG_SIZE
-
 /* The size of `bool', as computed by sizeof. */
 #undef SIZEOF_BOOL
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..7a02a13e14 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,8 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 
+extern int64 rel_segment_size;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -109,5 +111,6 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
+extern const char *show_segment_size(void);
 
 #endif							/* SMGR_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..9514f6c1a5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -291,6 +291,7 @@ extern struct config_generic **get_explain_guc_options(int *num);
 
 /* get string value of variable */
 extern char *ShowGUCOption(struct config_generic *record, bool use_units);
+extern char *ShowGUCInt64WithUnits(int64 value, int flags);
 
 /* get whether or not the GUC variable is visible to current user */
 extern bool ConfigOptionIsVisible(struct config_generic *conf);
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index b6d31c3583..909de3bb9a 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -415,8 +415,6 @@ sub GenerateFiles
 		  qq{"PostgreSQL $package_version$extraver, compiled by Visual C++ build " CppAsString2(_MSC_VER) ", $bits-bit"},
 		PROFILE_PID_DIR => undef,
 		PTHREAD_CREATE_JOINABLE => undef,
-		RELSEG_SIZE => (1024 / $self->{options}->{blocksize}) *
-		  $self->{options}->{segsize} * 1024,
 		SIZEOF_BOOL => 1,
 		SIZEOF_LONG => 4,
 		SIZEOF_OFF_T => undef,
-- 
2.39.2

#22Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#21)
Re: Large files for relations

On Sun, May 28, 2023 at 2:48 AM Thomas Munro <thomas.munro@gmail.com> wrote:

(you'd need over 2 billion
directories ...

directory *entries* (segment files), I meant to write there.

#23Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Thomas Munro (#21)
Re: Large files for relations

On 28.05.23 02:48, Thomas Munro wrote:

Another potential option name would be --segsize, if we think we're
going to use this for temp files too eventually.

Maybe it's not so beautiful to have that global variable
rel_segment_size (which replaces REL_SEGSIZE everywhere). Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that.

I think one way to look at this is that the segment size is a
configuration property of the md.c smgr. I have been thinking a bit
about how smgr-level configuration could look. You can't use a catalog
table, but we also can't have smgr plugins get space in pg_control.

Anyway, I'm not asking you to design this now. A global variable via
pg_control seems fine for now. But it wouldn't be an smgr API call, I
think.

#24David Steele
david@pgmasters.net
In reply to: Thomas Munro (#21)
Re: Large files for relations

On 5/28/23 08:48, Thomas Munro wrote:

Alright, since I had some time to kill in an airport, here is a
starter patch for initdb --rel-segsize.

I've gone through this patch and it looks pretty good to me. A few things:

+ * rel_setment_size, we will truncate the K+1st segment to 0 length

rel_setment_size -> rel_segment_size

+ * We used a phony GUC with a custome show function, because we don't

custome -> custom

+ if (strcmp(endptr, "kB") == 0)

Why kB here instead of KB to match MB, GB, TB below?

+ int64 relseg_size; /* blocks per segment of large relation */

This will require PG_CONTROL_VERSION to be bumped -- but you are
probably waiting until commit time to avoid annoying conflicts, though I
don't think it is as likely as with CATALOG_VERSION_NO.

Some random thoughts:

Another potential option name would be --segsize, if we think we're
going to use this for temp files too eventually.

I feel like temp file segsize should be separately configurable for the
same reason that we are leaving it as 1GB for now.

Maybe it's not so beautiful to have that global variable
rel_segment_size (which replaces REL_SEGSIZE everywhere).

Maybe not, but it is the way these things are done in general, e.g.
wal_segment_size, so I don't think it will be too controversial.

Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that. That could be a nice place to compute the
"shift" value up front, instead of computing it each time in
blockno_to_segno(), but that's probably not worth bothering with (?).
BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
about the only place where someone could say that this change makes
things worse for people not interested in the new feature, so I was
careful to get rid of / and % operations with no-longer-constant RHS.

Right -- not sure we should be troubling ourselves with trying to
optimize away ops that are very fast, unless they are computed trillions
of times.

I had to promote segment size to int64 (global variable, field in
control file), because otherwise it couldn't represent
--rel-segsize=32TB (it'd be too big by one). Other ideas would be to
store the shift value instead of the size, or store the max block
number, eg subtract one, or use InvalidBlockNumber to mean "no limit"
(with more branches to test for it). The only problem I ran into with
the larger type was that 'SHOW segment_size' now needs a custom show
function because we don't have int64 GUCs.

A custom show function seems like a reasonable solution here.

A C type confusion problem that I noticed: some code uses BlockNumber
and some code uses int for segment numbers. It's not really a
reachable problem for practical reasons (you'd need over 2 billion
directories and VFDs to reach it), but it's wrong to use int if
segment size can be set as low as BLCKSZ (one file per block); you
could have more segments than an int can represent. We could go for
uint32, BlockNumber or create SegmentNumber (which I think I've
proposed before, and lost track of...). We can address that
separately (perhaps by finding my old patch...)

I think addressing this separately is fine, though maybe enforcing some
reasonable minimum in initdb would be a good idea for this patch. For my
2c SEGSIZE == BLOCKSZ just makes very little sense.
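
Something along these lines, say (purely hypothetical sketch; the 1MB
floor is invented here, and the power-of-two and multiple-of-BLCKSZ
checks are already in the posted patch):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: is a proposed --rel-segsize value (in bytes)
 * acceptable?  Must be a multiple of the block size, a power of two,
 * and not absurdly small.
 */
static bool
rel_segsize_is_reasonable(uint64_t bytes, uint64_t blcksz)
{
	if (bytes < UINT64_C(1024) * 1024)	/* made-up 1MB floor */
		return false;
	if (bytes % blcksz != 0)
		return false;
	return (bytes & (bytes - 1)) == 0;	/* power of two */
}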

Lastly, I think the blockno_to_segno(), blockno_within_segment(), and
blockno_to_seekpos() functions add enough readability that they should
be committed regardless of how this patch proceeds.

Regards,
-David

#25Thomas Munro
thomas.munro@gmail.com
In reply to: David Steele (#24)
Re: Large files for relations

On Mon, Jun 12, 2023 at 8:53 PM David Steele <david@pgmasters.net> wrote:

+ if (strcmp(endptr, "kB") == 0)

Why kB here instead of KB to match MB, GB, TB below?

Those are SI prefixes[1], and we use kB elsewhere too. ("K" was used
for kelvins, so they went with "k" for kilo. Obviously these aren't
fully SI, because B is supposed to mean bel. A gigabel would be
pretty loud... more than "sufficient power to create a black hole"[2],
hehe.)

+ int64 relseg_size; /* blocks per segment of large relation */

This will require PG_CONTROL_VERSION to be bumped -- but you are
probably waiting until commit time to avoid annoying conflicts, though I
don't think it is as likely as with CATALOG_VERSION_NO.

Oh yeah, thanks.

Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that. That could be a nice place to compute the
"shift" value up front, instead of computing it each time in
blockno_to_segno(), but that's probably not worth bothering with (?).
BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
about the only place where someone could say that this change makes
things worse for people not interested in the new feature, so I was
careful to get rid of / and % operations with no-longer-constant RHS.

Right -- not sure we should be troubling ourselves with trying to
optimize away ops that are very fast, unless they are computed trillions
of times.

This obviously has some things in common with David Christensen's
nearby patch for block sizes[3], and we should be shifting and masking
there too if that route is taken (as opposed to a specialise-the-code
route or something else). My binary-log trick is probably a little
too cute though... I should probably just go and set a shift variable.

Thanks for looking!

[1]: https://en.wikipedia.org/wiki/Metric_prefix
[2]: https://en.wiktionary.org/wiki/gigabel
[3]: /messages/by-id/CAOxo6XKx7DyDgBkWwPfnGSXQYNLpNrSWtYnK6-1u+QHUwRa1Gg@mail.gmail.com

#26Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#25)
1 attachment(s)
Re: Large files for relations

Rebased. I had intended to try to get this into v17, but a couple of
unresolved problems came up while rebasing over the new incremental
backup stuff. You snooze, you lose. Hopefully we can sort these out
in time for the next commitfest:

* should pg_combinebackup read the control file to fetch the segment size? (see the sketch below)
* hunt for other segment-size related problems that may be lurking in
new incremental backup stuff
* basebackup_incremental.c wants to use memory in proportion to
segment size, which looks like a problem, and I wrote about that in a
new thread[1]/messages/by-id/CA+hUKG+2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5+g@mail.gmail.com

[1]: /messages/by-id/CA+hUKG+2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5+g@mail.gmail.com
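
On the first point, reconstruct.c in the attached patch still hard-codes
the segment size (see the XXX comment there), so presumably it would end
up doing something like the sketch below -- with the caveat that whether
the backup's control file is the right source of truth is exactly the
open question:

    #include "common/controldata_utils.h"

    /* Sketch: derive the segment size from a backup's control file. */
    static int64
    read_rel_segment_size(const char *backup_dir)
    {
        bool        crc_ok;
        ControlFileData *control = get_controlfile(backup_dir, &crc_ok);

        if (!crc_ok)
            pg_fatal("control file in \"%s\" has an invalid checksum",
                     backup_dir);

        return control->relseg_size;
    }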

Attachments:

v3-0001-Allow-relation-segment-size-to-be-set-by-initdb.patchapplication/x-patch; name=v3-0001-Allow-relation-segment-size-to-be-set-by-initdb.patchDownload
From 85678257fef94aa3ca3efb39ce55fb66df7c889e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 26 May 2023 01:41:11 +1200
Subject: [PATCH v3] Allow relation segment size to be set by initdb.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, relation segment size was a rarely modified compile time
option.  Make it an initdb option, so that users with very large tables
can avoid using so many files and file descriptors.

The initdb option --rel-segsize is modeled on the existing --wal-segsize
option.

The data type used to store the size is int64, not BlockNumber, because
it seems reasonable to want to be able to say --rel-segsize=32TB (=
don't use segments at all), but that would overflow uint32.

It should be fairly straightforward to teach pg_upgrade (or some new
dedicated tool) to convert an existing cluster to a new segment size,
but that is not done yet, so for now this is only useful for entirely
new clusters.

The default behavior is unchanged: 1GB segments.  On Windows, we can't
go above 2GB for now (we'd have to make a lot of changes due to
Windows' small off_t).

XXX work remains to be done for incremental backups

Reviewed-by: David Steele <david@pgmasters.net>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Stephen Frost <sfrost@snowman.net>
Reviewed-by: Jim Mlodgenski <jimmy76@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com
---
 configure                                   |  91 --------------
 configure.ac                                |  55 ---------
 doc/src/sgml/config.sgml                    |   7 +-
 doc/src/sgml/ref/initdb.sgml                |  24 ++++
 meson.build                                 |  14 ---
 src/backend/access/transam/xlog.c           |  11 +-
 src/backend/backup/basebackup.c             |   7 +-
 src/backend/backup/basebackup_incremental.c |  31 +++--
 src/backend/bootstrap/bootstrap.c           |   5 +-
 src/backend/storage/file/buffile.c          |   6 +-
 src/backend/storage/smgr/md.c               | 128 ++++++++++++--------
 src/backend/storage/smgr/smgr.c             |  14 +++
 src/backend/utils/misc/guc.c                |  16 +++
 src/backend/utils/misc/guc_tables.c         |  12 +-
 src/bin/initdb/initdb.c                     |  47 ++++++-
 src/bin/pg_checksums/pg_checksums.c         |   2 +-
 src/bin/pg_combinebackup/reconstruct.c      |  18 ++-
 src/bin/pg_controldata/pg_controldata.c     |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c           |   4 +-
 src/bin/pg_rewind/filemap.c                 |   4 +-
 src/bin/pg_rewind/pg_rewind.c               |   3 +
 src/bin/pg_rewind/pg_rewind.h               |   1 +
 src/bin/pg_upgrade/relfilenumber.c          |   2 +-
 src/include/catalog/pg_control.h            |   2 +-
 src/include/pg_config.h.in                  |  13 --
 src/include/storage/smgr.h                  |   3 +
 src/include/utils/guc_tables.h              |   1 +
 27 files changed, 249 insertions(+), 274 deletions(-)

diff --git a/configure b/configure
index 36feeafbb23..49a7f0f2c4a 100755
--- a/configure
+++ b/configure
@@ -842,8 +842,6 @@ enable_dtrace
 enable_tap_tests
 enable_injection_points
 with_blocksize
-with_segsize
-with_segsize_blocks
 with_wal_blocksize
 with_llvm
 enable_depend
@@ -1551,9 +1549,6 @@ Optional Packages:
   --with-pgport=PORTNUM   set default port number [5432]
   --with-blocksize=BLOCKSIZE
                           set table block size in kB [8]
-  --with-segsize=SEGSIZE  set table segment size in GB [1]
-  --with-segsize-blocks=SEGSIZE_BLOCKS
-                          set table segment size in blocks [0]
   --with-wal-blocksize=BLOCKSIZE
                           set WAL block size in kB [8]
   --with-llvm             build with LLVM based JIT support
@@ -3759,85 +3754,6 @@ cat >>confdefs.h <<_ACEOF
 _ACEOF
 
 
-#
-# Relation segment size
-#
-
-
-
-# Check whether --with-segsize was given.
-if test "${with_segsize+set}" = set; then :
-  withval=$with_segsize;
-  case $withval in
-    yes)
-      as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
-      ;;
-    no)
-      as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
-      ;;
-    *)
-      segsize=$withval
-      ;;
-  esac
-
-else
-  segsize=1
-fi
-
-
-
-
-
-# Check whether --with-segsize-blocks was given.
-if test "${with_segsize_blocks+set}" = set; then :
-  withval=$with_segsize_blocks;
-  case $withval in
-    yes)
-      as_fn_error $? "argument required for --with-segsize-blocks option" "$LINENO" 5
-      ;;
-    no)
-      as_fn_error $? "argument required for --with-segsize-blocks option" "$LINENO" 5
-      ;;
-    *)
-      segsize_blocks=$withval
-      ;;
-  esac
-
-else
-  segsize_blocks=0
-fi
-
-
-
-# If --with-segsize-blocks is non-zero, it is used, --with-segsize
-# otherwise. segsize-blocks is only really useful for developers wanting to
-# test segment related code. Warn if both are used.
-if test $segsize_blocks -ne 0 -a $segsize -ne 1; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: both --with-segsize and --with-segsize-blocks specified, --with-segsize-blocks wins" >&5
-$as_echo "$as_me: WARNING: both --with-segsize and --with-segsize-blocks specified, --with-segsize-blocks wins" >&2;}
-fi
-
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for segment size" >&5
-$as_echo_n "checking for segment size... " >&6; }
-if test $segsize_blocks -eq 0; then
-  # this expression is set up to avoid unnecessary integer overflow
-  # blocksize is already guaranteed to be a factor of 1024
-  RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
-  test $? -eq 0 || exit 1
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: ${segsize}GB" >&5
-$as_echo "${segsize}GB" >&6; }
-else
-  RELSEG_SIZE=$segsize_blocks
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: ${RELSEG_SIZE} blocks" >&5
-$as_echo "${RELSEG_SIZE} blocks" >&6; }
-fi
-
-
-cat >>confdefs.h <<_ACEOF
-#define RELSEG_SIZE ${RELSEG_SIZE}
-_ACEOF
-
-
 #
 # WAL block size
 #
@@ -15107,13 +15023,6 @@ _ACEOF
 
 
 
-# If we don't have largefile support, can't handle segment size >= 2GB.
-if test "$ac_cv_sizeof_off_t" -lt 8; then
-  if expr $RELSEG_SIZE '*' $blocksize '>=' 2 '*' 1024 '*' 1024; then
-    as_fn_error $? "Large file support is not enabled. Segment size cannot be larger than 1GB." "$LINENO" 5
-  fi
-fi
-
 # The cast to long int works around a bug in the HP C Compiler
 # version HP92453-01 B.11.11.23709.GP, which incorrectly rejects
 # declarations like `int a3[[(sizeof (unsigned char)) >= 0]];'.
diff --git a/configure.ac b/configure.ac
index 57f734879e1..a04716aebf5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -288,54 +288,6 @@ AC_DEFINE_UNQUOTED([BLCKSZ], ${BLCKSZ}, [
  Changing BLCKSZ requires an initdb.
 ])
 
-#
-# Relation segment size
-#
-PGAC_ARG_REQ(with, segsize, [SEGSIZE], [set table segment size in GB [1]],
-             [segsize=$withval],
-             [segsize=1])
-PGAC_ARG_REQ(with, segsize-blocks, [SEGSIZE_BLOCKS], [set table segment size in blocks [0]],
-             [segsize_blocks=$withval],
-             [segsize_blocks=0])
-
-# If --with-segsize-blocks is non-zero, it is used, --with-segsize
-# otherwise. segsize-blocks is only really useful for developers wanting to
-# test segment related code. Warn if both are used.
-if test $segsize_blocks -ne 0 -a $segsize -ne 1; then
-  AC_MSG_WARN([both --with-segsize and --with-segsize-blocks specified, --with-segsize-blocks wins])
-fi
-
-AC_MSG_CHECKING([for segment size])
-if test $segsize_blocks -eq 0; then
-  # this expression is set up to avoid unnecessary integer overflow
-  # blocksize is already guaranteed to be a factor of 1024
-  RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
-  test $? -eq 0 || exit 1
-  AC_MSG_RESULT([${segsize}GB])
-else
-  RELSEG_SIZE=$segsize_blocks
-  AC_MSG_RESULT([${RELSEG_SIZE} blocks])
-fi
-
-AC_DEFINE_UNQUOTED([RELSEG_SIZE], ${RELSEG_SIZE}, [
- RELSEG_SIZE is the maximum number of blocks allowed in one disk file.
- Thus, the maximum size of a single file is RELSEG_SIZE * BLCKSZ;
- relations bigger than that are divided into multiple files.
-
- RELSEG_SIZE * BLCKSZ must be less than your OS' limit on file size.
- This is often 2 GB or 4GB in a 32-bit operating system, unless you
- have large file support enabled.  By default, we make the limit 1 GB
- to avoid any possible integer-overflow problems within the OS.
- A limit smaller than necessary only means we divide a large
- relation into more chunks than necessary, so it seems best to err
- in the direction of a small limit.
-
- A power-of-2 value is recommended to save a few cycles in md.c,
- but is not absolutely required.
-
- Changing RELSEG_SIZE requires an initdb.
-])
-
 #
 # WAL block size
 #
@@ -1712,13 +1664,6 @@ fi
 dnl Check for largefile support (must be after AC_SYS_LARGEFILE)
 AC_CHECK_SIZEOF([off_t])
 
-# If we don't have largefile support, can't handle segment size >= 2GB.
-if test "$ac_cv_sizeof_off_t" -lt 8; then
-  if expr $RELSEG_SIZE '*' $blocksize '>=' 2 '*' 1024 '*' 1024; then
-    AC_MSG_ERROR([Large file support is not enabled. Segment size cannot be larger than 1GB.])
-  fi
-fi
-
 AC_CHECK_SIZEOF([bool], [],
 [#ifdef HAVE_STDBOOL_H
 #include <stdbool.h>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b38cbd714aa..e7638e3d3f4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11040,10 +11040,9 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       <listitem>
        <para>
         Reports the number of blocks (pages) that can be stored within a file
-        segment.  It is determined by the value of <literal>RELSEG_SIZE</literal>
-        when building the server.  The maximum size of a segment file in bytes
-        is equal to <varname>segment_size</varname> multiplied by
-        <varname>block_size</varname>; by default this is 1GB.
+        segment.  It can be changed with the <literal>--rel-segsize</literal> option
+        when a cluster is initialized with <application>initdb</application>.
+        By default this is 1GB.
        </para>
       </listitem>
      </varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index cd75cae10e2..db1ed95694c 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -470,6 +470,30 @@ PostgreSQL documentation
        </para>
       </listitem>
      </varlistentry>
+
+     <varlistentry id="app-initdb-option-rel-segsize">
+      <term><option>--rel-segsize=<replaceable>size</replaceable></option></term>
+      <listitem>
+       <para>
+        Set the maximum size of relation segment files.  The size must have a suffix
+        <literal>kB</literal>, <literal>MB</literal>, <literal>GB</literal> or
+        <literal>TB</literal>.  The default size is 1GB, which was chosen to
+        support large relations on operating systems without large file support.
+        This option can only be set during initialization, and cannot be
+        changed later.
+       </para>
+
+       <para>
+        Setting this to a value higher than the default reduces the
+        number of file descriptors that must be managed while accessing very large
+        tables.  Note that values higher than the file system can support may
+        result in errors while trying to extend a table (for example Linux ext4
+        limits files to 16TB), and values above 2GB are not supported on
+        operating systems without a large <literal>off_t</literal> data type
+        (currently Windows).
+       </para>
+      </listitem>
+     </varlistentry>
     </variablelist>
    </para>
 
diff --git a/meson.build b/meson.build
index 85788f9dd8f..551b46a9831 100644
--- a/meson.build
+++ b/meson.build
@@ -420,16 +420,6 @@ cdata.set('USE_INJECTION_POINTS', get_option('injection_points') ? 1 : false)
 
 blocksize = get_option('blocksize').to_int() * 1024
 
-if get_option('segsize_blocks') != 0
-  if get_option('segsize') != 1
-    warning('both segsize and segsize_blocks specified, segsize_blocks wins')
-  endif
-
-  segsize = get_option('segsize_blocks')
-else
-  segsize = (get_option('segsize') * 1024 * 1024 * 1024) / blocksize
-endif
-
 cdata.set('BLCKSZ', blocksize, description:
 '''Size of a disk block --- this also limits the size of a tuple. You can set
    it bigger if you need bigger tuples (although TOAST should reduce the need
@@ -440,7 +430,6 @@ cdata.set('BLCKSZ', blocksize, description:
    Changing BLCKSZ requires an initdb.''')
 
 cdata.set('XLOG_BLCKSZ', get_option('wal_blocksize').to_int() * 1024)
-cdata.set('RELSEG_SIZE', segsize)
 cdata.set('DEF_PGPORT', get_option('pgport'))
 cdata.set_quoted('DEF_PGPORT_STR', get_option('pgport').to_string())
 cdata.set_quoted('PG_KRB_SRVNAM', get_option('krb_srvnam'))
@@ -3359,9 +3348,6 @@ if meson.version().version_compare('>=0.57')
     {
       'data block size': '@0@ kB'.format(cdata.get('BLCKSZ') / 1024),
       'WAL block size': '@0@ kB'.format(cdata.get('XLOG_BLCKSZ') / 1024),
-      'segment size': get_option('segsize_blocks') != 0 ?
-        '@0@ blocks'.format(cdata.get('RELSEG_SIZE')) :
-        '@0@ GB'.format(get_option('segsize')),
     },
     section: 'Data layout',
   )
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 20a5f862090..1c705c3469a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4178,7 +4178,7 @@ WriteControlFile(void)
 	ControlFile->floatFormat = FLOATFORMAT_VALUE;
 
 	ControlFile->blcksz = BLCKSZ;
-	ControlFile->relseg_size = RELSEG_SIZE;
+	ControlFile->relseg_size = rel_segment_size;
 	ControlFile->xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile->xlog_seg_size = wal_segment_size;
 
@@ -4348,13 +4348,6 @@ ReadControlFile(void)
 						   " but the server was compiled with BLCKSZ %d.",
 						   ControlFile->blcksz, BLCKSZ),
 				 errhint("It looks like you need to recompile or initdb.")));
-	if (ControlFile->relseg_size != RELSEG_SIZE)
-		ereport(FATAL,
-				(errmsg("database files are incompatible with server"),
-				 errdetail("The database cluster was initialized with RELSEG_SIZE %d,"
-						   " but the server was compiled with RELSEG_SIZE %d.",
-						   ControlFile->relseg_size, RELSEG_SIZE),
-				 errhint("It looks like you need to recompile or initdb.")));
 	if (ControlFile->xlog_blcksz != XLOG_BLCKSZ)
 		ereport(FATAL,
 				(errmsg("database files are incompatible with server"),
@@ -4436,6 +4429,8 @@ ReadControlFile(void)
 
 	CalculateCheckpointSegments();
 
+	rel_segment_size = ControlFile->relseg_size;
+
 	/* Make the initdb settings visible as GUC variables, too */
 	SetConfigOption("data_checksums", DataChecksumsEnabled() ? "yes" : "no",
 					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 5fbbe5ffd20..87e57ec2352 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -43,6 +43,7 @@
 #include "storage/dsm_impl.h"
 #include "storage/ipc.h"
 #include "storage/reinit.h"
+#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
@@ -1206,7 +1207,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 	 * But we don't need it at all if this is not an incremental backup.
 	 */
 	if (ib != NULL)
-		relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
+		relative_block_numbers = palloc(sizeof(BlockNumber) * rel_segment_size);
 
 	/*
 	 * Determine if the current path is a database directory that can contain
@@ -1682,7 +1683,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 			 */
 			cnt = read_file_data_into_buffer(sink, readfilename, fd,
 											 bytes_done, remaining,
-											 blkno + segno * RELSEG_SIZE,
+											 blkno + segno * rel_segment_size,
 											 verify_checksum,
 											 &checksum_failures);
 		}
@@ -1704,7 +1705,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 			cnt = read_file_data_into_buffer(sink, readfilename, fd,
 											 relative_blkno * BLCKSZ,
 											 BLCKSZ,
-											 relative_blkno + segno * RELSEG_SIZE,
+											 relative_blkno + segno * rel_segment_size,
 											 verify_checksum,
 											 &checksum_failures);
 
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index ebc41f28be5..274964e63a1 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -28,6 +28,7 @@
 #include "common/int.h"
 #include "datatype/timestamp.h"
 #include "postmaster/walsummarizer.h"
+#include "storage/smgr.h"
 #include "utils/timestamp.h"
 
 #define	BLOCKS_PER_READ			512
@@ -699,9 +700,9 @@ GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
  * an incremental file in the backup instead of the entire file. On return,
  * *num_blocks_required will be set to the number of blocks that need to be
  * sent, and the actual block numbers will have been stored in
- * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
- * In addition, *truncation_block_length will be set to the value that should
- * be included in the incremental file.
+ * relative_block_numbers, which should be an array of at least
+ * rel_segment_size.  In addition, *truncation_block_length will be set to
+ * the value that should be included in the incremental file.
  */
 FileBackupMethod
 GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
@@ -712,7 +713,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 					BlockNumber *relative_block_numbers,
 					unsigned *truncation_block_length)
 {
-	BlockNumber absolute_block_numbers[RELSEG_SIZE];
+	BlockNumber *absolute_block_numbers;
 	BlockNumber limit_block;
 	BlockNumber start_blkno;
 	BlockNumber stop_blkno;
@@ -735,7 +736,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	 * If the file size is too large or not a multiple of BLCKSZ, then
 	 * something weird is happening, so give up and send the whole file.
 	 */
-	if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+	if ((size % BLCKSZ) != 0 || size / BLCKSZ > rel_segment_size)
 		return BACK_UP_FILE_FULLY;
 
 	/*
@@ -823,7 +824,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	 * If the limit_block is less than or equal to the point where this
 	 * segment starts, send the whole file.
 	 */
-	if (limit_block <= segno * RELSEG_SIZE)
+	if (limit_block <= segno * rel_segment_size)
 		return BACK_UP_FILE_FULLY;
 
 	/*
@@ -832,16 +833,18 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	 * We shouldn't overflow computing the start or stop block numbers, but if
 	 * it manages to happen somehow, detect it and throw an error.
 	 */
-	start_blkno = segno * RELSEG_SIZE;
+	start_blkno = segno * rel_segment_size;
 	stop_blkno = start_blkno + (size / BLCKSZ);
-	if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+	if (start_blkno / rel_segment_size != segno || stop_blkno < start_blkno)
 		ereport(ERROR,
 				errcode(ERRCODE_INTERNAL_ERROR),
 				errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
 								segno, size));
+	absolute_block_numbers = palloc(sizeof(BlockNumber) * rel_segment_size);
 	nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
-										  absolute_block_numbers, RELSEG_SIZE);
-	Assert(nblocks <= RELSEG_SIZE);
+										  absolute_block_numbers,
+										  rel_segment_size);
+	Assert(nblocks <= rel_segment_size);
 
 	/*
 	 * If we're going to have to send nearly all of the blocks, then just send
@@ -856,7 +859,10 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	 * nothing good about sending an incremental file in that case.
 	 */
 	if (nblocks * BLCKSZ > size * 0.9)
+	{
+		pfree(absolute_block_numbers);
 		return BACK_UP_FILE_FULLY;
+	}
 
 	/*
 	 * Looks like we can send an incremental file, so sort the absolute the
@@ -872,6 +878,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 		  compare_block_numbers);
 	for (i = 0; i < nblocks; ++i)
 		relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+	pfree(absolute_block_numbers);
 	*num_blocks_required = nblocks;
 
 	/*
@@ -885,7 +892,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	*truncation_block_length = size / BLCKSZ;
 	if (BlockNumberIsValid(limit_block))
 	{
-		unsigned	relative_limit = limit_block - segno * RELSEG_SIZE;
+		unsigned	relative_limit = limit_block - segno * rel_segment_size;
 
 		if (*truncation_block_length < relative_limit)
 			*truncation_block_length = relative_limit;
@@ -904,7 +911,7 @@ GetIncrementalFileSize(unsigned num_blocks_required)
 	size_t		result;
 
 	/* Make sure we're not going to overflow. */
-	Assert(num_blocks_required <= RELSEG_SIZE);
+	Assert(num_blocks_required <= rel_segment_size);
 
 	/*
 	 * Three four byte quantities (magic number, truncation block length,
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 986f6f1d9ca..880e913ce3c 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -217,7 +217,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	argv++;
 	argc--;
 
-	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1)
+	while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:R:-:")) != -1)
 	{
 		switch (flag)
 		{
@@ -275,6 +275,9 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'r':
 				strlcpy(OutputFileName, optarg, MAXPGPATH);
 				break;
+			case 'R':
+				rel_segment_size = strtoi64(optarg, NULL, 0);
+				break;
 			case 'X':
 				SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 				break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index a263875fd5a..e0b9fed9b7b 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -55,9 +55,9 @@
 #include "utils/resowner.h"
 
 /*
- * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
- * The reason is that we'd like large BufFiles to be spread across multiple
- * tablespaces when available.
+ * We break BufFiles into gigabyte-sized segments, regardless of
+ * rel_segment_size.  The reason is that we'd like large BufFiles to be spread
+ * across multiple tablespaces when available.
  */
 #define MAX_PHYSICAL_FILESIZE	0x40000000
 #define BUFFILE_SEG_SIZE		(MAX_PHYSICAL_FILESIZE / BLCKSZ)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bf0f3ca76d1..b95a15b8599 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
+#include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/md.h"
@@ -43,15 +44,15 @@
  * The magnetic disk storage manager keeps track of open file
  * descriptors in its own descriptor pool.  This is done to make it
  * easier to support relations that are larger than the operating
- * system's file size limit (often 2GBytes).  In order to do that,
- * we break relations up into "segment" files that are each shorter than
- * the OS file size limit.  The segment size is set by the RELSEG_SIZE
- * configuration constant in pg_config.h.
+ * system's file size limit (historically 2GB, sometimes much larger but still
+ * smaller than the maximum possible relation size).  In order to do that, we
+ * break relations up into "segment" files of a user-specified size chosen at
+ * initdb time and accessed as rel_segment_size.
  *
  * On disk, a relation must consist of consecutively numbered segment
  * files in the pattern
- *	-- Zero or more full segments of exactly RELSEG_SIZE blocks each
- *	-- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
+ *	-- Zero or more full segments of exactly rel_segment_size blocks each
+ *	-- Exactly one partial segment of size 0 <= size < rel_segment_size blocks
  *	-- Optionally, any number of inactive segments of size 0 blocks.
  * The full and partial segments are collectively the "active" segments.
  * Inactive segments are those that once contained data but are currently
@@ -108,7 +109,7 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
 #define EXTENSION_CREATE_RECOVERY	(1 << 3)
 /*
  * Allow opening segments which are preceded by segments smaller than
- * RELSEG_SIZE, e.g. inactive segments (see above). Note that this breaks
+ * rel_segment_size, e.g. inactive segments (see above). Note that this breaks
  * mdnblocks() and related functionality henceforth - which currently is ok,
  * because this is only required in the checkpointer which never uses
  * mdnblocks().
@@ -140,6 +141,31 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
+/* Given a block number, which segment is it in? */
+static inline uint32
+blockno_to_segno(BlockNumber blockno)
+{
+	/* Because it's a power of two, we can use a shift instead of "/". */
+	Assert(pg_popcount64(rel_segment_size) == 1);
+	return (uint64) blockno >> pg_leftmost_one_pos64(rel_segment_size);
+}
+
+/* Given a block number, which block is that within its segment? */
+static inline BlockNumber
+blockno_within_segment(BlockNumber blockno)
+{
+	/* Because it's a power of two, we can use a mask instead of "%". */
+	Assert(pg_popcount64(rel_segment_size) == 1);
+	return blockno & (rel_segment_size - 1);
+}
+
+/* Given a block number, convert it to byte offset within a segment. */
+static inline off_t
+blockno_to_seekpos(BlockNumber blockno)
+{
+	return blockno_within_segment(blockno) * (off_t) BLCKSZ;
+}
+
 static inline int
 _mdfd_open_flags(void)
 {
@@ -488,9 +514,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = blockno_to_seekpos(blocknum);
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (off_t) BLCKSZ * rel_segment_size);
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
@@ -512,7 +538,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	if (!skipFsync && !SmgrIsTemp(reln))
 		register_dirty_segment(reln, forknum, v);
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(reln, forknum, v) <= rel_segment_size);
 }
 
 /*
@@ -550,19 +576,19 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 
 	while (remblocks > 0)
 	{
-		BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
-		off_t		seekpos = (off_t) BLCKSZ * segstartblock;
+		BlockNumber segstartblock = blockno_within_segment(blocknum);
+		off_t		seekpos = blockno_to_seekpos(blocknum);
 		int			numblocks;
 
-		if (segstartblock + remblocks > RELSEG_SIZE)
-			numblocks = RELSEG_SIZE - segstartblock;
+		if (segstartblock + remblocks > rel_segment_size)
+			numblocks = rel_segment_size - segstartblock;
 		else
 			numblocks = remblocks;
 
 		v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
 
-		Assert(segstartblock < RELSEG_SIZE);
-		Assert(segstartblock + numblocks <= RELSEG_SIZE);
+		Assert(segstartblock < rel_segment_size);
+		Assert(segstartblock + numblocks <= rel_segment_size);
 
 		/*
 		 * If available and useful, use posix_fallocate() (via
@@ -616,7 +642,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		if (!skipFsync && !SmgrIsTemp(reln))
 			register_dirty_segment(reln, forknum, v);
 
-		Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+		Assert(_mdnblocks(reln, forknum, v) <= rel_segment_size);
 
 		remblocks -= numblocks;
 		curblocknum += numblocks;
@@ -668,7 +694,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
 
-	Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(reln, forknum, mdfd) <= rel_segment_size);
 
 	return mdfd;
 }
@@ -732,13 +758,13 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		if (v == NULL)
 			return false;
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = blockno_to_seekpos(blocknum);
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * rel_segment_size);
 
 		nblocks_this_segment =
 			Min(nblocks,
-				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+				rel_segment_size - blockno_within_segment(blocknum));
 
 		(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ * nblocks_this_segment,
 							WAIT_EVENT_DATA_FILE_PREFETCH);
@@ -824,13 +850,13 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, false,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = blockno_to_seekpos(blocknum);
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * rel_segment_size);
 
 		nblocks_this_segment =
 			Min(nblocks,
-				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+				rel_segment_size - blockno_within_segment(blocknum));
 		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
 
 		iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
@@ -947,13 +973,13 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = blockno_to_seekpos(blocknum);
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * rel_segment_size);
 
 		nblocks_this_segment =
 			Min(nblocks,
-				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+				rel_segment_size - blockno_within_segment(blocknum));
 		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
 
 		iovcnt = buffers_to_iovec(iov, (void **) buffers, nblocks_this_segment);
@@ -1058,17 +1084,17 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			return;
 
 		/* compute offset inside the current segment */
-		segnum_start = blocknum / RELSEG_SIZE;
+		segnum_start = blockno_to_segno(blocknum);
 
 		/* compute number of desired writes within the current segment */
-		segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE;
+		segnum_end = blockno_to_segno(blocknum + nblocks - 1);
 		if (segnum_start != segnum_end)
-			nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE));
+			nflush = rel_segment_size - blockno_within_segment(blocknum);
 
 		Assert(nflush >= 1);
 		Assert(nflush <= nblocks);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = blockno_to_seekpos(blocknum);
 
 		FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
 
@@ -1099,8 +1125,8 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 
 	/*
 	 * Start from the last open segments, to avoid redundant seeks.  We have
-	 * previously verified that these segments are exactly RELSEG_SIZE long,
-	 * and it's useless to recheck that each time.
+	 * previously verified that these segments are exactly rel_segment_size
+	 * long, and it's useless to recheck that each time.
 	 *
 	 * NOTE: this assumption could only be wrong if another backend has
 	 * truncated the relation.  We rely on higher code levels to handle that
@@ -1116,13 +1142,13 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	for (;;)
 	{
 		nblocks = _mdnblocks(reln, forknum, v);
-		if (nblocks > ((BlockNumber) RELSEG_SIZE))
+		if (nblocks > rel_segment_size)
 			elog(FATAL, "segment too big");
-		if (nblocks < ((BlockNumber) RELSEG_SIZE))
-			return (segno * ((BlockNumber) RELSEG_SIZE)) + nblocks;
+		if (nblocks < rel_segment_size)
+			return (segno * rel_segment_size) + nblocks;
 
 		/*
-		 * If segment is exactly RELSEG_SIZE, advance to next one.
+		 * If segment is exactly rel_segment_size, advance to next one.
 		 */
 		segno++;
 
@@ -1135,7 +1161,7 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 		 */
 		v = _mdfd_openseg(reln, forknum, segno, 0);
 		if (v == NULL)
-			return segno * ((BlockNumber) RELSEG_SIZE);
+			return segno * rel_segment_size;
 	}
 }
 
@@ -1176,7 +1202,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	{
 		MdfdVec    *v;
 
-		priorblocks = (curopensegs - 1) * RELSEG_SIZE;
+		priorblocks = (curopensegs - 1) * rel_segment_size;
 
 		v = &reln->md_seg_fds[forknum][curopensegs - 1];
 
@@ -1201,13 +1227,13 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 			FileClose(v->mdfd_vfd);
 			_fdvec_resize(reln, forknum, curopensegs - 1);
 		}
-		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
+		else if (priorblocks + rel_segment_size > nblocks)
 		{
 			/*
 			 * This is the last segment we want to keep. Truncate the file to
 			 * the right length. NOTE: if nblocks is exactly a multiple K of
-			 * RELSEG_SIZE, we will truncate the K+1st segment to 0 length but
-			 * keep it. This adheres to the invariant given in the header
+			 * rel_segment_size, we will truncate the K+1st segment to 0 length
+			 * but keep it. This adheres to the invariant given in the header
 			 * comments.
 			 */
 			BlockNumber lastsegblocks = nblocks - priorblocks;
@@ -1566,7 +1592,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	v->mdfd_vfd = fd;
 	v->mdfd_segno = segno;
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(reln, forknum, v) <= rel_segment_size);
 
 	/* all done */
 	return v;
@@ -1593,7 +1619,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 		   (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL |
 			EXTENSION_DONT_OPEN));
 
-	targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
+	targetseg = blockno_to_segno(blkno);
 
 	/* if an existing and opened segment, we're done */
 	if (targetseg < reln->md_num_open_segs[forknum])
@@ -1630,7 +1656,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 		Assert(nextsegno == v->mdfd_segno + 1);
 
-		if (nblocks > ((BlockNumber) RELSEG_SIZE))
+		if (nblocks > rel_segment_size)
 			elog(FATAL, "segment too big");
 
 		if ((behavior & EXTENSION_CREATE) ||
@@ -1645,31 +1671,31 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 * ahead and create the segments so we can finish out the replay.
 			 *
 			 * We have to maintain the invariant that segments before the last
-			 * active segment are of size RELSEG_SIZE; therefore, if
+			 * active segment are of size rel_segment_size; therefore, if
 			 * extending, pad them out with zeroes if needed.  (This only
 			 * matters if in recovery, or if the caller is extending the
 			 * relation discontiguously, but that can happen in hash indexes.)
 			 */
-			if (nblocks < ((BlockNumber) RELSEG_SIZE))
+			if (nblocks < rel_segment_size)
 			{
 				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
 													 MCXT_ALLOC_ZERO);
 
 				mdextend(reln, forknum,
-						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
+						 nextsegno * rel_segment_size - 1,
 						 zerobuf, skipFsync);
 				pfree(zerobuf);
 			}
 			flags = O_CREAT;
 		}
 		else if (!(behavior & EXTENSION_DONT_CHECK_SIZE) &&
-				 nblocks < ((BlockNumber) RELSEG_SIZE))
+				 nblocks < rel_segment_size)
 		{
 			/*
 			 * When not extending (or explicitly including truncated
 			 * segments), only open the next segment if the current one is
-			 * exactly RELSEG_SIZE.  If not (this branch), either return NULL
-			 * or fail.
+			 * exactly rel_segment_size.  If not (this branch), either return
+			 * NULL or fail.
 			 */
 			if (behavior & EXTENSION_RETURN_NULL)
 			{
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index a5b18328b89..82ede1c4f0a 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -57,10 +57,18 @@
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
+#include "utils/guc_tables.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
 
 
+/*
+ * The number of blocks that should be in a segment file.  Has a wider type
+ * than BlockNumber, so that it can represent the case where the whole
+ * relation fits in one file.
+ */
+int64		rel_segment_size;
+
 /*
  * This struct of function pointers defines the API between smgr.c and
  * any individual storage manager module.  Note that smgr subfunctions are
@@ -820,3 +828,9 @@ ProcessBarrierSmgrRelease(void)
 	smgrreleaseall();
 	return true;
 }
+
+const char *
+show_segment_size(void)
+{
+	return ShowGUCInt64WithUnits(rel_segment_size, GUC_UNIT_BLOCKS);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index dd5a46469a6..009db70a8b6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -5393,6 +5393,22 @@ GetConfigOptionByName(const char *name, const char **varname, bool missing_ok)
 	return ShowGUCOption(record, true);
 }
 
+/*
+ * Show unit-based values with appropriate unit, as ShowGUCOption() would.
+ * This can be used by custom show hooks.
+ */
+char *
+ShowGUCInt64WithUnits(int64 value, int flags)
+{
+	int64		number;
+	const char *unit;
+	char		buffer[256];
+
+	convert_int_from_base_unit(value, flags & GUC_UNIT, &number, &unit);
+	snprintf(buffer, sizeof(buffer), INT64_FORMAT "%s", number, unit);
+	return pstrdup(buffer);
+}
+
 /*
  * ShowGUCOption: get string value of variable
  *
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 45013582a74..45cd53ab79b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -594,10 +594,10 @@ static int	max_function_args;
 static int	max_index_keys;
 static int	max_identifier_length;
 static int	block_size;
-static int	segment_size;
 static int	shared_memory_size_mb;
 static int	shared_memory_size_in_huge_pages;
 static int	wal_block_size;
+static int	phony_segment_size;
 static bool data_checksums;
 static bool integer_datetimes;
 
@@ -3239,15 +3239,19 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * We use a phony GUC with a custom show function, because we don't
+	 * support GUCs with a wide enough type.
+	 */
 	{
 		{"segment_size", PGC_INTERNAL, PRESET_OPTIONS,
 			gettext_noop("Shows the number of pages per disk file."),
 			NULL,
 			GUC_UNIT_BLOCKS | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
 		},
-		&segment_size,
-		RELSEG_SIZE, RELSEG_SIZE, RELSEG_SIZE,
-		NULL, NULL, NULL
+		&phony_segment_size,
+		0, 0, 0,
+		NULL, NULL, show_segment_size
 	},
 
 	{
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 200b2e8e317..0f24e0337a7 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -81,6 +81,7 @@
 #include "getopt_long.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 
 
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
@@ -165,6 +166,8 @@ static bool show_setting = false;
 static bool data_checksums = false;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
+static char *str_rel_segment_size = NULL;
+static int64 rel_segment_size;
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 
 
@@ -1536,12 +1539,12 @@ bootstrap_template1(void)
 
 	printfPQExpBuffer(&cmd, "\"%s\" --boot %s %s", backend_exec, boot_options, extra_options);
 	appendPQExpBuffer(&cmd, " -X %d", wal_segment_size_mb * (1024 * 1024));
+	appendPQExpBuffer(&cmd, " -R " INT64_FORMAT, rel_segment_size);
 	if (data_checksums)
 		appendPQExpBuffer(&cmd, " -k");
 	if (debug)
 		appendPQExpBuffer(&cmd, " -d 5");
 
-
 	PG_CMD_OPEN(cmd.data);
 
 	for (line = bki_lines; *line != NULL; line++)
@@ -2456,6 +2459,7 @@ usage(const char *progname)
 	printf(_("  -W, --pwprompt            prompt for a password for the new superuser\n"));
 	printf(_("  -X, --waldir=WALDIR       location for the write-ahead log directory\n"));
 	printf(_("      --wal-segsize=SIZE    size of WAL segments, in megabytes\n"));
+	printf(_("      --rel-segsize=SIZE    size of relation segments\n"));
 	printf(_("\nLess commonly used options:\n"));
 	printf(_("  -c, --set NAME=VALUE      override default setting for server parameter\n"));
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
@@ -3107,6 +3111,7 @@ main(int argc, char *argv[])
 		{"icu-locale", required_argument, NULL, 16},
 		{"icu-rules", required_argument, NULL, 17},
 		{"sync-method", required_argument, NULL, 18},
+		{"rel-segsize", required_argument, NULL, 19},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3291,6 +3296,9 @@ main(int argc, char *argv[])
 				if (!parse_sync_method(optarg, &sync_method))
 					exit(1);
 				break;
+			case 19:
+				str_rel_segment_size = pg_strdup(optarg);
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3357,6 +3365,43 @@ main(int argc, char *argv[])
 	if (!IsValidWalSegSize(wal_segment_size_mb * 1024 * 1024))
 		pg_fatal("argument of %s must be a power of two between 1 and 1024", "--wal-segsize");
 
+	/* set rel segment size */
+	if (str_rel_segment_size == NULL)
+	{
+		rel_segment_size = (1024 * 1024 * 1024) / BLCKSZ;
+	}
+	else
+	{
+		int64		bytes;
+		char	   *endptr;
+
+		bytes = strtol(str_rel_segment_size, &endptr, 10);
+		if (endptr == str_rel_segment_size)
+			pg_fatal("argument of --rel-segsize must begin with a number");
+		if (bytes == 0)
+			pg_fatal("argument of --rel-segsize must be greater than zero");
+
+		if (strcmp(endptr, "kB") == 0)
+			bytes *= 1024;
+		else if (strcmp(endptr, "MB") == 0)
+			bytes *= 1024 * 1024;
+		else if (strcmp(endptr, "GB") == 0)
+			bytes *= 1024 * 1024 * 1024;
+		else if (strcmp(endptr, "TB") == 0)
+			bytes *= UINT64CONST(1024) * 1024 * 1024 * 1024;
+		else
+			pg_fatal("argument of --rel-segsize must end with kB, MB, GB or TB");
+
+		if (bytes % BLCKSZ != 0)
+			pg_fatal("argument of --rel-segsize must be a multiple of BLCKSZ");
+		if (pg_popcount64(bytes) != 1)
+			pg_fatal("argument of --rel-segsize must be a power of two");
+		if (sizeof(off_t) < 8 && bytes > (INT64CONST(1) << 31))
+			pg_fatal("argument of --rel-segsize is too large for this platform's off_t");
+
+		rel_segment_size = bytes / BLCKSZ;
+	}
+
 	get_restricted_token();
 
 	setup_pgdata();
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 9e6fd435f60..17767b55606 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -223,7 +223,7 @@ scan_file(const char *fn, int segmentno)
 		if (PageIsNew(buf.data))
 			continue;
 
-		csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
+		csum = pg_checksum_page(buf.data, blockno + segmentno * ControlFile->relseg_size);
 		if (mode == PG_MODE_CHECK)
 		{
 			if (csum != header->pd_checksum)
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 873d3079025..998f593968b 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -22,6 +22,10 @@
 #include "reconstruct.h"
 #include "storage/block.h"
 
+
+/* XXX this will need to be loaded out of a control file! */
+int64		rel_segment_size = 131072;
+
 /*
  * An rfile stores the data that we need in order to be able to use some file
  * on disk for reconstruction. For any given output file, we create one rfile
@@ -447,16 +451,18 @@ make_incremental_rfile(char *filename)
 
 	/* Read block count. */
 	read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
-	if (rf->num_blocks > RELSEG_SIZE)
-		pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
-				 filename, rf->num_blocks, RELSEG_SIZE);
+	if (rf->num_blocks > rel_segment_size)
+		pg_fatal("file \"%s\" has block count %u in excess of segment size "
+				 INT64_FORMAT,
+				 filename, rf->num_blocks, rel_segment_size);
 
 	/* Read truncation block length. */
 	read_bytes(rf, &rf->truncation_block_length,
 			   sizeof(rf->truncation_block_length));
-	if (rf->truncation_block_length > RELSEG_SIZE)
-		pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
-				 filename, rf->truncation_block_length, RELSEG_SIZE);
+	if (rf->truncation_block_length > rel_segment_size)
+		pg_fatal("file \"%s\" has truncation block length %u in excess of segment size "
+				 INT64_FORMAT,
+				 filename, rf->truncation_block_length, rel_segment_size);
 
 	/* Read block numbers if there are any. */
 	if (rf->num_blocks > 0)
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93e0837947c..8687785bad5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -304,7 +304,7 @@ main(int argc, char *argv[])
 	/* we don't print floatFormat since can't say much useful about it */
 	printf(_("Database block size:                  %u\n"),
 		   ControlFile->blcksz);
-	printf(_("Blocks per segment of large relation: %u\n"),
+	printf(_("Blocks per segment of large relation: " INT64_FORMAT "\n"),
 		   ControlFile->relseg_size);
 	printf(_("WAL block size:                       %u\n"),
 		   ControlFile->xlog_blcksz);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d89..92d34833ad9 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -690,7 +690,7 @@ GuessControlValues(void)
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
 	ControlFile.floatFormat = FLOATFORMAT_VALUE;
 	ControlFile.blcksz = BLCKSZ;
-	ControlFile.relseg_size = RELSEG_SIZE;
+	ControlFile.relseg_size = 1024 * 1024 * 1024;
 	ControlFile.xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile.xlog_seg_size = DEFAULT_XLOG_SEG_SIZE;
 	ControlFile.nameDataLen = NAMEDATALEN;
@@ -758,7 +758,7 @@ PrintControlValues(bool guessed)
 	/* we don't print floatFormat since can't say much useful about it */
 	printf(_("Database block size:                  %u\n"),
 		   ControlFile.blcksz);
-	printf(_("Blocks per segment of large relation: %u\n"),
+	printf(_("Blocks per segment of large relation: " INT64_FORMAT "\n"),
 		   ControlFile.relseg_size);
 	printf(_("WAL block size:                       %u\n"),
 		   ControlFile.xlog_blcksz);
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 255ddf2ffaf..0c7e4522b6d 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -296,8 +296,8 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
 	BlockNumber blkno_inseg;
 	int			segno;
 
-	segno = blkno / RELSEG_SIZE;
-	blkno_inseg = blkno % RELSEG_SIZE;
+	segno = blkno / rel_segment_size;
+	blkno_inseg = blkno % rel_segment_size;
 
 	path = datasegpath(rlocator, forknum, segno);
 	entry = lookup_filehash_entry(path);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index bde90bf60bb..8553c565f76 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -62,6 +62,7 @@ static ControlFileData ControlFile_source_after;
 
 const char *progname;
 int			WalSegSz;
+int64		rel_segment_size;
 
 /* Configuration options */
 char	   *datadir_target = NULL;
@@ -1041,6 +1042,8 @@ digestControlFile(ControlFileData *ControlFile, const char *content,
 		exit(1);
 	}
 
+	rel_segment_size = ControlFile->relseg_size;
+
 	/* Additional checks on control file */
 	checkControlFile(ControlFile);
 }
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index ec43cbe2c67..596741b2b8f 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -26,6 +26,7 @@ extern bool dry_run;
 extern bool do_sync;
 extern int	WalSegSz;
 extern DataDirSyncMethod sync_method;
+extern int64 rel_segment_size;
 
 /* Target history */
 extern TimeLineHistoryEntry *targetHistory;
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index a1fc5fec78d..9cd3b00fe40 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -183,7 +183,7 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 
 	/*
 	 * Now copy/link any related segments as well. Remember, PG breaks large
-	 * files into 1GB segments, the first segment has no extension, subsequent
+	 * files into segments, the first segment has no extension, subsequent
 	 * segments are named relfilenumber.1, relfilenumber.2, relfilenumber.3.
 	 */
 	for (segno = 0;; segno++)
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index a00606ffcdf..354b15fbff1 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -204,7 +204,7 @@ typedef struct ControlFileData
 	 * compatible with the backend executable.
 	 */
 	uint32		blcksz;			/* data block size for this DB */
-	uint32		relseg_size;	/* blocks per segment of large relation */
+	int64		relseg_size;	/* blocks per segment of large relation */
 
 	uint32		xlog_blcksz;	/* block size within WAL files */
 	uint32		xlog_seg_size;	/* size of each WAL segment */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df6..50426f4c021 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -637,19 +637,6 @@
    your system. */
 #undef PTHREAD_CREATE_JOINABLE
 
-/* RELSEG_SIZE is the maximum number of blocks allowed in one disk file. Thus,
-   the maximum size of a single file is RELSEG_SIZE * BLCKSZ; relations bigger
-   than that are divided into multiple files. RELSEG_SIZE * BLCKSZ must be
-   less than your OS' limit on file size. This is often 2 GB or 4GB in a
-   32-bit operating system, unless you have large file support enabled. By
-   default, we make the limit 1 GB to avoid any possible integer-overflow
-   problems within the OS. A limit smaller than necessary only means we divide
-   a large relation into more chunks than necessary, so it seems best to err
-   in the direction of a small limit. A power-of-2 value is recommended to
-   save a few cycles in md.c, but is not absolutely required. Changing
-   RELSEG_SIZE requires an initdb. */
-#undef RELSEG_SIZE
-
 /* The size of `bool', as computed by sizeof. */
 #undef SIZEOF_BOOL
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index fc5f883ce14..4d853b71222 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,8 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 
+extern int64 rel_segment_size;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -109,6 +111,7 @@ extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
+extern const char *show_segment_size(void);
 
 static inline void
 smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0a2e274ebb2..2a8399b32c4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -302,6 +302,7 @@ extern struct config_generic **get_explain_guc_options(int *num);
 
 /* get string value of variable */
 extern char *ShowGUCOption(struct config_generic *record, bool use_units);
+extern char *ShowGUCInt64WithUnits(int64 value, int flags);
 
 /* get whether or not the GUC variable is visible to current user */
 extern bool ConfigOptionIsVisible(struct config_generic *conf);
-- 
2.39.2

#27Peter Eisentraut
peter@eisentraut.org
In reply to: Thomas Munro (#26)
Re: Large files for relations

On 06.03.24 22:54, Thomas Munro wrote:

Rebased. I had intended to try to get this into v17, but a couple of
unresolved problems came up while rebasing over the new incremental
backup stuff. You snooze, you lose. Hopefully we can sort these out
in time for the next commitfest:

* should pg_combinebasebackup read the control file to fetch the segment size?
* hunt for other segment-size related problems that may be lurking in
new incremental backup stuff
* basebackup_incremental.c wants to use memory in proportion to
segment size, which looks like a problem, and I wrote about that in a
new thread[1]

Overall, I like this idea, and the patch seems to have many bases covered.

The patch will need a rebase. I was able to test it on
master@{2024-03-13}, but after that there are conflicts.

In .cirrus.tasks.yml, one of the test tasks uses
--with-segsize-blocks=6, but you are removing that option. You could
replace that with something like

PG_TEST_INITDB_EXTRA_OPTS='--rel-segsize=48kB'

But that won't work exactly because

initdb: error: argument of --rel-segsize must be a power of two

I suppose that's ok as a change, since it makes the arithmetic more
efficient. But maybe it should be called out explicitly in the commit
message.

If I run it with 64kB, the test pgbench/001_pgbench_with_server fails
consistently, so it seems there is still a gap somewhere.

A minor point, the initdb error message

initdb: error: argument of --rel-segsize must be a multiple of BLCKSZ

would be friendlier if actually showed the value of the block size
instead of just the symbol. Similarly for the nearby error message
about the off_t size.

In the control file, all the other fields use unsigned types. Should
relseg_size be uint64?

PG_CONTROL_VERSION needs to be changed.