[Patch] Windows relation extension failure at 2GB and 4GB

Started by Bryan Green, 3 months ago, 32 messages
#1Bryan Green
dbryan.green@gmail.com
1 attachment(s)

Hello,

I found two related bugs in PostgreSQL's Windows port that prevent files
from exceeding 2GB. While unlikely to affect most installations (default
1GB segments), the code is objectively wrong and worth fixing.

The first bug is a pervasive use of off_t where pgoff_t should be used.
On Windows, off_t is only 32-bit, causing signed integer overflow at
exactly 2GB (2^31 bytes). PostgreSQL already defined pgoff_t as __int64
for this purpose and some function declarations in headers already use
it, but the implementations weren't updated to match.

The problem shows up in multiple layers:

In fd.c and fd.h, the VfdCache structure's fileSize field uses off_t, as
do FileSize(), FileTruncate(), and all the File* functions. These are
the core file I/O abstraction layer that everything else builds on.

In md.c, _mdnblocks() uses off_t for its calculations. The actual
arithmetic for computing file offsets is fine - the casts to pgoff_t
work correctly - but passing these values through functions with off_t
parameters truncates them.

In pg_iovec.h, pg_preadv() and pg_pwritev() take off_t offset
parameters, truncating any 64-bit offsets passed from above.

In file_utils.c, pg_pwrite_zeros() takes an off_t offset parameter. This
function is called by FileZero() to extend files with zeros, so it's hit
during relation extension. This was the actual culprit in my testing -
mdzeroextend() would compute a correct 64-bit offset, but it got
truncated to 32-bit (and negative) when passed to pg_pwrite_zeros().
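To make the truncation described above concrete, here is a minimal sketch, not code from the patch. The names narrow_off_t, wide_off_t, pass_through_narrow() and pass_through_wide() are hypothetical stand-ins: narrow_off_t models the 32-bit off_t on Windows and wide_off_t models pgoff_t. The narrowing conversion is implementation-defined in C, but it wraps on the common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins, not PostgreSQL types. */
typedef int32_t narrow_off_t;   /* models the 32-bit off_t on Windows */
typedef int64_t wide_off_t;     /* models the 64-bit pgoff_t */

/*
 * Models a callee whose offset parameter is declared with the narrow type.
 * The explicit cast mirrors the implicit narrowing that happens at such a
 * call boundary; the result is implementation-defined, and in practice
 * wraps modulo 2^32.
 */
static int64_t
pass_through_narrow(int64_t computed_offset)
{
    narrow_off_t seen = (narrow_off_t) computed_offset;

    return (int64_t) seen;
}

/* Models the same callee once its parameter is declared 64-bit. */
static int64_t
pass_through_wide(int64_t computed_offset)
{
    wide_off_t  seen = (wide_off_t) computed_offset;

    assert(sizeof(wide_off_t) == 8);
    return (int64_t) seen;
}
```

Feeding 2^31 (the first byte past 2GB) through pass_through_narrow() typically yields a negative offset, which is consistent with the "Invalid argument" failure, while pass_through_wide() returns it unchanged.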

After fixing all those off_t issues, there's a second bug at 4GB in the
Windows implementations of pg_pwrite()/pg_pread() in win32pwrite.c and
win32pread.c. The current implementation uses an OVERLAPPED structure
for positioned I/O, but only sets the Offset field (low 32 bits),
leaving OffsetHigh at zero. This works up to 4GB by accident, but beyond
that, offsets wrap around.
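The OffsetHigh problem can be modelled portably. This is only a sketch under stated assumptions: overlapped_model is a hypothetical struct that mimics just the two offset fields of the Win32 OVERLAPPED structure, and split_offset() and effective_offset() are illustrative helpers, not Windows or PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two offset fields of OVERLAPPED. */
typedef struct
{
    uint32_t    offset_low;     /* models OVERLAPPED.Offset */
    uint32_t    offset_high;    /* models OVERLAPPED.OffsetHigh */
} overlapped_model;

/*
 * Split a 64-bit file offset into its two 32-bit halves.  With set_high == 0
 * only the low half is filled in, which models the old behaviour.
 */
static overlapped_model
split_offset(int64_t offset, int set_high)
{
    overlapped_model o = {0, 0};

    assert(offset >= 0);
    o.offset_low = (uint32_t) offset;
    if (set_high)
        o.offset_high = (uint32_t) (offset >> 32);
    return o;
}

/* Recombine the halves into the position the I/O would actually target. */
static int64_t
effective_offset(overlapped_model o)
{
    return (int64_t) (((uint64_t) o.offset_high << 32) | o.offset_low);
}
```

With the high half left at zero, a request at 5GB would land at 1GB, i.e. on top of existing data. That is why the 4GB case risks corruption rather than a clean error.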

I can reproduce both bugs reliably with --with-segsize=8. The file grows
to exactly 2GB and fails with "could not extend file: Invalid argument"
despite having 300GB free. After fixing the off_t issues, it grows to
exactly 4GB and hits the OVERLAPPED bug. Both are independently verifiable.

The fix touches nine files:
src/include/storage/fd.h - File* function declarations
src/backend/storage/file/fd.c - File* implementations and VfdCache
src/backend/storage/smgr/md.c - _mdnblocks and other functions
src/include/port/pg_iovec.h - pg_preadv/pg_pwritev signatures
src/include/common/file_utils.h - pg_pwrite_zeros declaration
src/common/file_utils.c - pg_pwrite_zeros implementation
src/include/port/win32_port.h - pg_pread/pg_pwrite declarations
src/port/win32pwrite.c - Windows pwrite implementation
src/port/win32pread.c - Windows pread implementation

It's safe for all platforms since pgoff_t equals off_t on Unix where
off_t is already 64-bit. Only Windows behavior changes.
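As a rough illustration of why the type swap is a no-op outside Windows, the sketch below defines a stand-in type the same way. sketch_pgoff_t and the two helper functions are hypothetical names used only here, and the sketch assumes a platform where off_t is already 64-bit (as on typical 64-bit Unix systems).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for pgoff_t; the real definition lives in the port headers. */
#ifdef _WIN32
typedef __int64 sketch_pgoff_t; /* off_t is only 32-bit here */
#else
typedef off_t sketch_pgoff_t;   /* same type as off_t, so nothing changes */
#endif

/* Width in bytes of the stand-in offset type. */
static size_t
sketch_pgoff_width(void)
{
    return sizeof(sketch_pgoff_t);
}

/* Width in bytes of the platform's native off_t. */
static size_t
native_off_width(void)
{
    return sizeof(off_t);
}
```

On such a Unix platform both functions return 8. On Windows the stand-in is 8 bytes wide while the native off_t is 4.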

That said, I'm finding off_t used in many other places throughout the
codebase - buffile.c, various other file utilities such as backup and
archive, probably more. This is likely causing latent bugs elsewhere on
Windows, though most are masked by the 1GB default segment size. I'm
investigating the full scope, but I think this needs to be broken up
into multiple patches. The core file I/O layer (fd.c, md.c,
pg_pwrite/pg_pread) should probably go first since that's what's
actively breaking file extension.

Not urgent since few people hit this in practice, but it's clearly wrong
code. Someone building with larger segments would see failures at 2GB and
potential corruption at 4GB. Windows supports files up to 16 exabytes -
no good reason to limit PostgreSQL to 2GB.

I have attached the patch to fix the relation extension problems for
Windows to this email.

I can provide the other patches that change off_t to pgoff_t in the rest
of the code if there's interest in fixing this.

To reproduce the bugs on Windows:

1) Build with large segment size:
meson setup build --prefix=C:\pgsql-test -Dsegsize=8
2) Create a large table and insert data that will make it bigger than 2GB.

CREATE TABLE large_test (
id bigserial PRIMARY KEY,
data1 text,
data2 text,
data3 text
);

INSERT INTO large_test (data1, data2, data3)
SELECT
repeat('A', 300),
repeat('B', 300),
repeat('C', 300)
FROM generate_series(1, 5000000);

SELECT pg_size_pretty(pg_relation_size('large_test'));

You will notice at this point that the first bug surfaces.

3) If you want to reproduce the 2nd bug then you should apply the patch
and then comment out 'overlapped.OffsetHigh = (DWORD) (offset >> 32);'
in win32pwrite.c.
4) Assuming you did 3, do the test in 2 again. If you are watching the
data/base/N/xxxxx file growing you will notice that it gets past 2GB but
now fails at 4GB.
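For anyone watching the file grow during the steps above, this sketch estimates which block crosses each limit. It assumes the default 8kB block size and the 8GB segment size used in step 1. SKETCH_BLCKSZ, SKETCH_RELSEG_SIZE, segment_offset() and fits_in_signed_32() are hypothetical names; only the offset formula follows the seekpos computation in md.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: 8kB blocks and 8GB segments (-Dsegsize=8). */
#define SKETCH_BLCKSZ 8192
#define SKETCH_RELSEG_SIZE \
    ((uint32_t) (INT64_C(8) * 1024 * 1024 * 1024 / SKETCH_BLCKSZ))

/* Byte offset of a block within its segment file, computed in 64 bits. */
static int64_t
segment_offset(uint32_t blocknum)
{
    assert(SKETCH_RELSEG_SIZE == 1048576);
    return (int64_t) SKETCH_BLCKSZ * (blocknum % SKETCH_RELSEG_SIZE);
}

/* Whether an offset survives being stored in a signed 32-bit off_t. */
static int
fits_in_signed_32(int64_t offset)
{
    return offset >= 0 && offset <= INT32_MAX;
}
```

Under these assumptions block 262143 is the last one whose offset still fits in a signed 32-bit off_t, block 262144 sits at exactly 2GB, and block 524288 sits at exactly 4GB, matching the two sizes at which the file stops growing.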

BG

Attachments:

0001-Fix-Windows-file-IO.patch (text/plain; charset=UTF-8)
From d5cdf919b3a46c8edef0356d61588925a257371e Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Mon, 27 Oct 2025 13:20:01 -0600
Subject: [PATCH] Fix Windows file I/O to support files larger than 2GB

PostgreSQL's Windows port has been unable to handle files larger than 2GB
due to pervasive use of off_t for file offsets, which is only 32-bit on
Windows. This causes signed integer overflow at exactly 2^31 bytes.

The codebase already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
implementations still used off_t.

This issue is unlikely to affect most users since the default RELSEG_SIZE
is 1GB, keeping individual segment files small. However, anyone building
with --with-segsize larger than 2 would hit this bug. Tested with
--with-segsize=8 and verified that files can now grow beyond 4GB.

Note: off_t is still used in other parts of the codebase (e.g. buffile.c)
which may have similar issues on Windows, but those are outside the
critical path for relation file extension and can be addressed separately.

On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.
---
 src/backend/storage/file/fd.c   | 38 ++++++++++++-------------
 src/backend/storage/smgr/md.c   | 50 ++++++++++++++++-----------------
 src/common/file_utils.c         |  4 +--
 src/include/common/file_utils.h |  4 +--
 src/include/port/pg_iovec.h     |  4 +--
 src/include/port/win32_port.h   |  4 +--
 src/include/storage/fd.h        | 26 ++++++++---------
 src/port/win32pread.c           | 10 +++----
 src/port/win32pwrite.c          | 10 +++----
 9 files changed, 75 insertions(+), 75 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a4ec7959f3..b25e74831e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -201,7 +201,7 @@ typedef struct vfd
 	File		nextFree;		/* link to next free VFD, if in freelist */
 	File		lruMoreRecently;	/* doubly linked recency-of-use list */
 	File		lruLessRecently;
-	off_t		fileSize;		/* current size of file (0 if not temporary) */
+	pgoff_t		fileSize;		/* current size of file (0 if not temporary) */
 	char	   *fileName;		/* name of file, or NULL for unused VFD */
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
 	int			fileFlags;		/* open(2) flags for (re)opening the file */
@@ -519,7 +519,7 @@ pg_file_exists(const char *name)
  * offset of 0 with nbytes 0 means that the entire file should be flushed
  */
 void
-pg_flush_data(int fd, off_t offset, off_t nbytes)
+pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes)
 {
 	/*
 	 * Right now file flushing is primarily used to avoid making later
@@ -635,7 +635,7 @@ retry:
 		 * may simply not be enough address space.  If so, silently fall
 		 * through to the next implementation.
 		 */
-		if (nbytes <= (off_t) SSIZE_MAX)
+		if (nbytes <= (pgoff_t) SSIZE_MAX)
 			p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);
 		else
 			p = MAP_FAILED;
@@ -697,7 +697,7 @@ retry:
  * Truncate an open file to a given length.
  */
 static int
-pg_ftruncate(int fd, off_t length)
+pg_ftruncate(int fd, pgoff_t length)
 {
 	int			ret;
 
@@ -714,7 +714,7 @@ retry:
  * Truncate a file to a given length by name.
  */
 int
-pg_truncate(const char *path, off_t length)
+pg_truncate(const char *path, pgoff_t length)
 {
 	int			ret;
 #ifdef WIN32
@@ -1526,7 +1526,7 @@ FileAccess(File file)
  * Called whenever a temporary file is deleted to report its size.
  */
 static void
-ReportTemporaryFileUsage(const char *path, off_t size)
+ReportTemporaryFileUsage(const char *path, pgoff_t size)
 {
 	pgstat_report_tempfile(size);
 
@@ -2077,7 +2077,7 @@ FileClose(File file)
  * this.
  */
 int
-FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	Assert(FileIsValid(file));
 
@@ -2108,7 +2108,7 @@ retry:
 	{
 		struct radvisory
 		{
-			off_t		ra_offset;	/* offset into the file */
+			pgoff_t		ra_offset;	/* offset into the file */
 			int			ra_count;	/* size of the read     */
 		}			ra;
 		int			returnCode;
@@ -2133,7 +2133,7 @@ retry:
 }
 
 void
-FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
+FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info)
 {
 	int			returnCode;
 
@@ -2159,7 +2159,7 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 ssize_t
-FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+FileReadV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	ssize_t		returnCode;
@@ -2216,7 +2216,7 @@ retry:
 
 int
 FileStartReadV(PgAioHandle *ioh, File file,
-			   int iovcnt, off_t offset,
+			   int iovcnt, pgoff_t offset,
 			   uint32 wait_event_info)
 {
 	int			returnCode;
@@ -2241,7 +2241,7 @@ FileStartReadV(PgAioHandle *ioh, File file,
 }
 
 ssize_t
-FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+FileWriteV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset,
 		   uint32 wait_event_info)
 {
 	ssize_t		returnCode;
@@ -2270,7 +2270,7 @@ FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset;
+		pgoff_t		past_write = offset;
 
 		for (int i = 0; i < iovcnt; ++i)
 			past_write += iov[i].iov_len;
@@ -2309,7 +2309,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + returnCode;
+			pgoff_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
@@ -2373,7 +2373,7 @@ FileSync(File file, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	int			returnCode;
 	ssize_t		written;
@@ -2418,7 +2418,7 @@ FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 #ifdef HAVE_POSIX_FALLOCATE
 	int			returnCode;
@@ -2457,7 +2457,7 @@ retry:
 	return FileZero(file, offset, amount, wait_event_info);
 }
 
-off_t
+pgoff_t
 FileSize(File file)
 {
 	Assert(FileIsValid(file));
@@ -2468,14 +2468,14 @@ FileSize(File file)
 	if (FileIsNotOpen(file))
 	{
 		if (FileAccess(file) < 0)
-			return (off_t) -1;
+			return (pgoff_t) -1;
 	}
 
 	return lseek(VfdCache[file].fd, 0, SEEK_END);
 }
 
 int
-FileTruncate(File file, off_t offset, uint32 wait_event_info)
+FileTruncate(File file, pgoff_t offset, uint32 wait_event_info)
 {
 	int			returnCode;
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 235ba7e191..e3f335a834 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -487,7 +487,7 @@ void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
 
@@ -515,9 +515,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
@@ -578,7 +578,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 	while (remblocks > 0)
 	{
 		BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
-		off_t		seekpos = (off_t) BLCKSZ * segstartblock;
+		pgoff_t		seekpos = (pgoff_t) BLCKSZ * segstartblock;
 		int			numblocks;
 
 		if (segstartblock + remblocks > RELSEG_SIZE)
@@ -607,7 +607,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			int			ret;
 
 			ret = FileFallocate(v->mdfd_vfd,
-								seekpos, (off_t) BLCKSZ * numblocks,
+								seekpos, (pgoff_t) BLCKSZ * numblocks,
 								WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret != 0)
 			{
@@ -630,7 +630,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			 * whole length of the extension.
 			 */
 			ret = FileZero(v->mdfd_vfd,
-						   seekpos, (off_t) BLCKSZ * numblocks,
+						   seekpos, (pgoff_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret < 0)
 				ereport(ERROR,
@@ -745,7 +745,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	while (nblocks > 0)
 	{
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
 		int			nblocks_this_segment;
 
@@ -754,9 +754,9 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		if (v == NULL)
 			return false;
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -851,7 +851,7 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	{
 		struct iovec iov[PG_IOV_MAX];
 		int			iovcnt;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		int			nbytes;
 		MdfdVec    *v;
 		BlockNumber nblocks_this_segment;
@@ -861,9 +861,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, false,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -986,7 +986,7 @@ mdstartreadv(PgAioHandle *ioh,
 			 SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 void **buffers, BlockNumber nblocks)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	MdfdVec    *v;
 	BlockNumber nblocks_this_segment;
 	struct iovec *iov;
@@ -996,9 +996,9 @@ mdstartreadv(PgAioHandle *ioh,
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	nblocks_this_segment =
 		Min(nblocks,
@@ -1068,7 +1068,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	{
 		struct iovec iov[PG_IOV_MAX];
 		int			iovcnt;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		int			nbytes;
 		MdfdVec    *v;
 		BlockNumber nblocks_this_segment;
@@ -1078,9 +1078,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -1173,7 +1173,7 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 	while (nblocks > 0)
 	{
 		BlockNumber nflush = nblocks;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
 		int			segnum_start,
 					segnum_end;
@@ -1202,9 +1202,9 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		Assert(nflush >= 1);
 		Assert(nflush <= nblocks);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
+		FileWriteback(v->mdfd_vfd, seekpos, (pgoff_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
 
 		nblocks -= nflush;
 		blocknum += nflush;
@@ -1348,7 +1348,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 			 */
 			BlockNumber lastsegblocks = nblocks - priorblocks;
 
-			if (FileTruncate(v->mdfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+			if (FileTruncate(v->mdfd_vfd, (pgoff_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not truncate file \"%s\" to %u blocks: %m",
@@ -1484,9 +1484,9 @@ mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL);
 
-	*off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	*off = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(*off < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	return FileGetRawDesc(v->mdfd_vfd);
 }
@@ -1868,7 +1868,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 static BlockNumber
 _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
-	off_t		len;
+	pgoff_t		len;
 
 	len = FileSize(seg->mdfd_vfd);
 	if (len < 0)
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 7b62687a2a..cdf08ab5cb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -656,7 +656,7 @@ compute_remaining_iovec(struct iovec *destination,
  * error is returned, it is unspecified how much has been written.
  */
 ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 	struct iovec iov_copy[PG_IOV_MAX];
 	ssize_t		sum = 0;
@@ -706,7 +706,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * is returned with errno set.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size, off_t offset)
+pg_pwrite_zeros(int fd, size_t size, pgoff_t offset)
 {
 	static const PGIOAlignedBlock zbuffer = {0};	/* worth BLCKSZ */
 	void	   *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data;
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 9fd88953e4..4239713803 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -55,9 +55,9 @@ extern int	compute_remaining_iovec(struct iovec *destination,
 extern ssize_t pg_pwritev_with_retry(int fd,
 									 const struct iovec *iov,
 									 int iovcnt,
-									 off_t offset);
+									 pgoff_t offset);
 
-extern ssize_t pg_pwrite_zeros(int fd, size_t size, off_t offset);
+extern ssize_t pg_pwrite_zeros(int fd, size_t size, pgoff_t offset);
 
 /* Filename components */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index 90be3af449..845ded8c71 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -51,7 +51,7 @@ struct iovec
  * this changes the current file position.
  */
 static inline ssize_t
-pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 #if HAVE_DECL_PREADV
 	/*
@@ -90,7 +90,7 @@ pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * this changes the current file position.
  */
 static inline ssize_t
-pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 #if HAVE_DECL_PWRITEV
 	/*
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index ff7028bdc8..f54ccef7db 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -584,9 +584,9 @@ typedef unsigned short mode_t;
 #endif
 
 /* in port/win32pread.c */
-extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, pgoff_t offset);
 
 /* in port/win32pwrite.c */
-extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, pgoff_t offset);
 
 #endif							/* PG_WIN32_PORT_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index b77d8e5e30..3e821ce8fb 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -108,17 +108,17 @@ extern File PathNameOpenFile(const char *fileName, int fileFlags);
 extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
-extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
-extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
-extern int	FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset, uint32 wait_event_info);
+extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset, uint32 wait_event_info);
+extern int	FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, pgoff_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
-extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
+extern int	FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern int	FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
 
-extern off_t FileSize(File file);
-extern int	FileTruncate(File file, off_t offset, uint32 wait_event_info);
-extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
+extern pgoff_t FileSize(File file);
+extern int	FileTruncate(File file, pgoff_t offset, uint32 wait_event_info);
+extern void FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info);
 extern char *FilePathName(File file);
 extern int	FileGetRawDesc(File file);
 extern int	FileGetRawFlags(File file);
@@ -186,8 +186,8 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern bool pg_file_exists(const char *name);
-extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
-extern int	pg_truncate(const char *path, off_t length);
+extern void pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes);
+extern int	pg_truncate(const char *path, pgoff_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int elevel);
@@ -196,7 +196,7 @@ extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
 static inline ssize_t
-FileRead(File file, void *buffer, size_t amount, off_t offset,
+FileRead(File file, void *buffer, size_t amount, pgoff_t offset,
 		 uint32 wait_event_info)
 {
 	struct iovec iov = {
@@ -208,7 +208,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 }
 
 static inline ssize_t
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	struct iovec iov = {
diff --git a/src/port/win32pread.c b/src/port/win32pread.c
index 32d56c462e..1f00dfd8e6 100644
--- a/src/port/win32pread.c
+++ b/src/port/win32pread.c
@@ -17,7 +17,7 @@
 #include <windows.h>
 
 ssize_t
-pg_pread(int fd, void *buf, size_t size, off_t offset)
+pg_pread(int fd, void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -30,16 +30,16 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
 		return -1;
 	}
 
-	/* Avoid overflowing DWORD. */
+	/* Avoid overflowing DWORD */
 	size = Min(size, 1024 * 1024 * 1024);
 
-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);
+
 	if (!ReadFile(handle, buf, size, &result, &overlapped))
 	{
 		if (GetLastError() == ERROR_HANDLE_EOF)
 			return 0;
-
 		_dosmaperr(GetLastError());
 		return -1;
 	}
diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c
index 249aa6c468..d9a0d23c2b 100644
--- a/src/port/win32pwrite.c
+++ b/src/port/win32pwrite.c
@@ -15,9 +15,8 @@
 #include "c.h"
 
 #include <windows.h>
-
 ssize_t
-pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
+pg_pwrite(int fd, const void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -30,11 +29,12 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
 		return -1;
 	}
 
-	/* Avoid overflowing DWORD. */
+	/* Avoid overflowing DWORD */
 	size = Min(size, 1024 * 1024 * 1024);
 
-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);
+
 	if (!WriteFile(handle, buf, size, &result, &overlapped))
 	{
 		_dosmaperr(GetLastError());
-- 
2.46.0.windows.1

#2Michael Paquier
michael@paquier.xyz
In reply to: Bryan Green (#1)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Tue, Oct 28, 2025 at 09:42:11AM -0500, Bryan Green wrote:

I found two related bugs in PostgreSQL's Windows port that prevent files
from exceeding 2GB. While unlikely to affect most installations (default 1GB
segments), the code is objectively wrong and worth fixing.

The first bug is a pervasive use of off_t where pgoff_t should be used. On
Windows, off_t is only 32-bit, causing signed integer overflow at exactly
2GB (2^31 bytes). PostgreSQL already defined pgoff_t as __int64 for this
purpose and some function declarations in headers already use it, but the
implementations weren't updated to match.

Ugh. That's the same problem as "long", which is 8 bytes wide
everywhere except WIN32. Removing traces of "long" from the code has
been a continuous effort over the years because of these silent
overflow issues.

After fixing all those off_t issues, there's a second bug at 4GB in the
Windows implementations of pg_pwrite()/pg_pread() in win32pwrite.c and
win32pread.c. The current implementation uses an OVERLAPPED structure for
positioned I/O, but only sets the Offset field (low 32 bits), leaving
OffsetHigh at zero. This works up to 4GB by accident, but beyond that,
offsets wrap around.

I can reproduce both bugs reliably with --with-segsize=8. The file grows to
exactly 2GB and fails with "could not extend file: Invalid argument" despite
having 300GB free. After fixing the off_t issues, it grows to exactly 4GB
and hits the OVERLAPPED bug. Both are independently verifiable.

The most popular option in terms of installation on Windows is the EDB
installer, where I bet that a file segment size of 1GB is what's
embedded in the code compiled. This argument would not hold with WAL
segment sizes if we begin to support even higher sizes than 1GB at
some point, and we use pwrite() in the WAL insert code. That should
not be a problem even in the near future.

It's safe for all platforms since pgoff_t equals off_t on Unix where off_t
is already 64-bit. Only Windows behavior changes.

win32_port.h and port.h say so, yeah.

Not urgent since few people hit this in practice, but it's clearly wrong
code.
Someone building with larger segments would see failures at 2GB and
potential corruption at 4GB. Windows supports files up to 16 exabytes - no
good reason to limit PostgreSQL to 2GB.

The same kind of limitations with 4GB files existed with stat() and
fstat(), but it was much more complicated than what you are doing
here, where COPY was not able to work with files larger than 4GB on
WIN32. See the saga from bed90759fcbc.

I have attached the patch to fix the relation extension problems for Windows
to this email.

Can provide the other patches that changes off_t for pgoff_t in the rest of
the code if there's interest in fixing this.

Yeah, I think that we should rip out these issues, and move to the
more portable pgoff_t across the board. I doubt that any of this
could be backpatched due to the potential ABI breakages these
signatures changes would cause. Implementing things in incremental
steps is more sensible when it comes to such changes, as a revert
blast can be reduced if a portion is incorrectly handled.

I'm seeing as well the things you are pointing out in buffile.c. These
should be fixed as well. The WAL code is less annoying due to the 1GB
WAL segment size limit, still consistency across the board makes the
code easier to reason about, at least.

Among the files you have mentioned, there is also copydir.c.

pg_rewind seems also broken with files larger than 4GB, from what I
can see in libpq_source.c and filemap.c. Worse. Oops.

-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);

Based on the docs at [1], yes, this change makes sense.

It seems to me that a couple of extra code paths should be handled in
the first patch, and I have spotted three of them. None of them is
critical as they are related to WAL segments, where the problem is just
masked and the code inconsistent:
- xlogrecovery.c, pg_pread() called with a cast to off_t. WAL
segments have a max size of 1GB, meaning that we're OK.
- xlogreader.c, pg_pread() with a cast to off_t.
- walreceiver.c, pg_pwrite().

Except for these three spots, the first patch looks like a cut good
enough on its own.

Glad to see someone who takes the time to work on this kind of
stuff with portability in mind, by the way.

[1]: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped
--
Michael

#3Bryan Green
dbryan.green@gmail.com
In reply to: Michael Paquier (#2)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On 11/5/2025 11:05 PM, Michael Paquier wrote:

On Tue, Oct 28, 2025 at 09:42:11AM -0500, Bryan Green wrote:

I found two related bugs in PostgreSQL's Windows port that prevent files
from exceeding 2GB. While unlikely to affect most installations (default 1GB
segments), the code is objectively wrong and worth fixing.

The first bug is a pervasive use of off_t where pgoff_t should be used. On
Windows, off_t is only 32-bit, causing signed integer overflow at exactly
2GB (2^31 bytes). PostgreSQL already defined pgoff_t as __int64 for this
purpose and some function declarations in headers already use it, but the
implementations weren't updated to match.

Ugh. That's the same problem as "long", which is 8 bytes wide
everywhere except WIN32. Removing traces of "long" from the code has
been a continuous effort over the years because of these silent
overflow issues.

Exactly - these silent overflows are particularly nasty since they only
manifest under specific conditions (large files on Windows) and can
cause data corruption rather than immediate crashes.
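
As a rough illustration of the truncation discussed above, the following sketch computes a block offset in 64 bits and then passes it through a 32-bit parameter. The names (SKETCH_BLCKSZ, narrow_callee, block_offset, offset_seen_by_callee) are hypothetical and int32_t merely stands in for a 32-bit off_t; none of this is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* assumed block size (PostgreSQL's default) */

/* Hypothetical callee whose offset parameter is only 32 bits wide. */
static int32_t
narrow_callee(int32_t offset)
{
	return offset;
}

/* Compute a block's byte offset in 64 bits, so the arithmetic is correct. */
static int64_t
block_offset(uint32_t blocknum)
{
	int64_t		off = (int64_t) SKETCH_BLCKSZ * blocknum;

	assert(off >= 0);			/* 8192 * 2^32 is far below 2^63 */
	return off;
}

/* What the callee receives once the value is squeezed into 32 bits. */
static int32_t
offset_seen_by_callee(uint32_t blocknum)
{
	/*
	 * Converting an out-of-range value to a signed 32-bit type is
	 * implementation-defined; common compilers wrap modulo 2^32.
	 */
	return narrow_callee((int32_t) block_offset(blocknum));
}
```

With 8kB blocks, block 262144 sits at exactly 2^31 bytes. The 64-bit computation is right, but on compilers that wrap, the narrow callee sees a negative offset, which would be consistent with the "Invalid argument" failure reported at 2GB.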

After fixing all those off_t issues, there's a second bug at 4GB in the
Windows implementations of pg_pwrite()/pg_pread() in win32pwrite.c and
win32pread.c. The current implementation uses an OVERLAPPED structure for
positioned I/O, but only sets the Offset field (low 32 bits), leaving
OffsetHigh at zero. This works up to 4GB by accident, but beyond that,
offsets wrap around.

I can reproduce both bugs reliably with --with-segsize=8. The file grows to
exactly 2GB and fails with "could not extend file: Invalid argument" despite
having 300GB free. After fixing the off_t issues, it grows to exactly 4GB
and hits the OVERLAPPED bug. Both are independently verifiable.

The most popular option in terms of installation on Windows is the EDB
installer, where I bet that a file segment size of 1GB is what's
embedded in the code compiled. This argument would not hold with WAL

Right, which is why this has gone unnoticed. The 1GB default masks both
bugs completely. It's only when someone uses --with-segsize > 2 that the
issues appear.

segment sizes if we begin to support even higher sizes than 1GB at
some point, and we use pwrite() in the WAL insert code. That should
not be a problem even in the near future.

It's safe for all platforms since pgoff_t equals off_t on Unix where off_t
is already 64-bit. Only Windows behavior changes.

win32_port.h and port.h say so, yeah.
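
A simplified sketch of the arrangement being referred to, not a copy of the real headers: the type name sketch_pgoff_t and the helper are hypothetical, and the non-Windows branch assumes a platform whose off_t is already 64 bits wide.

```c
#include <assert.h>
#include <sys/types.h>

#ifdef _WIN32
typedef __int64 sketch_pgoff_t;		/* off_t is only 32 bits wide here */
#else
typedef off_t sketch_pgoff_t;		/* assumed to be 64 bits already */
#endif

/* Fail the build if the assumption above does not hold. */
static_assert(sizeof(sketch_pgoff_t) == 8,
			  "sketch_pgoff_t is expected to be 64 bits wide");

/* Width of the sketch type in bits. */
static int
sketch_pgoff_bits(void)
{
	return (int) sizeof(sketch_pgoff_t) * 8;
}
```

Because the typedef collapses to plain off_t outside Windows, switching a signature from off_t to such a type changes nothing on those platforms.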

Not urgent since few people hit this in practice, but it's clearly wrong
code.
Someone building with larger segments would see failures at 2GB and
potential corruption at 4GB. Windows supports files up to 16 exabytes - no
good reason to limit PostgreSQL to 2GB.

The same kind of limitations with 4GB files existed with stat() and
fstat(), but it was much more complicated than what you are doing
here, where COPY was not able to work with files larger than 4GB on
WIN32. See the saga from bed90759fcbc.

I have attached the patch to fix the relation extension problems for Windows
to this email.

Can provide the other patches that change off_t for pgoff_t in the rest of
the code if there's interest in fixing this.

Yeah, I think that we should rip out these issues, and move to the
more portable pgoff_t across the board. I doubt that any of this
could be backpatched due to the potential ABI breakages these
signature changes would cause. Implementing things in incremental
steps is more sensible when it comes to such changes, as a revert
blast can be reduced if a portion is incorrectly handled.

I'm seeing as well the things you are pointing out in buffile.c. These
should be fixed as well. The WAL code is less annoying due to the 1GB
WAL segment size limit, still consistency across the board makes the
code easier to reason about, at least.

Among the files you have mentioned, there is also copydir.c.

pg_rewind seems also broken with files larger than 4GB, from what I
can see in libpq_source.c and filemap.c.. Worse. Oops.

-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);

Based on the docs at [1], yes, this change makes sense.

It seems to me that a couple of extra code paths should be handled in
the first patch, and I have spotted three of them. None of them are
critical as they are related to WAL segments, just become masked and
inconsistent:
- xlogrecovery.c, pg_pread() called with a cast to off_t. WAL
segments have a max size of 1GB, meaning that we're OK.
- xlogreader.c, pg_pread() with a cast to off_t.
- walreceiver.c, pg_pwrite().

I'll include these in the first patch for consistency even though
they're not currently problematic. Better to fix all the function call
sites together rather than leaving known inconsistencies.

Except for these three spots, the first patch looks like a cut good
enough on its own.

Glad to see someone who takes time to spend time on this kind of
stuff with portability in mind, by the way.

Windows portability issues tend to hide in corners like this. I'll
prepare the updated patch series addressing your feedback and post v2
shortly.

[1]: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped
--
Michael

Thanks for taking the time to look over this patch.

--
Bryan Green
EDB: https://www.enterprisedb.com

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Bryan Green (#1)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Wed, Oct 29, 2025 at 3:42 AM Bryan Green <dbryan.green@gmail.com> wrote:

That said, I'm finding off_t used in many other places throughout the
codebase - buffile.c, various other file utilities such as backup and
archive, probably more. This is likely causing latent bugs elsewhere on
Windows, though most are masked by the 1GB default segment size. I'm
investigating the full scope, but I think this needs to be broken up
into multiple patches. The core file I/O layer (fd.c, md.c,
pg_pwrite/pg_pread) should probably go first since that's what's
actively breaking file extension.

The way I understand this situation, there are two kinds of file I/O,
with respect to large files:

1. Some places *have* to deal with large files (eg navigating in a
potentially large tar file), and there we should already be using
pgoff_t and the relevant system call wrappers should be using the
int64_t stuff Windows provides. These are primarily frontend code.
2. Some places use segmentation *specifically because* there are
systems with 32 bit off_t. These are mostly backend code dealing with
relation data files. The only system left with narrow off_t is
Windows.

In reality the stuff in category 1 has been developed through a
process of bug reports and patches (970b97e and 970b97e^ spring to
mind as the most recent case I had something to do with, but see also
stat()-related stuff, and see aa5518304 where we addressed the one
spot in buffile.c that had to consider multiple segments). But the
fact that Windows can't use segments > 2GB because the fd.c and
smgr.c/md.c layers work with off_t is certainly a well known
limitation, ie specifically that relation and temporary/buf files are
special in this way. I'm mostly baffled by the fact that --with-segsize
actually *lets* you set it higher than 2 on that platform. Perhaps we
should at least backpatch a configure check or static assertion to
block that? It's not good if it compiles but doesn't actually work.
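
One possible shape for such a check, as a sketch only: the constants SKETCH_BLCKSZ and SKETCH_RELSEG_SIZE and the function segment_size_fits are hypothetical, and the sketch assumes that the full segment size in bytes (not just the last block's offset) must be representable in the offset type, which would reject a 2GB segment with a 32-bit off_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

#define SKETCH_BLCKSZ		8192	/* assumed block size in bytes */
#define SKETCH_RELSEG_SIZE	131072	/* assumed blocks per segment (1GB) */

/* Compile-time form: refuse to build if the segment size overflows off_t. */
static_assert(sizeof(off_t) >= 8 ||
			  (int64_t) SKETCH_BLCKSZ * SKETCH_RELSEG_SIZE <= INT32_MAX,
			  "segment size does not fit in off_t on this platform");

/*
 * Run-time form of the same rule: would a segment of relseg_blocks blocks,
 * each blcksz bytes, have a size representable in a signed offset type that
 * is off_bytes wide?
 */
static bool
segment_size_fits(int off_bytes, int64_t blcksz, int64_t relseg_blocks)
{
	int64_t		max_off;

	assert(off_bytes == 4 || off_bytes == 8);
	assert(blcksz > 0);
	max_off = (off_bytes == 8) ? INT64_MAX : INT32_MAX;
	/* Divide rather than multiply so the check itself cannot overflow. */
	return relseg_blocks <= max_off / blcksz;
}
```

Under these assumptions a 1GB segment passes with a 4-byte offset type, while 2GB and 8GB segments are rejected, and all of them pass with an 8-byte type.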

For master I think it makes sense to clean this up, as you say,
because the fuzzy boundary between the two categories of file I/O is
bound to cause more problems, it's just unfinished business that has
been tackled piecemeal as required by bug reports... In fact, on a
thread[1] where I explored making the segment size a runtime option
specified at initdb time, I even posted patches much like yours in the
first version, spreading pgoff_t into more places, and then in a later
version it was suggested that it might be better to just block
settings that are too big for your off_t, so I did that. I probably
thought that we already did that somewhere for the current
compile-time constant...

Not urgent since few people hit this in practice, but it's clearly wrong
code.

Yeah. In my experience dealing with bug reports, the Windows users
community skews very heavily towards just consuming EDB's ready-built
installer. We rarely hear about configuration-level problems, so I
suppose it's not surprising that no one has ever complained that it
lets you configure it in a way that we hackers all know is certainly
going to break.

[1]: /messages/by-id/CA+hUKG+BGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0=m6dDiA@mail.gmail.com

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#4)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Thu, Nov 6, 2025 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:

The only system left with narrow off_t is Windows.

By the way, that's not always true: Meson + MinGW have a 64-bit off_t
on Windows, because meson decides to set -D_FILE_OFFSET_BITS=64 for
all C compilers except msvc[1] (only other exclusion is macOS, but
that is 64-bit only these days; there are other systems like FreeBSD
where sizeof(off_t) is always 8 but it doesn't seem to know about that
or bother to check), and MinGW's headers react to that. I suspect
autoconf's AC_SYS_LARGEFILE would do that too with MinGW, IDK, but
configure.ac doesn't call it for win32 by special condition. That
creates a strange difference between meson and autoconf builds IMHO,
but if we resolve that in the only direction possible AFAICS we'd
still have a strange difference between MSVC and MinGW.
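
A small probe along these lines can show which off_t width a given toolchain ends up with. It is a sketch: the helper off_t_bits is a hypothetical name, and the static assertion encodes the expectation for toolchains whose headers honor _FILE_OFFSET_BITS (or where off_t is 64 bits regardless). With MSVC, where the macro has no effect, the assertion would be expected to fail at compile time.

```c
/* Must be defined before any system header is included. */
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <sys/types.h>

/* Expectation for headers that react to _FILE_OFFSET_BITS. */
static_assert(sizeof(off_t) == 8,
			  "off_t is expected to be 64 bits wide with this toolchain");

/* Report the width of off_t in bits as seen by this translation unit. */
static int
off_t_bits(void)
{
	return (int) sizeof(off_t) * 8;
}
```

Since the macro is per translation unit, two objects built with different settings can disagree about off_t, which is the library-boundary hazard raised in this message.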

Observing that mess, I kinda wonder what would happen if we just used
a big hammer to redefine off_t to be __int64 ourselves. On the face
of it, it sounds like an inherently bad idea that could bite you when
interacting with libraries whose headers use off_t. On the other
hand, the world of open source libraries we care about might already
be resistant to that chaos, if libraries are being built with and
without -D_FILE_OFFSET_BITS=64 willy-nilly, or they actually can't
deal with large files at all in which case that's something we'd have
to deal with whatever we do. I don't know, it's just a thought that
occurred to me while contemplating how unpleasant it is to splatter
pgoff_t all over our tree, and yet *still* have to tread very
carefully with the boundaries of external libraries that might be
using off_t, researching each case...

[1]: https://github.com/mesonbuild/meson/blob/97a1c567c9813176e4bec40f6055f228b2121609/mesonbuild/compilers/compilers.py#L1144

#6Bryan Green
dbryan.green@gmail.com
In reply to: Thomas Munro (#4)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On 11/6/2025 3:20 AM, Thomas Munro wrote:

On Wed, Oct 29, 2025 at 3:42 AM Bryan Green <dbryan.green@gmail.com> wrote:

That said, I'm finding off_t used in many other places throughout the
codebase - buffile.c, various other file utilities such as backup and
archive, probably more. This is likely causing latent bugs elsewhere on
Windows, though most are masked by the 1GB default segment size. I'm
investigating the full scope, but I think this needs to be broken up
into multiple patches. The core file I/O layer (fd.c, md.c,
pg_pwrite/pg_pread) should probably go first since that's what's
actively breaking file extension.

The way I understand this situation, there are two kinds of file I/O,
with respect to large files:

1. Some places *have* to deal with large files (eg navigating in a
potentially large tar file), and there we should already be using
pgoff_t and the relevant system call wrappers should be using the
int64_t stuff Windows provides. These are primarily frontend code.
2. Some places use segmentation *specifically because* there are
systems with 32 bit off_t. These are mostly backend code dealing with
relation data files. The only system left with narrow off_t is
Windows.

In reality the stuff in category 1 has been developed through a
process of bug reports and patches (970b97e and 970b97e^ spring to
mind as the most recent case I had something to do with, but see also
stat()-related stuff, and see aa5518304 where we addressed the one
spot in buffile.c that had to consider multiple segments). But the
fact that Windows can't use segments > 2GB because the fd.c and
smgr.c/md.c layers work with off_t is certainly a well known
limitation, ie specifically that relation and temporary/buf files are
special in this way. I'm mostly baffled by the fact that --with-segsize
actually *lets* you set it higher than 2 on that platform. Perhaps we
should at least backpatch a configure check or static assertion to
block that? It's not good if it compiles but doesn't actually work.

I agree that the backpatch should just block setting --with-segsize > 2GB
on Windows.

For master I think it makes sense to clean this up, as you say,
because the fuzzy boundary between the two categories of file I/O is
bound to cause more problems, it's just unfinished business that has
been tackled piecemeal as required by bug reports... In fact, on a
thread[1] where I explored making the segment size a runtime option
specified at initdb time, I even posted patches much like yours in the
first version, spreading pgoff_t into more places, and then in a later
version it was suggested that it might be better to just block
settings that are too big for your off_t, so I did that. I probably
thought that we already did that somewhere for the current
compile-time constant...

For master, I'd like to proceed with the cleanup approach - spreading
pgoff_t into the core I/O layer (fd.c, md.c, pg_pread/pg_pwrite
wrappers, etc). That would let us eliminate the artificial 2GB ceiling
on Windows and clean up the file I/O category boundary.

Not urgent since few people hit this in practice, but it's clearly wrong
code.

Yeah. In my experience dealing with bug reports, the Windows users
community skews very heavily towards just consuming EDB's ready-built
installer. We rarely hear about configuration-level problems, so I
suppose it's not surprising that no one has ever complained that it
lets you configure it in a way that we hackers all know is certainly
going to break.

[1] /messages/by-id/CA+hUKG+BGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0=m6dDiA@mail.gmail.com

Thanks for the feedback.

--
Bryan Green
EDB: https://www.enterprisedb.com

#7Bryan Green
dbryan.green@gmail.com
In reply to: Michael Paquier (#2)
1 attachment(s)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On 11/5/2025 11:05 PM, Michael Paquier wrote:

On Tue, Oct 28, 2025 at 09:42:11AM -0500, Bryan Green wrote:

I found two related bugs in PostgreSQL's Windows port that prevent files
from exceeding 2GB. While unlikely to affect most installations (default 1GB
segments), the code is objectively wrong and worth fixing.

The first bug is a pervasive use of off_t where pgoff_t should be used. On
Windows, off_t is only 32-bit, causing signed integer overflow at exactly
2GB (2^31 bytes). PostgreSQL already defined pgoff_t as __int64 for this
purpose and some function declarations in headers already use it, but the
implementations weren't updated to match.

Ugh. That's the same problem as "long", which is 8 bytes wide
everywhere except WIN32. Removing traces of "long" from the code has
been a continuous effort over the years because of these silent
overflow issues.

After fixing all those off_t issues, there's a second bug at 4GB in the
Windows implementations of pg_pwrite()/pg_pread() in win32pwrite.c and
win32pread.c. The current implementation uses an OVERLAPPED structure for
positioned I/O, but only sets the Offset field (low 32 bits), leaving
OffsetHigh at zero. This works up to 4GB by accident, but beyond that,
offsets wrap around.

I can reproduce both bugs reliably with --with-segsize=8. The file grows to
exactly 2GB and fails with "could not extend file: Invalid argument" despite
having 300GB free. After fixing the off_t issues, it grows to exactly 4GB
and hits the OVERLAPPED bug. Both are independently verifiable.

The most popular option in terms of installation on Windows is the EDB
installer, where I bet that a file segment size of 1GB is what's
embedded in the code compiled. This argument would not hold with WAL
segment sizes if we begin to support even higher sizes than 1GB at
some point, and we use pwrite() in the WAL insert code. That should
not be a problem even in the near future.

It's safe for all platforms since pgoff_t equals off_t on Unix where off_t
is already 64-bit. Only Windows behavior changes.

win32_port.h and port.h say so, yeah.

Not urgent since few people hit this in practice, but it's clearly wrong
code.
Someone building with larger segments would see failures at 2GB and
potential corruption at 4GB. Windows supports files up to 16 exabytes - no
good reason to limit PostgreSQL to 2GB.

The same kind of limitations with 4GB files existed with stat() and
fstat(), but it was much more complicated than what you are doing
here, where COPY was not able to work with files larger than 4GB on
WIN32. See the saga from bed90759fcbc.

I have attached the patch to fix the relation extension problems for Windows
to this email.

Can provide the other patches that change off_t for pgoff_t in the rest of
the code if there's interest in fixing this.

Yeah, I think that we should rip out these issues, and move to the
more portable pgoff_t across the board. I doubt that any of this
could be backpatched due to the potential ABI breakages these
signature changes would cause. Implementing things in incremental
steps is more sensible when it comes to such changes, as a revert
blast can be reduced if a portion is incorrectly handled.

I'm seeing as well the things you are pointing out in buffile.c. These
should be fixed as well. The WAL code is less annoying due to the 1GB
WAL segment size limit, still consistency across the board makes the
code easier to reason about, at least.

Among the files you have mentioned, there is also copydir.c.

pg_rewind seems also broken with files larger than 4GB, from what I
can see in libpq_source.c and filemap.c.. Worse. Oops.

-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);

Based on the docs at [1], yes, this change makes sense.

It seems to me that a couple of extra code paths should be handled in
the first patch, and I have spotted three of them. None of them are
critical as they are related to WAL segments, just become masked and
inconsistent:
- xlogrecovery.c, pg_pread() called with a cast to off_t. WAL
segments have a max size of 1GB, meaning that we're OK.
- xlogreader.c, pg_pread() with a cast to off_t.
- walreceiver.c, pg_pwrite().

Except for these three spots, the first patch looks like a cut good
enough on its own.

Latest patch attached that includes these code paths.

Glad to see someone who takes time to spend time on this kind of
stuff with portability in mind, by the way.

[1]: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped
--
Michael

Thanks for the quick review.

--
Bryan Green
EDB: https://www.enterprisedb.com

Attachments:

v2-0001-Fix-Windows-file-IO.patch (text/plain; charset=UTF-8)
From d3f7543a35b3b72a7069188302cbfc7e4de9120b Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 6 Nov 2025 10:56:02 -0600
Subject: [PATCH] Fix Windows file I/O to support files larger than 2GB

PostgreSQL's Windows port has been unable to handle files larger than 2GB
due to pervasive use of off_t for file offsets, which is only 32-bit on
Windows. This causes signed integer overflow at exactly 2^31 bytes.

The codebase already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
implementations still used off_t.

This issue is unlikely to affect most users since the default RELSEG_SIZE
is 1GB, keeping individual segment files small. However, anyone building
with --with-segsize larger than 2 would hit this bug. Tested with
--with-segsize=8 and verified that files can now grow beyond 4GB.

This version also addresses three additional code paths in WAL handling
that used casts to off_t when calling pg_pread() or pg_pwrite():
- xlogrecovery.c: pg_pread() called with cast to off_t
- xlogreader.c: pg_pread() with cast to off_t
- walreceiver.c: pg_pwrite() with cast to off_t

While these are not critical (WAL segments have a max size of 1GB), the
casts are now corrected to pgoff_t for consistency and to avoid any
potential future issues.

Note: off_t is still used in other parts of the codebase (e.g. buffile.c)
which may have similar issues on Windows, but those are outside the
critical path for relation file extension and can be addressed separately.

On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.
---
 src/backend/access/transam/xlogreader.c   |  2 +-
 src/backend/access/transam/xlogrecovery.c |  2 +-
 src/backend/replication/walreceiver.c     |  2 +-
 src/backend/storage/file/fd.c             | 38 ++++++++---------
 src/backend/storage/smgr/md.c             | 50 +++++++++++------------
 src/common/file_utils.c                   |  4 +-
 src/include/common/file_utils.h           |  4 +-
 src/include/port/pg_iovec.h               |  4 +-
 src/include/port/win32_port.h             |  4 +-
 src/include/storage/fd.h                  | 26 ++++++------
 src/port/win32pread.c                     | 10 ++---
 src/port/win32pwrite.c                    | 10 ++---
 12 files changed, 78 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index dcc8d4f9c1..8ea837003f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1574,7 +1574,7 @@ WALRead(XLogReaderState *state,
 
 		/* Reset errno first; eases reporting non-errno-affecting errors */
 		errno = 0;
-		readbytes = pg_pread(state->seg.ws_file, p, segbytes, (off_t) startoff);
+		readbytes = pg_pread(state->seg.ws_file, p, segbytes, (pgoff_t) startoff);
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 550de6e4a5..c723d03d96 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3429,7 +3429,7 @@ retry:
 	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (pgoff_t) readOff);
 	if (r != XLOG_BLCKSZ)
 	{
 		char		fname[MAXFNAMELEN];
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7361ffc9dc..ec243db3a4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -928,7 +928,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 		start = pgstat_prepare_io_time(track_wal_io_timing);
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		byteswritten = pg_pwrite(recvFile, buf, segbytes, (pgoff_t) startoff);
 		pgstat_report_wait_end();
 
 		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a4ec7959f3..b25e74831e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -201,7 +201,7 @@ typedef struct vfd
 	File		nextFree;		/* link to next free VFD, if in freelist */
 	File		lruMoreRecently;	/* doubly linked recency-of-use list */
 	File		lruLessRecently;
-	off_t		fileSize;		/* current size of file (0 if not temporary) */
+	pgoff_t		fileSize;		/* current size of file (0 if not temporary) */
 	char	   *fileName;		/* name of file, or NULL for unused VFD */
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
 	int			fileFlags;		/* open(2) flags for (re)opening the file */
@@ -519,7 +519,7 @@ pg_file_exists(const char *name)
  * offset of 0 with nbytes 0 means that the entire file should be flushed
  */
 void
-pg_flush_data(int fd, off_t offset, off_t nbytes)
+pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes)
 {
 	/*
 	 * Right now file flushing is primarily used to avoid making later
@@ -635,7 +635,7 @@ retry:
 		 * may simply not be enough address space.  If so, silently fall
 		 * through to the next implementation.
 		 */
-		if (nbytes <= (off_t) SSIZE_MAX)
+		if (nbytes <= (pgoff_t) SSIZE_MAX)
 			p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);
 		else
 			p = MAP_FAILED;
@@ -697,7 +697,7 @@ retry:
  * Truncate an open file to a given length.
  */
 static int
-pg_ftruncate(int fd, off_t length)
+pg_ftruncate(int fd, pgoff_t length)
 {
 	int			ret;
 
@@ -714,7 +714,7 @@ retry:
  * Truncate a file to a given length by name.
  */
 int
-pg_truncate(const char *path, off_t length)
+pg_truncate(const char *path, pgoff_t length)
 {
 	int			ret;
 #ifdef WIN32
@@ -1526,7 +1526,7 @@ FileAccess(File file)
  * Called whenever a temporary file is deleted to report its size.
  */
 static void
-ReportTemporaryFileUsage(const char *path, off_t size)
+ReportTemporaryFileUsage(const char *path, pgoff_t size)
 {
 	pgstat_report_tempfile(size);
 
@@ -2077,7 +2077,7 @@ FileClose(File file)
  * this.
  */
 int
-FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	Assert(FileIsValid(file));
 
@@ -2108,7 +2108,7 @@ retry:
 	{
 		struct radvisory
 		{
-			off_t		ra_offset;	/* offset into the file */
+			pgoff_t		ra_offset;	/* offset into the file */
 			int			ra_count;	/* size of the read     */
 		}			ra;
 		int			returnCode;
@@ -2133,7 +2133,7 @@ retry:
 }
 
 void
-FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
+FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info)
 {
 	int			returnCode;
 
@@ -2159,7 +2159,7 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 ssize_t
-FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+FileReadV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	ssize_t		returnCode;
@@ -2216,7 +2216,7 @@ retry:
 
 int
 FileStartReadV(PgAioHandle *ioh, File file,
-			   int iovcnt, off_t offset,
+			   int iovcnt, pgoff_t offset,
 			   uint32 wait_event_info)
 {
 	int			returnCode;
@@ -2241,7 +2241,7 @@ FileStartReadV(PgAioHandle *ioh, File file,
 }
 
 ssize_t
-FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+FileWriteV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset,
 		   uint32 wait_event_info)
 {
 	ssize_t		returnCode;
@@ -2270,7 +2270,7 @@ FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset;
+		pgoff_t		past_write = offset;
 
 		for (int i = 0; i < iovcnt; ++i)
 			past_write += iov[i].iov_len;
@@ -2309,7 +2309,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + returnCode;
+			pgoff_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
@@ -2373,7 +2373,7 @@ FileSync(File file, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	int			returnCode;
 	ssize_t		written;
@@ -2418,7 +2418,7 @@ FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 #ifdef HAVE_POSIX_FALLOCATE
 	int			returnCode;
@@ -2457,7 +2457,7 @@ retry:
 	return FileZero(file, offset, amount, wait_event_info);
 }
 
-off_t
+pgoff_t
 FileSize(File file)
 {
 	Assert(FileIsValid(file));
@@ -2468,14 +2468,14 @@ FileSize(File file)
 	if (FileIsNotOpen(file))
 	{
 		if (FileAccess(file) < 0)
-			return (off_t) -1;
+			return (pgoff_t) -1;
 	}
 
 	return lseek(VfdCache[file].fd, 0, SEEK_END);
 }
 
 int
-FileTruncate(File file, off_t offset, uint32 wait_event_info)
+FileTruncate(File file, pgoff_t offset, uint32 wait_event_info)
 {
 	int			returnCode;
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 235ba7e191..e3f335a834 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -487,7 +487,7 @@ void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
 
@@ -515,9 +515,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
@@ -578,7 +578,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 	while (remblocks > 0)
 	{
 		BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
-		off_t		seekpos = (off_t) BLCKSZ * segstartblock;
+		pgoff_t		seekpos = (pgoff_t) BLCKSZ * segstartblock;
 		int			numblocks;
 
 		if (segstartblock + remblocks > RELSEG_SIZE)
@@ -607,7 +607,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			int			ret;
 
 			ret = FileFallocate(v->mdfd_vfd,
-								seekpos, (off_t) BLCKSZ * numblocks,
+								seekpos, (pgoff_t) BLCKSZ * numblocks,
 								WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret != 0)
 			{
@@ -630,7 +630,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			 * whole length of the extension.
 			 */
 			ret = FileZero(v->mdfd_vfd,
-						   seekpos, (off_t) BLCKSZ * numblocks,
+						   seekpos, (pgoff_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret < 0)
 				ereport(ERROR,
@@ -745,7 +745,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	while (nblocks > 0)
 	{
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
 		int			nblocks_this_segment;
 
@@ -754,9 +754,9 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		if (v == NULL)
 			return false;
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -851,7 +851,7 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	{
 		struct iovec iov[PG_IOV_MAX];
 		int			iovcnt;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		int			nbytes;
 		MdfdVec    *v;
 		BlockNumber nblocks_this_segment;
@@ -861,9 +861,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, false,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -986,7 +986,7 @@ mdstartreadv(PgAioHandle *ioh,
 			 SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 void **buffers, BlockNumber nblocks)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	MdfdVec    *v;
 	BlockNumber nblocks_this_segment;
 	struct iovec *iov;
@@ -996,9 +996,9 @@ mdstartreadv(PgAioHandle *ioh,
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	nblocks_this_segment =
 		Min(nblocks,
@@ -1068,7 +1068,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	{
 		struct iovec iov[PG_IOV_MAX];
 		int			iovcnt;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		int			nbytes;
 		MdfdVec    *v;
 		BlockNumber nblocks_this_segment;
@@ -1078,9 +1078,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -1173,7 +1173,7 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 	while (nblocks > 0)
 	{
 		BlockNumber nflush = nblocks;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
 		int			segnum_start,
 					segnum_end;
@@ -1202,9 +1202,9 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		Assert(nflush >= 1);
 		Assert(nflush <= nblocks);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
+		FileWriteback(v->mdfd_vfd, seekpos, (pgoff_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
 
 		nblocks -= nflush;
 		blocknum += nflush;
@@ -1348,7 +1348,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 			 */
 			BlockNumber lastsegblocks = nblocks - priorblocks;
 
-			if (FileTruncate(v->mdfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+			if (FileTruncate(v->mdfd_vfd, (pgoff_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not truncate file \"%s\" to %u blocks: %m",
@@ -1484,9 +1484,9 @@ mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL);
 
-	*off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	*off = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(*off < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	return FileGetRawDesc(v->mdfd_vfd);
 }
@@ -1868,7 +1868,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 static BlockNumber
 _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
-	off_t		len;
+	pgoff_t		len;
 
 	len = FileSize(seg->mdfd_vfd);
 	if (len < 0)
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 7b62687a2a..cdf08ab5cb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -656,7 +656,7 @@ compute_remaining_iovec(struct iovec *destination,
  * error is returned, it is unspecified how much has been written.
  */
 ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 	struct iovec iov_copy[PG_IOV_MAX];
 	ssize_t		sum = 0;
@@ -706,7 +706,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * is returned with errno set.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size, off_t offset)
+pg_pwrite_zeros(int fd, size_t size, pgoff_t offset)
 {
 	static const PGIOAlignedBlock zbuffer = {0};	/* worth BLCKSZ */
 	void	   *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data;
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 9fd88953e4..4239713803 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -55,9 +55,9 @@ extern int	compute_remaining_iovec(struct iovec *destination,
 extern ssize_t pg_pwritev_with_retry(int fd,
 									 const struct iovec *iov,
 									 int iovcnt,
-									 off_t offset);
+									 pgoff_t offset);
 
-extern ssize_t pg_pwrite_zeros(int fd, size_t size, off_t offset);
+extern ssize_t pg_pwrite_zeros(int fd, size_t size, pgoff_t offset);
 
 /* Filename components */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index 90be3af449..845ded8c71 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -51,7 +51,7 @@ struct iovec
  * this changes the current file position.
  */
 static inline ssize_t
-pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 #if HAVE_DECL_PREADV
 	/*
@@ -90,7 +90,7 @@ pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * this changes the current file position.
  */
 static inline ssize_t
-pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 #if HAVE_DECL_PWRITEV
 	/*
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index ff7028bdc8..f54ccef7db 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -584,9 +584,9 @@ typedef unsigned short mode_t;
 #endif
 
 /* in port/win32pread.c */
-extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, pgoff_t offset);
 
 /* in port/win32pwrite.c */
-extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, pgoff_t offset);
 
 #endif							/* PG_WIN32_PORT_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index b77d8e5e30..3e821ce8fb 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -108,17 +108,17 @@ extern File PathNameOpenFile(const char *fileName, int fileFlags);
 extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
-extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
-extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
-extern int	FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset, uint32 wait_event_info);
+extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset, uint32 wait_event_info);
+extern int	FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, pgoff_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
-extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
+extern int	FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern int	FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
 
-extern off_t FileSize(File file);
-extern int	FileTruncate(File file, off_t offset, uint32 wait_event_info);
-extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
+extern pgoff_t FileSize(File file);
+extern int	FileTruncate(File file, pgoff_t offset, uint32 wait_event_info);
+extern void FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info);
 extern char *FilePathName(File file);
 extern int	FileGetRawDesc(File file);
 extern int	FileGetRawFlags(File file);
@@ -186,8 +186,8 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern bool pg_file_exists(const char *name);
-extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
-extern int	pg_truncate(const char *path, off_t length);
+extern void pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes);
+extern int	pg_truncate(const char *path, pgoff_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int elevel);
@@ -196,7 +196,7 @@ extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
 static inline ssize_t
-FileRead(File file, void *buffer, size_t amount, off_t offset,
+FileRead(File file, void *buffer, size_t amount, pgoff_t offset,
 		 uint32 wait_event_info)
 {
 	struct iovec iov = {
@@ -208,7 +208,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 }
 
 static inline ssize_t
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	struct iovec iov = {
diff --git a/src/port/win32pread.c b/src/port/win32pread.c
index 32d56c462e..1f00dfd8e6 100644
--- a/src/port/win32pread.c
+++ b/src/port/win32pread.c
@@ -17,7 +17,7 @@
 #include <windows.h>
 
 ssize_t
-pg_pread(int fd, void *buf, size_t size, off_t offset)
+pg_pread(int fd, void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -30,16 +30,16 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
 		return -1;
 	}
 
-	/* Avoid overflowing DWORD. */
+	/* Avoid overflowing DWORD */
 	size = Min(size, 1024 * 1024 * 1024);
 
-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);
+
 	if (!ReadFile(handle, buf, size, &result, &overlapped))
 	{
 		if (GetLastError() == ERROR_HANDLE_EOF)
 			return 0;
-
 		_dosmaperr(GetLastError());
 		return -1;
 	}
diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c
index 249aa6c468..d9a0d23c2b 100644
--- a/src/port/win32pwrite.c
+++ b/src/port/win32pwrite.c
@@ -15,9 +15,8 @@
 #include "c.h"
 
 #include <windows.h>
-
 ssize_t
-pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
+pg_pwrite(int fd, const void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -30,11 +29,12 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
 		return -1;
 	}
 
-	/* Avoid overflowing DWORD. */
+	/* Avoid overflowing DWORD */
 	size = Min(size, 1024 * 1024 * 1024);
 
-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);
+
 	if (!WriteFile(handle, buf, size, &result, &overlapped))
 	{
 		_dosmaperr(GetLastError());
-- 
2.46.0.windows.1
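
The two boundaries touched by the win32pread.c and win32pwrite.c hunks above can be illustrated outside of Windows. The following is only a sketch: the offset_parts struct and the split_offset(), join_offset() and low_half_only() helpers are hypothetical names invented for this illustration, with uint32_t standing in for DWORD and int64_t standing in for pgoff_t. It shows why both halves of the offset have to be filled in, and what offset is left if only the low half is set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two offset fields of a Win32 OVERLAPPED. */
typedef struct
{
	uint32_t	offset_low;		/* corresponds to OVERLAPPED.Offset */
	uint32_t	offset_high;	/* corresponds to OVERLAPPED.OffsetHigh */
} offset_parts;

/* Split a non-negative 64-bit file offset into its two 32-bit halves. */
static offset_parts
split_offset(int64_t offset)
{
	offset_parts parts;

	assert(offset >= 0);
	parts.offset_low = (uint32_t) offset;	/* keeps only the low 32 bits */
	parts.offset_high = (uint32_t) (offset >> 32);
	return parts;
}

/* Reassemble the 64-bit offset from both halves. */
static int64_t
join_offset(offset_parts parts)
{
	return ((int64_t) parts.offset_high << 32) | parts.offset_low;
}

/* The offset that remains when the high half is left at zero. */
static int64_t
low_half_only(int64_t offset)
{
	return (int64_t) (uint32_t) offset;
}
```

With these helpers, an offset of 5GB round-trips through split_offset() and join_offset(), whereas low_half_only() maps it to 1GB, which is the kind of silent wraparound the OffsetHigh assignment is meant to prevent.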

#8Andres Freund
andres@anarazel.de
In reply to: Bryan Green (#7)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

Hi,

On 2025-11-06 11:17:52 -0600, Bryan Green wrote:

From d3f7543a35b3b72a7069188302cbfc7e4de9120b Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 6 Nov 2025 10:56:02 -0600
Subject: [PATCH] Fix Windows file I/O to support files larger than 2GB

Could we add a testcase that actually exercises at least some of the
codepaths? We presumably wouldn't want to actually write that much data, but
it shouldn't be hard to write portable code to create a file with holes...

Greetings,

Andres Freund

#9Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#8)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Thu, Nov 06, 2025 at 12:45:30PM -0500, Andres Freund wrote:

Could we add a testcase that actually exercises at least some of the
codepaths? We presumably wouldn't want to actually write that much data, but
it shouldn't be hard to write portable code to create a file with holes...

With something that relies on a pg_pwrite() and pg_pread(), that does
not sound like an issue to me.

FWIW, I have wanted a test module that does FS-level operations for
some time. Here, we could just have thin wrappers of the write and
read calls and give the tests a way to pass arguments directly to
them via a SQL function call. That would be easier to extend
depending on what comes next. Not sure that this is absolutely
mandatory for the sake of the proposal, though, but long-term that's
something we should do more to stress the portability of the code.
--
Michael
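
The "file with holes" idea discussed in this exchange could look roughly like the sketch below on a POSIX system. It is not the test module from the patch: sparse_roundtrip() is a hypothetical helper that uses plain pwrite() and pread() rather than the pg_pwrite()/pg_pread() wrappers, and whether the skipped range really stays unallocated depends on the filesystem.

```c
#define _FILE_OFFSET_BITS 64	/* request a 64-bit off_t on 32-bit POSIX systems */

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one marker byte at 'offset' and read it back with positioned I/O.
 * Returns 1 if the byte round-trips and the file size is offset + 1,
 * 0 on a mismatch, and -1 if a system call fails.  The file is removed
 * before returning.
 */
static int
sparse_roundtrip(const char *path, int64_t offset)
{
	const char	marker = 'X';
	char		readback = 0;
	int			result = -1;
	int			fd;

	assert(offset >= 0);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Writing past EOF extends the file; the gap reads back as zeros. */
	if (pwrite(fd, &marker, 1, (off_t) offset) == 1 &&
		pread(fd, &readback, 1, (off_t) offset) == 1)
	{
		off_t		size = lseek(fd, 0, SEEK_END);

		result = (readback == marker && (int64_t) size == offset + 1);
	}

	close(fd);
	unlink(path);
	return result;
}
```

Calling it with an offset just past 2GB and again just past 4GB would exercise both boundaries from this thread while writing only a single byte each time. A Windows version would presumably also need to mark the file as sparse explicitly, which this sketch does not attempt.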

#10Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#5)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Thu, Nov 06, 2025 at 11:10:00PM +1300, Thomas Munro wrote:

Observing that mess, I kinda wonder what would happen if we just used
a big hammer to redefine off_t to be __int64 ourselves. On the face
of it, it sounds like an inherently bad idea that could bite you when
interacting with libraries whose headers use off_t. On the other
hand, the world of open source libraries we care about might already
be resistant to that chaos, if libraries are being built with and
without -D_FILE_OFFSET_BITS=64 willy-nilly, or they actually can't
deal with large files at all in which case that's something we'd have
to deal with whatever we do. I don't know, it's just a thought that
occurred to me while contemplating how unpleasant it is to splatter
pgoff_t all over our tree, and yet *still* have to tread very
carefully with the boundaries of external libraries that might be
using off_t, researching each case...

Not sure about that. It's always been something we have tackled with
the various pg_ and PG_ structures and definitions. Not sure that
this has to apply here since we already have one pgoff_t for the
purpose of portability for quite a few years already (d00a3472cfc4).
--
Michael

#11Michael Paquier
michael@paquier.xyz
In reply to: Bryan Green (#7)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Thu, Nov 06, 2025 at 11:17:52AM -0600, Bryan Green wrote:

On 11/5/2025 11:05 PM, Michael Paquier wrote:

It seems to me that a couple of extra code paths should be handled in
the first patch, and I have spotted three of them. None of them are
critical, as they relate to WAL segments; they are merely masked and
inconsistent:
- xlogrecovery.c, pg_pread() called with a cast to off_t. WAL
segments have a max size of 1GB, meaning that we're OK.
- xlogreader.c, pg_pread() with a cast to off_t.
- walreceiver.c, pg_pwrite().

Except for these three spots, the first patch looks like a good
enough cut on its own.

Latest patch attached that includes these code paths.

That feels OK to me. Thomas, do you have a different view on the
matter for HEAD? As with long, I would just switch to something
fixed-width that we already have in the tree.

And +1 for the idea to restrict the segment size to never be more than
2GB based on a ./configure and meson check on the back branches. In
PG15 and older branches, we already enforced a check by the way. See
src/tools/msvc/Solution.pm which was the only way to compile the code
with visual studio so one would have never seen the limitations except
if they had the idea to edit the perl scripts (FWIW, I've done exactly
that in the past for a past project at $company, never touched the
segsize):
    # only allow segsize 1 for now, as we can't do large files yet in windows
    die "Bad segsize $options->{segsize}"
      unless $options->{segsize} == 1;

So this is a meson issue that goes down to v16, when using a VS
compiler. Was there a different compiler where off_t is also 4 bytes?
MinGW is mentioned as clear by Thomas.
--
Michael
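
The kind of build-time restriction discussed above can be expressed as a small predicate. This is a sketch under assumptions, not the actual ./configure or meson check: segsize_fits_offset_type() is a hypothetical name, the segment size is given in gigabytes as with --with-segsize, and the criterion used here is that the full segment size in bytes must be representable in a signed offset of the given width.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a segment of 'segsize_gb' gigabytes can be described by a
 * signed offset type that is 'offset_bytes' wide (4 or 8).
 */
static bool
segsize_fits_offset_type(int segsize_gb, int offset_bytes)
{
	int64_t		segment_bytes;
	int64_t		max_offset;

	assert(segsize_gb > 0);
	assert(offset_bytes == 4 || offset_bytes == 8);

	segment_bytes = (int64_t) segsize_gb << 30;
	max_offset = (offset_bytes == 4) ? INT32_MAX : INT64_MAX;

	/* The size of a full segment must itself fit in the offset type. */
	return segment_bytes <= max_offset;
}
```

Under this criterion a 1GB segment passes with a 4-byte off_t, while a 2GB segment is already rejected because its size of 2^31 bytes exceeds INT32_MAX. That is stricter than "never more than 2GB"; where exactly a back-branch check should draw the line is a decision for the thread, not something this sketch settles.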

#12Bryan Green
dbryan.green@gmail.com
In reply to: Andres Freund (#8)
1 attachment(s)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On 11/6/25 11:45, Andres Freund wrote:

Hi,

On 2025-11-06 11:17:52 -0600, Bryan Green wrote:

From d3f7543a35b3b72a7069188302cbfc7e4de9120b Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 6 Nov 2025 10:56:02 -0600
Subject: [PATCH] Fix Windows file I/O to support files larger than 2GB

Could we add a testcase that actually exercises at least some of the
codepaths? We presumably wouldn't want to actually write that much data, but
it shouldn't be hard to write portable code to create a file with holes...

Greetings,

Andres Freund

Agreed. Added a test module called test_large_files. There is a README.

--
Bryan Green
EDB: https://www.enterprisedb.com

Attachments:

v3-0001-Fix-Windows-file-IO.patch (text/plain; charset=UTF-8)
From 76e97361f7fa45008ec524f0a83eab5c3da46506 Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 6 Nov 2025 10:56:02 -0600
Subject: [PATCH v3] Fix Windows file I/O to support files larger than 2GB

PostgreSQL's Windows port has been unable to handle files larger than 2GB
due to pervasive use of off_t for file offsets, which is only 32-bit on
Windows. This causes signed integer overflow at exactly 2^31 bytes.

The codebase already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
implementations still used off_t.

This issue is unlikely to affect most users since the default RELSEG_SIZE
is 1GB, keeping individual segment files small. However, anyone building
with --with-segsize larger than 2 would hit this bug. Tested with
--with-segsize=8 and verified that files can now grow beyond 4GB.

This version also addresses three additional code paths in WAL handling
that used casts to off_t when calling pg_pread() or pg_pwrite():
- xlogrecovery.c: pg_pread() called with cast to off_t
- xlogreader.c: pg_pread() with cast to off_t
- walreceiver.c: pg_pwrite() with cast to off_t

While these are not critical (WAL segments have a max size of 1GB), the
casts are now corrected to pgoff_t for consistency and to avoid any
potential future issues.

Note: off_t is still used in other parts of the codebase (e.g. buffile.c)
which may have similar issues on Windows, but those are outside the
critical path for relation file extension and can be addressed separately.

On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.
---
 src/backend/access/transam/xlogreader.c       |   2 +-
 src/backend/access/transam/xlogrecovery.c     |   2 +-
 src/backend/replication/walreceiver.c         |   2 +-
 src/backend/storage/file/fd.c                 |  38 +--
 src/backend/storage/smgr/md.c                 |  50 ++--
 src/common/file_utils.c                       |   4 +-
 src/include/common/file_utils.h               |   4 +-
 src/include/port/pg_iovec.h                   |   4 +-
 src/include/port/win32_port.h                 |   4 +-
 src/include/storage/fd.h                      |  26 +-
 src/port/win32pread.c                         |  10 +-
 src/port/win32pwrite.c                        |  10 +-
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_large_files/Makefile    |  20 ++
 src/test/modules/test_large_files/README      |  53 ++++
 src/test/modules/test_large_files/meson.build |  29 ++
 .../t/001_windows_large_files.pl              |  65 +++++
 .../test_large_files--1.0.sql                 |  36 +++
 .../test_large_files/test_large_files.c       | 270 ++++++++++++++++++
 .../test_large_files/test_large_files.control |   5 +
 20 files changed, 557 insertions(+), 78 deletions(-)
 create mode 100644 src/test/modules/test_large_files/Makefile
 create mode 100644 src/test/modules/test_large_files/README
 create mode 100644 src/test/modules/test_large_files/meson.build
 create mode 100644 src/test/modules/test_large_files/t/001_windows_large_files.pl
 create mode 100644 src/test/modules/test_large_files/test_large_files--1.0.sql
 create mode 100644 src/test/modules/test_large_files/test_large_files.c
 create mode 100644 src/test/modules/test_large_files/test_large_files.control

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index dcc8d4f9c1..8ea837003f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1574,7 +1574,7 @@ WALRead(XLogReaderState *state,
 
 		/* Reset errno first; eases reporting non-errno-affecting errors */
 		errno = 0;
-		readbytes = pg_pread(state->seg.ws_file, p, segbytes, (off_t) startoff);
+		readbytes = pg_pread(state->seg.ws_file, p, segbytes, (pgoff_t) startoff);
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 550de6e4a5..c723d03d96 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3429,7 +3429,7 @@ retry:
 	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (pgoff_t) readOff);
 	if (r != XLOG_BLCKSZ)
 	{
 		char		fname[MAXFNAMELEN];
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7361ffc9dc..ec243db3a4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -928,7 +928,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 		start = pgstat_prepare_io_time(track_wal_io_timing);
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		byteswritten = pg_pwrite(recvFile, buf, segbytes, (pgoff_t) startoff);
 		pgstat_report_wait_end();
 
 		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a4ec7959f3..b25e74831e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -201,7 +201,7 @@ typedef struct vfd
 	File		nextFree;		/* link to next free VFD, if in freelist */
 	File		lruMoreRecently;	/* doubly linked recency-of-use list */
 	File		lruLessRecently;
-	off_t		fileSize;		/* current size of file (0 if not temporary) */
+	pgoff_t		fileSize;		/* current size of file (0 if not temporary) */
 	char	   *fileName;		/* name of file, or NULL for unused VFD */
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
 	int			fileFlags;		/* open(2) flags for (re)opening the file */
@@ -519,7 +519,7 @@ pg_file_exists(const char *name)
  * offset of 0 with nbytes 0 means that the entire file should be flushed
  */
 void
-pg_flush_data(int fd, off_t offset, off_t nbytes)
+pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes)
 {
 	/*
 	 * Right now file flushing is primarily used to avoid making later
@@ -635,7 +635,7 @@ retry:
 		 * may simply not be enough address space.  If so, silently fall
 		 * through to the next implementation.
 		 */
-		if (nbytes <= (off_t) SSIZE_MAX)
+		if (nbytes <= (pgoff_t) SSIZE_MAX)
 			p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);
 		else
 			p = MAP_FAILED;
@@ -697,7 +697,7 @@ retry:
  * Truncate an open file to a given length.
  */
 static int
-pg_ftruncate(int fd, off_t length)
+pg_ftruncate(int fd, pgoff_t length)
 {
 	int			ret;
 
@@ -714,7 +714,7 @@ retry:
  * Truncate a file to a given length by name.
  */
 int
-pg_truncate(const char *path, off_t length)
+pg_truncate(const char *path, pgoff_t length)
 {
 	int			ret;
 #ifdef WIN32
@@ -1526,7 +1526,7 @@ FileAccess(File file)
  * Called whenever a temporary file is deleted to report its size.
  */
 static void
-ReportTemporaryFileUsage(const char *path, off_t size)
+ReportTemporaryFileUsage(const char *path, pgoff_t size)
 {
 	pgstat_report_tempfile(size);
 
@@ -2077,7 +2077,7 @@ FileClose(File file)
  * this.
  */
 int
-FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	Assert(FileIsValid(file));
 
@@ -2108,7 +2108,7 @@ retry:
 	{
 		struct radvisory
 		{
-			off_t		ra_offset;	/* offset into the file */
+			pgoff_t		ra_offset;	/* offset into the file */
 			int			ra_count;	/* size of the read     */
 		}			ra;
 		int			returnCode;
@@ -2133,7 +2133,7 @@ retry:
 }
 
 void
-FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
+FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info)
 {
 	int			returnCode;
 
@@ -2159,7 +2159,7 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 ssize_t
-FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+FileReadV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	ssize_t		returnCode;
@@ -2216,7 +2216,7 @@ retry:
 
 int
 FileStartReadV(PgAioHandle *ioh, File file,
-			   int iovcnt, off_t offset,
+			   int iovcnt, pgoff_t offset,
 			   uint32 wait_event_info)
 {
 	int			returnCode;
@@ -2241,7 +2241,7 @@ FileStartReadV(PgAioHandle *ioh, File file,
 }
 
 ssize_t
-FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+FileWriteV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset,
 		   uint32 wait_event_info)
 {
 	ssize_t		returnCode;
@@ -2270,7 +2270,7 @@ FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset;
+		pgoff_t		past_write = offset;
 
 		for (int i = 0; i < iovcnt; ++i)
 			past_write += iov[i].iov_len;
@@ -2309,7 +2309,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + returnCode;
+			pgoff_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
@@ -2373,7 +2373,7 @@ FileSync(File file, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	int			returnCode;
 	ssize_t		written;
@@ -2418,7 +2418,7 @@ FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 #ifdef HAVE_POSIX_FALLOCATE
 	int			returnCode;
@@ -2457,7 +2457,7 @@ retry:
 	return FileZero(file, offset, amount, wait_event_info);
 }
 
-off_t
+pgoff_t
 FileSize(File file)
 {
 	Assert(FileIsValid(file));
@@ -2468,14 +2468,14 @@ FileSize(File file)
 	if (FileIsNotOpen(file))
 	{
 		if (FileAccess(file) < 0)
-			return (off_t) -1;
+			return (pgoff_t) -1;
 	}
 
 	return lseek(VfdCache[file].fd, 0, SEEK_END);
 }
 
 int
-FileTruncate(File file, off_t offset, uint32 wait_event_info)
+FileTruncate(File file, pgoff_t offset, uint32 wait_event_info)
 {
 	int			returnCode;
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 235ba7e191..e3f335a834 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -487,7 +487,7 @@ void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
 
@@ -515,9 +515,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
@@ -578,7 +578,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 	while (remblocks > 0)
 	{
 		BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
-		off_t		seekpos = (off_t) BLCKSZ * segstartblock;
+		pgoff_t		seekpos = (pgoff_t) BLCKSZ * segstartblock;
 		int			numblocks;
 
 		if (segstartblock + remblocks > RELSEG_SIZE)
@@ -607,7 +607,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			int			ret;
 
 			ret = FileFallocate(v->mdfd_vfd,
-								seekpos, (off_t) BLCKSZ * numblocks,
+								seekpos, (pgoff_t) BLCKSZ * numblocks,
 								WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret != 0)
 			{
@@ -630,7 +630,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			 * whole length of the extension.
 			 */
 			ret = FileZero(v->mdfd_vfd,
-						   seekpos, (off_t) BLCKSZ * numblocks,
+						   seekpos, (pgoff_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret < 0)
 				ereport(ERROR,
@@ -745,7 +745,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	while (nblocks > 0)
 	{
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
 		int			nblocks_this_segment;
 
@@ -754,9 +754,9 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		if (v == NULL)
 			return false;
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -851,7 +851,7 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	{
 		struct iovec iov[PG_IOV_MAX];
 		int			iovcnt;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		int			nbytes;
 		MdfdVec    *v;
 		BlockNumber nblocks_this_segment;
@@ -861,9 +861,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, false,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -986,7 +986,7 @@ mdstartreadv(PgAioHandle *ioh,
 			 SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 void **buffers, BlockNumber nblocks)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	MdfdVec    *v;
 	BlockNumber nblocks_this_segment;
 	struct iovec *iov;
@@ -996,9 +996,9 @@ mdstartreadv(PgAioHandle *ioh,
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	nblocks_this_segment =
 		Min(nblocks,
@@ -1068,7 +1068,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	{
 		struct iovec iov[PG_IOV_MAX];
 		int			iovcnt;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		int			nbytes;
 		MdfdVec    *v;
 		BlockNumber nblocks_this_segment;
@@ -1078,9 +1078,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -1173,7 +1173,7 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 	while (nblocks > 0)
 	{
 		BlockNumber nflush = nblocks;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
 		int			segnum_start,
 					segnum_end;
@@ -1202,9 +1202,9 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		Assert(nflush >= 1);
 		Assert(nflush <= nblocks);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
+		FileWriteback(v->mdfd_vfd, seekpos, (pgoff_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
 
 		nblocks -= nflush;
 		blocknum += nflush;
@@ -1348,7 +1348,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 			 */
 			BlockNumber lastsegblocks = nblocks - priorblocks;
 
-			if (FileTruncate(v->mdfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+			if (FileTruncate(v->mdfd_vfd, (pgoff_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not truncate file \"%s\" to %u blocks: %m",
@@ -1484,9 +1484,9 @@ mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL);
 
-	*off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	*off = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(*off < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	return FileGetRawDesc(v->mdfd_vfd);
 }
@@ -1868,7 +1868,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 static BlockNumber
 _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
-	off_t		len;
+	pgoff_t		len;
 
 	len = FileSize(seg->mdfd_vfd);
 	if (len < 0)
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 7b62687a2a..cdf08ab5cb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -656,7 +656,7 @@ compute_remaining_iovec(struct iovec *destination,
  * error is returned, it is unspecified how much has been written.
  */
 ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 	struct iovec iov_copy[PG_IOV_MAX];
 	ssize_t		sum = 0;
@@ -706,7 +706,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * is returned with errno set.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size, off_t offset)
+pg_pwrite_zeros(int fd, size_t size, pgoff_t offset)
 {
 	static const PGIOAlignedBlock zbuffer = {0};	/* worth BLCKSZ */
 	void	   *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data;
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 9fd88953e4..4239713803 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -55,9 +55,9 @@ extern int	compute_remaining_iovec(struct iovec *destination,
 extern ssize_t pg_pwritev_with_retry(int fd,
 									 const struct iovec *iov,
 									 int iovcnt,
-									 off_t offset);
+									 pgoff_t offset);
 
-extern ssize_t pg_pwrite_zeros(int fd, size_t size, off_t offset);
+extern ssize_t pg_pwrite_zeros(int fd, size_t size, pgoff_t offset);
 
 /* Filename components */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index 90be3af449..845ded8c71 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -51,7 +51,7 @@ struct iovec
  * this changes the current file position.
  */
 static inline ssize_t
-pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 #if HAVE_DECL_PREADV
 	/*
@@ -90,7 +90,7 @@ pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * this changes the current file position.
  */
 static inline ssize_t
-pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 #if HAVE_DECL_PWRITEV
 	/*
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index ff7028bdc8..f54ccef7db 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -584,9 +584,9 @@ typedef unsigned short mode_t;
 #endif
 
 /* in port/win32pread.c */
-extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, pgoff_t offset);
 
 /* in port/win32pwrite.c */
-extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, pgoff_t offset);
 
 #endif							/* PG_WIN32_PORT_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index b77d8e5e30..3e821ce8fb 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -108,17 +108,17 @@ extern File PathNameOpenFile(const char *fileName, int fileFlags);
 extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
-extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
-extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
-extern int	FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset, uint32 wait_event_info);
+extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset, uint32 wait_event_info);
+extern int	FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, pgoff_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
-extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
+extern int	FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern int	FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
 
-extern off_t FileSize(File file);
-extern int	FileTruncate(File file, off_t offset, uint32 wait_event_info);
-extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
+extern pgoff_t FileSize(File file);
+extern int	FileTruncate(File file, pgoff_t offset, uint32 wait_event_info);
+extern void FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info);
 extern char *FilePathName(File file);
 extern int	FileGetRawDesc(File file);
 extern int	FileGetRawFlags(File file);
@@ -186,8 +186,8 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern bool pg_file_exists(const char *name);
-extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
-extern int	pg_truncate(const char *path, off_t length);
+extern void pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes);
+extern int	pg_truncate(const char *path, pgoff_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int elevel);
@@ -196,7 +196,7 @@ extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
 static inline ssize_t
-FileRead(File file, void *buffer, size_t amount, off_t offset,
+FileRead(File file, void *buffer, size_t amount, pgoff_t offset,
 		 uint32 wait_event_info)
 {
 	struct iovec iov = {
@@ -208,7 +208,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 }
 
 static inline ssize_t
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	struct iovec iov = {
diff --git a/src/port/win32pread.c b/src/port/win32pread.c
index 32d56c462e..1f00dfd8e6 100644
--- a/src/port/win32pread.c
+++ b/src/port/win32pread.c
@@ -17,7 +17,7 @@
 #include <windows.h>
 
 ssize_t
-pg_pread(int fd, void *buf, size_t size, off_t offset)
+pg_pread(int fd, void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -30,16 +30,16 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
 		return -1;
 	}
 
-	/* Avoid overflowing DWORD. */
+	/* Avoid overflowing DWORD */
 	size = Min(size, 1024 * 1024 * 1024);
 
-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);
+
 	if (!ReadFile(handle, buf, size, &result, &overlapped))
 	{
 		if (GetLastError() == ERROR_HANDLE_EOF)
 			return 0;
-
 		_dosmaperr(GetLastError());
 		return -1;
 	}
diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c
index 249aa6c468..d9a0d23c2b 100644
--- a/src/port/win32pwrite.c
+++ b/src/port/win32pwrite.c
@@ -15,9 +15,8 @@
 #include "c.h"
 
 #include <windows.h>
-
 ssize_t
-pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
+pg_pwrite(int fd, const void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -30,11 +29,12 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
 		return -1;
 	}
 
-	/* Avoid overflowing DWORD. */
+	/* Avoid overflowing DWORD */
 	size = Min(size, 1024 * 1024 * 1024);
 
-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);
+
 	if (!WriteFile(handle, buf, size, &result, &overlapped))
 	{
 		_dosmaperr(GetLastError());
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4c..95af220a4d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_large_files')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_large_files/Makefile b/src/test/modules/test_large_files/Makefile
new file mode 100644
index 0000000000..26bb53a51f
--- /dev/null
+++ b/src/test/modules/test_large_files/Makefile
@@ -0,0 +1,20 @@
+# src/test/modules/test_large_files/Makefile
+
+MODULE_big = test_large_files
+OBJS = test_large_files.o
+
+EXTENSION = test_large_files
+DATA = test_large_files--1.0.sql
+
+REGRESS = test_large_files
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_large_files
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_large_files/README b/src/test/modules/test_large_files/README
new file mode 100644
index 0000000000..9df6a2ce84
--- /dev/null
+++ b/src/test/modules/test_large_files/README
@@ -0,0 +1,53 @@
+Test Module for Windows Large File I/O
+
+This test module provides functions to test PostgreSQL's ability to
+handle files larger than 4GB on Windows.
+
+Requirements
+
+- Windows platform
+- PostgreSQL built with segment size greater than 2GB
+- NTFS filesystem (for sparse file support)
+
+Functions
+
+test_create_sparse_file(filename text, size_gb int) RETURNS boolean
+
+Creates a sparse file of the specified size in gigabytes. This allows
+testing large offsets without actually writing gigabytes of data to
+disk.
+
+test_sparse_write_read(filename text, offset_gb float8, test_data text)
+RETURNS boolean
+
+Writes test data at the specified offset (in GB) using PostgreSQL's VFD
+layer (FileWrite), then reads it back using FileRead to verify basic I/O
+functionality.
+
+test_verify_offset_native(filename text, offset_gb float8, expected_data
+text) RETURNS boolean
+
+Critical for validation: Uses native Windows APIs (ReadFile with proper
+OVERLAPPED structure) to verify that data written by PostgreSQL is
+actually at the correct offset. This catches bugs where both write and
+read might use the same incorrect offset calculation (making a broken
+test appear to pass).
+
+Without this verification, a test could pass even with broken offset
+handling if both FileWrite and FileRead make the same mistake.
+
+What the Test Verifies
+
+1. Sparse file creation works on Windows
+2. PostgreSQL's FileWrite can write at offsets > 4GB
+3. PostgreSQL's FileRead can read from offsets > 4GB
+4. Data is actually at the correct offset (verified with native Windows
+   APIs)
+
+The native verification step is critical because without it, a test
+could pass even with broken offset handling. For example, if both
+FileWrite and FileRead truncate offsets to 32 bits, writing at 4.5GB
+would actually write at ~512MB, and reading at 4.5GB would read from
+~512MB - the test would find matching data but at the wrong location.
+The native verification catches this by independently checking the
+actual file offset.
diff --git a/src/test/modules/test_large_files/meson.build b/src/test/modules/test_large_files/meson.build
new file mode 100644
index 0000000000..c755e2cf16
--- /dev/null
+++ b/src/test/modules/test_large_files/meson.build
@@ -0,0 +1,29 @@
+# src/test/modules/test_large_files/meson.build
+
+test_large_files_sources = files(
+  'test_large_files.c',
+)
+
+if host_system == 'windows'
+  test_large_files = shared_module('test_large_files',
+    test_large_files_sources,
+    kwargs: pg_test_mod_args,
+  )
+  test_install_libs += test_large_files
+
+  test_install_data += files(
+    'test_large_files.control',
+    'test_large_files--1.0.sql',
+  )
+
+  tests += {
+    'name': 'test_large_files',
+    'sd': meson.current_source_dir(),
+    'bd': meson.current_build_dir(),
+    'tap': {
+      'tests': [
+        't/001_windows_large_files.pl',
+      ],
+    },
+  }
+endif
diff --git a/src/test/modules/test_large_files/t/001_windows_large_files.pl b/src/test/modules/test_large_files/t/001_windows_large_files.pl
new file mode 100644
index 0000000000..2fb0ef5e36
--- /dev/null
+++ b/src/test/modules/test_large_files/t/001_windows_large_files.pl
@@ -0,0 +1,65 @@
+#!/usr/bin/perl
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+=pod
+
+=head1 NAME
+
+001_windows_large_files.pl - Test Windows support for files >4GB
+
+=head1 SYNOPSIS
+
+  prove src/test/modules/test_large_files/t/001_windows_large_files.pl
+
+=head1 DESCRIPTION
+
+This test verifies that PostgreSQL on Windows can correctly handle file
+operations at offsets beyond 4GB. This requires PostgreSQL to be
+built with a segment size greater than 2GB.
+
+The test uses sparse files to avoid actually writing gigabytes of data.
+
+=cut
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Spec;
+use File::Temp;
+
+if ($^O ne 'MSWin32')
+{
+	plan skip_all => 'test is Windows-specific';
+}
+
+plan tests => 4;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+$node->safe_psql('postgres', 'CREATE EXTENSION test_large_files;');
+pass("test_large_files extension loaded");
+
+my $tempdir = File::Temp->newdir();
+my $testfile = File::Spec->catfile($tempdir, 'large_file_test.dat');
+
+note "Test file: $testfile";
+
+my $create_result = $node->safe_psql('postgres',
+	"SELECT test_create_sparse_file('$testfile', 5);");
+is($create_result, 't', "Created 5GB sparse file");
+
+my $test_4_5gb = $node->safe_psql('postgres',
+	"SELECT test_sparse_write_read('$testfile', 4.5, 'TEST_DATA_AT_4.5GB');");
+is($test_4_5gb, 't', "Write/read successful at 4.5GB offset");
+
+my $verify_4_5gb = $node->safe_psql('postgres',
+	"SELECT test_verify_offset_native('$testfile', 4.5, 'TEST_DATA_AT_4.5GB');");
+is($verify_4_5gb, 't', "Native verification confirms data at correct 4.5GB offset");
+
+$node->stop;
+
+done_testing();
diff --git a/src/test/modules/test_large_files/test_large_files--1.0.sql b/src/test/modules/test_large_files/test_large_files--1.0.sql
new file mode 100644
index 0000000000..c4db84106c
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files--1.0.sql
@@ -0,0 +1,36 @@
+-- src/test/modules/test_large_files/test_large_files--1.0.sql
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_large_files" to load this file. \quit
+
+--
+-- test_create_sparse_file(filename text, size_gb int) returns boolean
+--
+-- Creates a sparse file for testing. Windows only.
+--
+CREATE FUNCTION test_create_sparse_file(filename text, size_gb int)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_create_sparse_file'
+LANGUAGE C STRICT;
+
+--
+-- test_sparse_write_read(filename text, offset_gb float8, test_data text) returns boolean
+--
+-- Writes data at a large offset and reads it back to verify correctness.
+-- Tests pg_pwrite/pg_pread with offsets beyond 2GB and 4GB. Windows only.
+--
+CREATE FUNCTION test_sparse_write_read(filename text, offset_gb float8, test_data text)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_sparse_write_read'
+LANGUAGE C STRICT;
+
+--
+-- test_verify_offset_native(filename text, offset_gb float8, expected_data text) returns boolean
+--
+-- Uses native Windows APIs to verify data is at the correct offset.
+-- This ensures PostgreSQL's I/O didn't write to a wrapped/incorrect offset.
+--
+CREATE FUNCTION test_verify_offset_native(filename text, offset_gb float8, expected_data text)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_verify_offset_native'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_large_files/test_large_files.c b/src/test/modules/test_large_files/test_large_files.c
new file mode 100644
index 0000000000..531230da4b
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files.c
@@ -0,0 +1,270 @@
+/* src/test/modules/test_large_files/test_large_files.c */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/fd.h"
+#include "utils/builtins.h"
+
+#ifdef WIN32
+#include <windows.h>
+#include <winioctl.h>
+#endif
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_sparse_write_read);
+PG_FUNCTION_INFO_V1(test_create_sparse_file);
+PG_FUNCTION_INFO_V1(test_verify_offset_native);
+
+/*
+ * test_verify_offset_native(filename text, offset_gb float8, expected_data text) returns boolean
+ *
+ * Uses native Windows APIs to read data at the specified offset and verify it matches.
+ * This ensures PostgreSQL's I/O functions wrote to the CORRECT offset, not a wrapped one.
+ * Windows only.
+ */
+Datum
+test_verify_offset_native(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	float8		offset_gb = PG_GETARG_FLOAT8(1);
+	text	   *expected_text = PG_GETARG_TEXT_PP(2);
+	char	   *filename;
+	char	   *expected_data;
+	char	   *read_buffer;
+	int			expected_len;
+	int64		offset;
+	HANDLE		hFile;
+	OVERLAPPED	overlapped = {0};
+	DWORD		bytesRead;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+	expected_data = text_to_cstring(expected_text);
+	expected_len = strlen(expected_data) + 1;
+
+	/* Calculate offset in bytes */
+	offset = (int64) (offset_gb * 1024.0 * 1024.0 * 1024.0);
+
+	/* Open file with native Windows API */
+	hFile = CreateFile(filename,
+					   GENERIC_READ,
+					   FILE_SHARE_READ | FILE_SHARE_WRITE,
+					   NULL,
+					   OPEN_EXISTING,
+					   FILE_ATTRIBUTE_NORMAL,
+					   NULL);
+
+	if (hFile == INVALID_HANDLE_VALUE)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\" for verification: %lu",
+						filename, GetLastError())));
+
+	/* Set up OVERLAPPED structure with proper 64-bit offset */
+	overlapped.Offset = (DWORD)(offset & 0xFFFFFFFF);
+	overlapped.OffsetHigh = (DWORD)(offset >> 32);
+
+	/* Allocate read buffer */
+	read_buffer = palloc(expected_len);
+
+	/* Read using native Windows API */
+	if (!ReadFile(hFile, read_buffer, expected_len, &bytesRead, &overlapped))
+	{
+		DWORD error = GetLastError();
+		CloseHandle(hFile);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("native ReadFile failed at offset %lld: %lu",
+						offset, error)));
+	}
+
+	if (bytesRead != expected_len)
+	{
+		CloseHandle(hFile);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errmsg("native ReadFile read %lu bytes, expected %d",
+						bytesRead, expected_len)));
+	}
+
+	/* Verify data matches */
+	success = (memcmp(expected_data, read_buffer, expected_len) == 0);
+
+	pfree(read_buffer);
+	CloseHandle(hFile);
+
+	if (!success)
+		ereport(ERROR,
+				(errmsg("data mismatch at offset %lld: PostgreSQL wrote to wrong location",
+						offset)));
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("this test is only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
+
+/*
+ * test_create_sparse_file(filename text, size_gb int) returns boolean
+ *
+ * Creates a sparse file of the specified size in gigabytes.
+ * Windows only.
+ */
+Datum
+test_create_sparse_file(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	int32		size_gb = PG_GETARG_INT32(1);
+	char	   *filename;
+	HANDLE		hFile;
+	DWORD		bytesReturned;
+	LARGE_INTEGER fileSize;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+
+	/* Open/create the file */
+	hFile = CreateFile(filename,
+					   GENERIC_WRITE,
+					   0,
+					   NULL,
+					   CREATE_ALWAYS,
+					   FILE_ATTRIBUTE_NORMAL,
+					   NULL);
+
+	if (hFile == INVALID_HANDLE_VALUE)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %lu",
+						filename, GetLastError())));
+
+	/* Mark as sparse */
+	if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0,
+						 &bytesReturned, NULL))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set file sparse: %lu", GetLastError())));
+	}
+
+	/* Set file size */
+	fileSize.QuadPart = (int64) size_gb * 1024 * 1024 * 1024;
+	if (!SetFilePointerEx(hFile, fileSize, NULL, FILE_BEGIN))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set file pointer: %lu", GetLastError())));
+	}
+
+	if (!SetEndOfFile(hFile))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set end of file: %lu", GetLastError())));
+	}
+
+	success = true;
+	CloseHandle(hFile);
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("sparse files are only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
+
+/*
+ * test_sparse_write_read(filename text, offset_gb float8, test_data text) returns boolean
+ *
+ * Writes test data at the specified offset (in GB) and reads it back to verify.
+ * Tests that pg_pwrite and pg_pread work correctly with large offsets.
+ * Windows only.
+ */
+Datum
+test_sparse_write_read(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	float8		offset_gb = PG_GETARG_FLOAT8(1);
+	text	   *test_data_text = PG_GETARG_TEXT_PP(2);
+	char	   *filename;
+	char	   *test_data;
+	char	   *read_buffer;
+	int			test_data_len;
+	pgoff_t		offset;
+	int			fd;
+	ssize_t		written;
+	ssize_t		nread;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+	test_data = text_to_cstring(test_data_text);
+	test_data_len = strlen(test_data) + 1;	/* include null terminator */
+
+	/* Calculate offset in bytes */
+	offset = (pgoff_t) (offset_gb * 1024.0 * 1024.0 * 1024.0);
+
+	/* Open the file with BasicOpenFile (a raw fd, not a VFD) */
+	fd = BasicOpenFile(filename, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", filename)));
+
+	/* Write test data at the specified offset using pg_pwrite */
+	written = pg_pwrite(fd, test_data, test_data_len, offset);
+	if (written != test_data_len)
+	{
+		close(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file at offset %lld: wrote %zd of %d bytes",
+						(long long) offset, written, test_data_len)));
+	}
+
+	/* Allocate buffer for reading */
+	read_buffer = palloc(test_data_len);
+
+	/* Read back the data using pg_pread */
+	nread = pg_pread(fd, read_buffer, test_data_len, offset);
+	if (nread != test_data_len)
+	{
+		close(fd);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from file at offset %lld: read %zd of %d bytes",
+						(long long) offset, nread, test_data_len)));
+	}
+
+	/* Verify data matches */
+	success = (memcmp(test_data, read_buffer, test_data_len) == 0);
+
+	pfree(read_buffer);
+	close(fd);
+
+	if (!success)
+		ereport(ERROR,
+				(errmsg("data mismatch: read data does not match written data")));
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("this test is only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
diff --git a/src/test/modules/test_large_files/test_large_files.control b/src/test/modules/test_large_files/test_large_files.control
new file mode 100644
index 0000000000..9b0a30974b
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files.control
@@ -0,0 +1,5 @@
+# test_large_files extension
+comment = 'Test module for large file I/O on Windows'
+default_version = '1.0'
+module_pathname = '$libdir/test_large_files'
+relocatable = true
-- 
2.49.0

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#11)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Fri, Nov 7, 2025 at 11:29 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Nov 06, 2025 at 11:17:52AM -0600, Bryan Green wrote:

Latest patch attached that includes these code paths.

That feels OK for me. Thomas, do you have a different view on the
matter for HEAD? Like long, I would just switch to something that we
have in the tree that's fixed.

WFM.

- /* Note that this changes the file position, despite not using it. */

Why drop these comments? They still apply.

-

Accidental whitespace change?

         struct radvisory
         {
-            off_t        ra_offset;    /* offset into the file */
+            pgoff_t        ra_offset;    /* offset into the file */

IIRC this is a struct definition from an Apple man page, maybe leave unchanged?

Looking at the v3 that arrived while I was typing this:

+ errmsg("sparse files are only supported on Windows")));

Nit: maybe "sparse file test only supported on Windows"? Also, nice test!

And +1 for the idea to restrict the segment size to never be more than
2GB based on a ./configure and meson check on the back branches. In
PG15 and older branches, we already enforced a check by the way. See
src/tools/msvc/Solution.pm which was the only way to compile the code
with visual studio so one would have never seen the limitations except
if they had the idea to edit the perl scripts (FWIW, I've done exactly
that in the past for a past project at $company, never touched the
segsize):
# only allow segsize 1 for now, as we can't do large files yet in windows
die "Bad segsize $options->{segsize}"
unless $options->{segsize} == 1;

So this is a meson issue that goes down to v16, when using a VS
compiler. Was there a different compiler where off_t is also 4 bytes?

Ohh, this all makes more sense now. I wasn't wrong to think there was
already a check, it just didn't get ported to meson.

I wouldn't personally pitch the commit message as "Fix ...", which
sounds like a bug fix. There *is* a bug, but it's in the meson work.
Something more like "Allow large relation files on Windows" seems more
appropriate for this one, but YMMV.

MinGW is mentioned as clear by Thomas.

Only MinGW + meson. MinGW + configure has 32-bit off_t as far as I
can tell because we do:

if test "$PORTNAME" != "win32"; then
AC_SYS_LARGEFILE
...

I don't personally know of any current Unix without LFS, they just
vary on whether it's always on or you have to ask for it, as autoconf
and meson know. But I suppose the check for oversized segments should
use sizeof(off_t), not the OS's identity.
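
For illustration, a rule keyed on sizeof(off_t) rather than on the platform name could be sketched in C as below. This is only a sketch under stated assumptions, not the check that ended up in configure or meson: the helper names max_signed_offset() and segment_size_supported() are hypothetical, and the block and segment sizes are passed in as plain integers instead of coming from BLCKSZ and RELSEG_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Largest byte offset representable in a signed integer type that is
 * off_size bytes wide (off_t is a signed type per POSIX).
 */
static int64_t
max_signed_offset(size_t off_size)
{
	assert(off_size > 0 && off_size <= sizeof(int64_t));
	if (off_size == sizeof(int64_t))
		return INT64_MAX;
	return ((int64_t) 1 << (off_size * 8 - 1)) - 1;
}

/*
 * Hypothetical helper: returns 1 if a segment of relseg_size blocks of
 * blcksz bytes each can be addressed with an off_size-byte off_t, else 0.
 */
static int
segment_size_supported(int64_t blcksz, int64_t relseg_size, size_t off_size)
{
	return blcksz * relseg_size <= max_signed_offset(off_size);
}
```

A build-time probe would call something like segment_size_supported(BLCKSZ, RELSEG_SIZE, sizeof(off_t)). With 8kB blocks, a 1GB segment passes with a 4-byte off_t, while a 2GB segment is rejected unless off_t is 8 bytes wide.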

There are of course a few filesystems even today that don't let you
make a file as big as our maximum size, but that's another topic.

#14Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#13)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Fri, Nov 07, 2025 at 02:45:32PM +1300, Thomas Munro wrote:

Only MinGW + meson. MinGW + configure has 32-bit off_t as far as I
can tell because we do:

if test "$PORTNAME" != "win32"; then
AC_SYS_LARGEFILE
...

I don't personally know of any current Unix without LFS, they just
vary on whether it's always on or you have to ask for it, as autoconf
and meson know. But I suppose the check for oversized segments should
use sizeof(off_t), not the OS's identity.

Yes, I was first wondering about the addition of a WIN32 check for
meson, but this is a much better idea for both ./configure and meson.

There is a cc.sizeof(), which I guess should be enough to report the
size of off_t, and fail if we try a size larger than 4GB for the
segment file when a 4-byte off_t is detected. It's something that I'd
rather backpatch first down to v16, before moving on with more pgoff_t
integration in the tree, mostly for history clarity. That's clearly
an oversight.
--
Michael

#15Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#14)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Fri, Nov 7, 2025 at 3:13 PM Michael Paquier <michael@paquier.xyz> wrote:

There is a cc.sizeof(), which I guess should be enough to report the
size of off_t, and fail if we try a size larger than 4GB for the
segment file when a 4-byte off_t is detected.

(It's signed per POSIX and on Windows so I assume you meant to write 2GB here.)
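
To make the arithmetic behind the 2GB figure concrete, here is a minimal C sketch of what happens to a 64-bit offset once it is squeezed through a 4-byte signed off_t. The helper offset_as_int32() is hypothetical and not part of any patch in this thread; it assumes the usual two's-complement wraparound and models it explicitly so the result does not depend on implementation-defined conversions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the value a non-negative 64-bit offset takes after
 * passing through a 32-bit signed off_t, assuming two's-complement wrap.
 */
static int64_t
offset_as_int32(int64_t offset)
{
	uint32_t	low;

	assert(offset >= 0);
	low = (uint32_t) offset;	/* keep only the low 32 bits */

	if (low >= UINT32_C(0x80000000))
		return (int64_t) low - INT64_C(4294967296);	/* sign bit set */
	return (int64_t) low;
}
```

Under these assumptions, an offset of 2^31 - 1 survives unchanged, an offset of exactly 2GB (2^31) comes out as -2^31, and an offset of 4.5GB comes out as 512MB, which is the wrong-location case the test module's native verification is meant to catch.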

#16Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#15)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Fri, Nov 07, 2025 at 03:56:07PM +1300, Thomas Munro wrote:

On Fri, Nov 7, 2025 at 3:13 PM Michael Paquier <michael@paquier.xyz> wrote:

There is a cc.sizeof(), which I guess should be enough to report the
size of off_t, and fail if we try a size larger than 4GB for the
segment file when a 4-byte off_t is detected.

(It's signed per POSIX and on Windows so I assume you meant to write 2GB here.)

Yes, right..
--
Michael

#17Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#16)
1 attachment(s)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Fri, Nov 07, 2025 at 12:32:21PM +0900, Michael Paquier wrote:

On Fri, Nov 07, 2025 at 03:56:07PM +1300, Thomas Munro wrote:

(It's signed per POSIX and on Windows so I assume you meant to write 2GB here.)

Yes, right..

So, please find attached a patch that adds a check for large files in
meson as we do now in ./configure, for a backpatch down to v16 (once
the branch freeze is lifted next week of course).
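
For readers less familiar with meson, the condition can be sketched in C roughly as below. This only illustrates the logic and is not code from the tree; the function name segsize_is_supported and its parameters are hypothetical, with off_t_size standing in for the result of cc.sizeof('off_t'):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration of the build-time check: without a 64-bit
 * off_t, a segment of 2GB or more cannot be addressed.
 *
 * off_t_size      size of off_t in bytes, as reported by the compiler
 * segsize_blocks  segment size in blocks
 * blocksize       block size in bytes
 */
static bool
segsize_is_supported(size_t off_t_size, int64_t segsize_blocks,
                     int64_t blocksize)
{
    const int64_t two_gb = INT64_C(2) * 1024 * 1024 * 1024;

    if (off_t_size < 8)
        return segsize_blocks * blocksize < two_gb;

    return true;
}
```

With 8kB blocks, the default 1GB segment is 131072 blocks and passes with a 4-byte off_t, while a 2GB segment (262144 blocks) is rejected unless off_t is 8 bytes.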

Thoughts?
--
Michael

Attachments:

0001-Add-check-for-large-files-in-meson.build.patch (text/x-diff; charset=us-ascii)
From 88e1d4c56f74c604d3773da4c45c0c592724adc5 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sat, 8 Nov 2025 18:36:52 +0900
Subject: [PATCH] Add check for large files in meson.build

This check existed in the MSVC scripts that have been removed in v16,
and was missing from meson.

Backpatch-through: 16
---
 meson.build | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/meson.build b/meson.build
index 00c46d400714..c7bc2df2c82f 100644
--- a/meson.build
+++ b/meson.build
@@ -452,6 +452,14 @@ else
   segsize = (get_option('segsize') * 1024 * 1024 * 1024) / blocksize
 endif
 
+# If we don't have largefile support, can't handle segment size >= 2GB.
+if cc.sizeof('off_t', args: test_c_args) < 8
+  segsize_bytes = segsize * blocksize
+  if segsize_bytes >= (2 * 1024 * 1024 * 1024)
+    error('Large file support is not enabled. Segment size cannot be larger than 1GB.')
+  endif
+endif
+
 cdata.set('BLCKSZ', blocksize, description:
 '''Size of a disk block --- this also limits the size of a tuple. You can set
    it bigger if you need bigger tuples (although TOAST should reduce the need
-- 
2.51.0

#18Bryan Green
dbryan.green@gmail.com
In reply to: Michael Paquier (#17)
2 attachment(s)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On 11/8/2025 3:40 AM, Michael Paquier wrote:

On Fri, Nov 07, 2025 at 12:32:21PM +0900, Michael Paquier wrote:

On Fri, Nov 07, 2025 at 03:56:07PM +1300, Thomas Munro wrote:

(It's signed per POSIX and on Windows so I assume you meant to write 2GB here.)

Yes, right..

So, please find attached a patch that adds a check for large files in
meson as we do now in ./configure, for a backpatch down to v16 (once
the branch freeze is lifted next week of course).

Thoughts?
--
Michael

I have attached v4 of the patch after correcting some whitespace errors
and dropping the pgoff_t modification from a struct that didn't need it,
as requested by Thomas. I have also attached Michael's patch so that
cfbot can find both patches for the commitfest.
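
As a portable illustration of the two failure modes discussed in this thread (wraparound at 2GB and loss of the high offset bits at 4GB), here is a minimal sketch. It does not use windows.h: OffsetParts is a hypothetical stand-in for the Offset and OffsetHigh fields of OVERLAPPED, and all function names are made up for the example. The wraparound is modeled explicitly, because converting an out-of-range value to a signed type is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two 32-bit offset fields of OVERLAPPED. */
typedef struct OffsetParts
{
    uint32_t    low;            /* plays the role of Offset */
    uint32_t    high;           /* plays the role of OffsetHigh */
} OffsetParts;

/* Value a two's-complement 32-bit off_t would hold after truncation. */
static int64_t
wrap_to_signed32(int64_t offset)
{
    uint32_t    low = (uint32_t) offset;    /* well-defined: modulo 2^32 */

    return low > INT32_MAX ? (int64_t) low - (INT64_C(1) << 32)
                           : (int64_t) low;
}

/* What remains when only the low 32 bits of an offset are kept. */
static uint32_t
truncate_to_32(int64_t offset)
{
    assert(offset >= 0);
    return (uint32_t) offset;
}

/* Split a 64-bit offset into low and high halves, as the patch does. */
static OffsetParts
split_offset(int64_t offset)
{
    OffsetParts parts;

    assert(offset >= 0);
    parts.low = (uint32_t) offset;
    parts.high = (uint32_t) ((uint64_t) offset >> 32);
    return parts;
}

/* Reassemble the original offset from its two halves. */
static int64_t
join_offset(OffsetParts parts)
{
    return (int64_t) (((uint64_t) parts.high << 32) | parts.low);
}
```

An offset of exactly 2GB wraps to a negative value in the first helper. An offset of 4.5GB (4831838208 bytes) keeps only 512MB if the high half is dropped, whereas split_offset() followed by join_offset() round-trips it unchanged.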

--
Bryan Green
EDB: https://www.enterprisedb.com

Attachments:

v4-0001-Add-Windows-support-for-large-files.patch (text/plain; charset=UTF-8)
From 0214bb628176da3c7ebe9258466d41ac5bef3599 Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 6 Nov 2025 10:56:02 -0600
Subject: [PATCH v4] Add Windows support for files larger than 2GB

PostgreSQL's Windows port has been unable to handle files larger than 2GB
due to pervasive use of off_t for file offsets, which is only 32-bit on
Windows. This causes signed integer overflow at exactly 2^31 bytes.

The codebase already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
implementations still used off_t.

This issue is unlikely to affect most users since the default RELSEG_SIZE
is 1GB, keeping individual segment files small. However, anyone building
with --with-segsize of 2 or larger would hit this bug. Tested with
--with-segsize=8 and verified that files can now grow beyond 4GB.

This version also addresses three additional code paths in WAL handling
that used casts to off_t when calling pg_pread() or pg_pwrite():
- xlogrecovery.c: pg_pread() called with cast to off_t
- xlogreader.c: pg_pread() with cast to off_t
- walreceiver.c: pg_pwrite() with cast to off_t

While these are not critical (WAL segments have a max size of 1GB), the
casts are now corrected to pgoff_t for consistency and to avoid any
potential future issues.

Note: off_t is still used in other parts of the codebase (e.g. buffile.c)
which may have similar issues on Windows, but those are outside the
critical path for relation file extension and can be addressed separately.

On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.
---
 src/backend/access/transam/xlogreader.c       |   2 +-
 src/backend/access/transam/xlogrecovery.c     |   2 +-
 src/backend/replication/walreceiver.c         |   2 +-
 src/backend/storage/file/fd.c                 |  36 +--
 src/backend/storage/smgr/md.c                 |  50 ++--
 src/common/file_utils.c                       |   4 +-
 src/include/common/file_utils.h               |   4 +-
 src/include/port/pg_iovec.h                   |   4 +-
 src/include/port/win32_port.h                 |   4 +-
 src/include/storage/fd.h                      |  26 +-
 src/port/win32pread.c                         |  10 +-
 src/port/win32pwrite.c                        |  10 +-
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_large_files/Makefile    |  20 ++
 src/test/modules/test_large_files/README      |  53 ++++
 src/test/modules/test_large_files/meson.build |  29 ++
 .../t/001_windows_large_files.pl              |  65 +++++
 .../test_large_files--1.0.sql                 |  36 +++
 .../test_large_files/test_large_files.c       | 270 ++++++++++++++++++
 .../test_large_files/test_large_files.control |   5 +
 20 files changed, 556 insertions(+), 77 deletions(-)
 create mode 100644 src/test/modules/test_large_files/Makefile
 create mode 100644 src/test/modules/test_large_files/README
 create mode 100644 src/test/modules/test_large_files/meson.build
 create mode 100644 src/test/modules/test_large_files/t/001_windows_large_files.pl
 create mode 100644 src/test/modules/test_large_files/test_large_files--1.0.sql
 create mode 100644 src/test/modules/test_large_files/test_large_files.c
 create mode 100644 src/test/modules/test_large_files/test_large_files.control

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 755f351143..9cc7488e89 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1574,7 +1574,7 @@ WALRead(XLogReaderState *state,
 
 		/* Reset errno first; eases reporting non-errno-affecting errors */
 		errno = 0;
-		readbytes = pg_pread(state->seg.ws_file, p, segbytes, (off_t) startoff);
+		readbytes = pg_pread(state->seg.ws_file, p, segbytes, (pgoff_t) startoff);
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index eddc22fc5a..21b8f179ba 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3429,7 +3429,7 @@ retry:
 	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (pgoff_t) readOff);
 	if (r != XLOG_BLCKSZ)
 	{
 		char		fname[MAXFNAMELEN];
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ee8fecee2..4217fc54e2 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -928,7 +928,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 		start = pgstat_prepare_io_time(track_wal_io_timing);
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		byteswritten = pg_pwrite(recvFile, buf, segbytes, (pgoff_t) startoff);
 		pgstat_report_wait_end();
 
 		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a4ec7959f3..e9eaaf9c82 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -201,7 +201,7 @@ typedef struct vfd
 	File		nextFree;		/* link to next free VFD, if in freelist */
 	File		lruMoreRecently;	/* doubly linked recency-of-use list */
 	File		lruLessRecently;
-	off_t		fileSize;		/* current size of file (0 if not temporary) */
+	pgoff_t		fileSize;		/* current size of file (0 if not temporary) */
 	char	   *fileName;		/* name of file, or NULL for unused VFD */
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
 	int			fileFlags;		/* open(2) flags for (re)opening the file */
@@ -519,7 +519,7 @@ pg_file_exists(const char *name)
  * offset of 0 with nbytes 0 means that the entire file should be flushed
  */
 void
-pg_flush_data(int fd, off_t offset, off_t nbytes)
+pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes)
 {
 	/*
 	 * Right now file flushing is primarily used to avoid making later
@@ -635,7 +635,7 @@ retry:
 		 * may simply not be enough address space.  If so, silently fall
 		 * through to the next implementation.
 		 */
-		if (nbytes <= (off_t) SSIZE_MAX)
+		if (nbytes <= (pgoff_t) SSIZE_MAX)
 			p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);
 		else
 			p = MAP_FAILED;
@@ -697,7 +697,7 @@ retry:
  * Truncate an open file to a given length.
  */
 static int
-pg_ftruncate(int fd, off_t length)
+pg_ftruncate(int fd, pgoff_t length)
 {
 	int			ret;
 
@@ -714,7 +714,7 @@ retry:
  * Truncate a file to a given length by name.
  */
 int
-pg_truncate(const char *path, off_t length)
+pg_truncate(const char *path, pgoff_t length)
 {
 	int			ret;
 #ifdef WIN32
@@ -1526,7 +1526,7 @@ FileAccess(File file)
  * Called whenever a temporary file is deleted to report its size.
  */
 static void
-ReportTemporaryFileUsage(const char *path, off_t size)
+ReportTemporaryFileUsage(const char *path, pgoff_t size)
 {
 	pgstat_report_tempfile(size);
 
@@ -2077,7 +2077,7 @@ FileClose(File file)
  * this.
  */
 int
-FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	Assert(FileIsValid(file));
 
@@ -2133,7 +2133,7 @@ retry:
 }
 
 void
-FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
+FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info)
 {
 	int			returnCode;
 
@@ -2159,7 +2159,7 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 ssize_t
-FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+FileReadV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	ssize_t		returnCode;
@@ -2216,7 +2216,7 @@ retry:
 
 int
 FileStartReadV(PgAioHandle *ioh, File file,
-			   int iovcnt, off_t offset,
+			   int iovcnt, pgoff_t offset,
 			   uint32 wait_event_info)
 {
 	int			returnCode;
@@ -2241,7 +2241,7 @@ FileStartReadV(PgAioHandle *ioh, File file,
 }
 
 ssize_t
-FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+FileWriteV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset,
 		   uint32 wait_event_info)
 {
 	ssize_t		returnCode;
@@ -2270,7 +2270,7 @@ FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset;
+		pgoff_t		past_write = offset;
 
 		for (int i = 0; i < iovcnt; ++i)
 			past_write += iov[i].iov_len;
@@ -2309,7 +2309,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + returnCode;
+			pgoff_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
@@ -2373,7 +2373,7 @@ FileSync(File file, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 	int			returnCode;
 	ssize_t		written;
@@ -2418,7 +2418,7 @@ FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
  * appropriate error.
  */
 int
-FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info)
 {
 #ifdef HAVE_POSIX_FALLOCATE
 	int			returnCode;
@@ -2457,7 +2457,7 @@ retry:
 	return FileZero(file, offset, amount, wait_event_info);
 }
 
-off_t
+pgoff_t
 FileSize(File file)
 {
 	Assert(FileIsValid(file));
@@ -2468,14 +2468,14 @@ FileSize(File file)
 	if (FileIsNotOpen(file))
 	{
 		if (FileAccess(file) < 0)
-			return (off_t) -1;
+			return (pgoff_t) -1;
 	}
 
 	return lseek(VfdCache[file].fd, 0, SEEK_END);
 }
 
 int
-FileTruncate(File file, off_t offset, uint32 wait_event_info)
+FileTruncate(File file, pgoff_t offset, uint32 wait_event_info)
 {
 	int			returnCode;
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 235ba7e191..e3f335a834 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -487,7 +487,7 @@ void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
 
@@ -515,9 +515,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
@@ -578,7 +578,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 	while (remblocks > 0)
 	{
 		BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
-		off_t		seekpos = (off_t) BLCKSZ * segstartblock;
+		pgoff_t		seekpos = (pgoff_t) BLCKSZ * segstartblock;
 		int			numblocks;
 
 		if (segstartblock + remblocks > RELSEG_SIZE)
@@ -607,7 +607,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			int			ret;
 
 			ret = FileFallocate(v->mdfd_vfd,
-								seekpos, (off_t) BLCKSZ * numblocks,
+								seekpos, (pgoff_t) BLCKSZ * numblocks,
 								WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret != 0)
 			{
@@ -630,7 +630,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			 * whole length of the extension.
 			 */
 			ret = FileZero(v->mdfd_vfd,
-						   seekpos, (off_t) BLCKSZ * numblocks,
+						   seekpos, (pgoff_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret < 0)
 				ereport(ERROR,
@@ -745,7 +745,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	while (nblocks > 0)
 	{
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
 		int			nblocks_this_segment;
 
@@ -754,9 +754,9 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		if (v == NULL)
 			return false;
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -851,7 +851,7 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	{
 		struct iovec iov[PG_IOV_MAX];
 		int			iovcnt;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		int			nbytes;
 		MdfdVec    *v;
 		BlockNumber nblocks_this_segment;
@@ -861,9 +861,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, false,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -986,7 +986,7 @@ mdstartreadv(PgAioHandle *ioh,
 			 SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 void **buffers, BlockNumber nblocks)
 {
-	off_t		seekpos;
+	pgoff_t		seekpos;
 	MdfdVec    *v;
 	BlockNumber nblocks_this_segment;
 	struct iovec *iov;
@@ -996,9 +996,9 @@ mdstartreadv(PgAioHandle *ioh,
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	nblocks_this_segment =
 		Min(nblocks,
@@ -1068,7 +1068,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	{
 		struct iovec iov[PG_IOV_MAX];
 		int			iovcnt;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		int			nbytes;
 		MdfdVec    *v;
 		BlockNumber nblocks_this_segment;
@@ -1078,9 +1078,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 		nblocks_this_segment =
 			Min(nblocks,
@@ -1173,7 +1173,7 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 	while (nblocks > 0)
 	{
 		BlockNumber nflush = nblocks;
-		off_t		seekpos;
+		pgoff_t		seekpos;
 		MdfdVec    *v;
 		int			segnum_start,
 					segnum_end;
@@ -1202,9 +1202,9 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		Assert(nflush >= 1);
 		Assert(nflush <= nblocks);
 
-		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-		FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
+		FileWriteback(v->mdfd_vfd, seekpos, (pgoff_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH);
 
 		nblocks -= nflush;
 		blocknum += nflush;
@@ -1348,7 +1348,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 			 */
 			BlockNumber lastsegblocks = nblocks - priorblocks;
 
-			if (FileTruncate(v->mdfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
+			if (FileTruncate(v->mdfd_vfd, (pgoff_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not truncate file \"%s\" to %u blocks: %m",
@@ -1484,9 +1484,9 @@ mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
 	v = _mdfd_getseg(reln, forknum, blocknum, false,
 					 EXTENSION_FAIL);
 
-	*off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	*off = (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+	Assert(*off < (pgoff_t) BLCKSZ * RELSEG_SIZE);
 
 	return FileGetRawDesc(v->mdfd_vfd);
 }
@@ -1868,7 +1868,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 static BlockNumber
 _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
-	off_t		len;
+	pgoff_t		len;
 
 	len = FileSize(seg->mdfd_vfd);
 	if (len < 0)
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 7b62687a2a..cdf08ab5cb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -656,7 +656,7 @@ compute_remaining_iovec(struct iovec *destination,
  * error is returned, it is unspecified how much has been written.
  */
 ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 	struct iovec iov_copy[PG_IOV_MAX];
 	ssize_t		sum = 0;
@@ -706,7 +706,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * is returned with errno set.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size, off_t offset)
+pg_pwrite_zeros(int fd, size_t size, pgoff_t offset)
 {
 	static const PGIOAlignedBlock zbuffer = {0};	/* worth BLCKSZ */
 	void	   *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data;
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 9fd88953e4..4239713803 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -55,9 +55,9 @@ extern int	compute_remaining_iovec(struct iovec *destination,
 extern ssize_t pg_pwritev_with_retry(int fd,
 									 const struct iovec *iov,
 									 int iovcnt,
-									 off_t offset);
+									 pgoff_t offset);
 
-extern ssize_t pg_pwrite_zeros(int fd, size_t size, off_t offset);
+extern ssize_t pg_pwrite_zeros(int fd, size_t size, pgoff_t offset);
 
 /* Filename components */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index 90be3af449..845ded8c71 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -51,7 +51,7 @@ struct iovec
  * this changes the current file position.
  */
 static inline ssize_t
-pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 #if HAVE_DECL_PREADV
 	/*
@@ -90,7 +90,7 @@ pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * this changes the current file position.
  */
 static inline ssize_t
-pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset)
 {
 #if HAVE_DECL_PWRITEV
 	/*
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index ff7028bdc8..f54ccef7db 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -584,9 +584,9 @@ typedef unsigned short mode_t;
 #endif
 
 /* in port/win32pread.c */
-extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, pgoff_t offset);
 
 /* in port/win32pwrite.c */
-extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
+extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, pgoff_t offset);
 
 #endif							/* PG_WIN32_PORT_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index b77d8e5e30..3e821ce8fb 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -108,17 +108,17 @@ extern File PathNameOpenFile(const char *fileName, int fileFlags);
 extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
-extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
-extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
-extern int	FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset, uint32 wait_event_info);
+extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, pgoff_t offset, uint32 wait_event_info);
+extern int	FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, pgoff_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
-extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
+extern int	FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
+extern int	FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info);
 
-extern off_t FileSize(File file);
-extern int	FileTruncate(File file, off_t offset, uint32 wait_event_info);
-extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
+extern pgoff_t FileSize(File file);
+extern int	FileTruncate(File file, pgoff_t offset, uint32 wait_event_info);
+extern void FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info);
 extern char *FilePathName(File file);
 extern int	FileGetRawDesc(File file);
 extern int	FileGetRawFlags(File file);
@@ -186,8 +186,8 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern bool pg_file_exists(const char *name);
-extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
-extern int	pg_truncate(const char *path, off_t length);
+extern void pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes);
+extern int	pg_truncate(const char *path, pgoff_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int elevel);
@@ -196,7 +196,7 @@ extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
 static inline ssize_t
-FileRead(File file, void *buffer, size_t amount, off_t offset,
+FileRead(File file, void *buffer, size_t amount, pgoff_t offset,
 		 uint32 wait_event_info)
 {
 	struct iovec iov = {
@@ -208,7 +208,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 }
 
 static inline ssize_t
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset,
 		  uint32 wait_event_info)
 {
 	struct iovec iov = {
diff --git a/src/port/win32pread.c b/src/port/win32pread.c
index 32d56c462e..18721cd7b2 100644
--- a/src/port/win32pread.c
+++ b/src/port/win32pread.c
@@ -17,7 +17,7 @@
 #include <windows.h>
 
 ssize_t
-pg_pread(int fd, void *buf, size_t size, off_t offset)
+pg_pread(int fd, void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -30,16 +30,16 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
 		return -1;
 	}
 
-	/* Avoid overflowing DWORD. */
+	/* Avoid overflowing DWORD */
 	size = Min(size, 1024 * 1024 * 1024);
+	/* Note that this changes the file position, despite not using it */
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);
 
-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
 	if (!ReadFile(handle, buf, size, &result, &overlapped))
 	{
 		if (GetLastError() == ERROR_HANDLE_EOF)
 			return 0;
-
 		_dosmaperr(GetLastError());
 		return -1;
 	}
diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c
index 249aa6c468..d9a0d23c2b 100644
--- a/src/port/win32pwrite.c
+++ b/src/port/win32pwrite.c
@@ -15,9 +15,8 @@
 #include "c.h"
 
 #include <windows.h>
-
 ssize_t
-pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
+pg_pwrite(int fd, const void *buf, size_t size, pgoff_t offset)
 {
 	OVERLAPPED	overlapped = {0};
 	HANDLE		handle;
@@ -30,11 +29,12 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
 		return -1;
 	}
 
-	/* Avoid overflowing DWORD. */
+	/* Avoid overflowing DWORD */
 	size = Min(size, 1024 * 1024 * 1024);
 
-	/* Note that this changes the file position, despite not using it. */
-	overlapped.Offset = offset;
+	overlapped.Offset = (DWORD) offset;
+	overlapped.OffsetHigh = (DWORD) (offset >> 32);
+
 	if (!WriteFile(handle, buf, size, &result, &overlapped))
 	{
 		_dosmaperr(GetLastError());
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4c..95af220a4d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_large_files')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_large_files/Makefile b/src/test/modules/test_large_files/Makefile
new file mode 100644
index 0000000000..26bb53a51f
--- /dev/null
+++ b/src/test/modules/test_large_files/Makefile
@@ -0,0 +1,20 @@
+# src/test/modules/test_large_files/Makefile
+
+MODULE_big = test_large_files
+OBJS = test_large_files.o
+
+EXTENSION = test_large_files
+DATA = test_large_files--1.0.sql
+
+REGRESS = test_large_files
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_large_files
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_large_files/README b/src/test/modules/test_large_files/README
new file mode 100644
index 0000000000..d7caae49e6
--- /dev/null
+++ b/src/test/modules/test_large_files/README
@@ -0,0 +1,53 @@
+Test Module for Windows Large File I/O
+
+This test module provides functions to test PostgreSQL's ability to
+handle files larger than 4GB on Windows.
+
+Requirements
+
+- Windows platform
+- PostgreSQL built with segment size greater than 2GB
+- NTFS filesystem (for sparse file support)
+
+Functions
+
+test_create_sparse_file(filename text, size_gb int) RETURNS boolean
+
+Creates a sparse file of the specified size in gigabytes. This allows
+testing large offsets without actually writing gigabytes of data to
+disk.
+
+test_sparse_write_read(filename text, offset_gb float8, test_data text)
+RETURNS boolean
+
+Writes test data at the specified offset (in GB) using PostgreSQL's VFD
+layer (FileWrite), then reads it back using FileRead to verify basic I/O
+functionality.
+
+test_verify_offset_native(filename text, offset_gb float8, expected_data
+text) RETURNS boolean
+
+Critical for validation: Uses native Windows APIs (ReadFile with proper
+OVERLAPPED structure) to verify that data written by PostgreSQL is
+actually at the correct offset. This catches bugs where both write and
+read might use the same incorrect offset calculation (making a broken
+test appear to pass).
+
+Without this verification, a test could pass even with broken offset
+handling if both FileWrite and FileRead make the same mistake.
+
+What the Test Verifies
+
+1. Sparse file creation works on Windows
+2. PostgreSQL's FileWrite can write at offsets > 4GB
+3. PostgreSQL's FileRead can read from offsets > 4GB
+4. Data is actually at the correct offset (verified with native Windows
+   APIs)
+
+The native verification step is critical because without it, a test
+could pass even with broken offset handling. For example, if both
+FileWrite and FileRead truncate offsets to 32 bits, writing at 4.5GB
+would actually write at ~512MB, and reading at 4.5GB would read from
+~512MB - the test would find matching data but at the wrong location.
+The native verification catches this by independently checking the
+actual file offset.
diff --git a/src/test/modules/test_large_files/meson.build b/src/test/modules/test_large_files/meson.build
new file mode 100644
index 0000000000..c755e2cf16
--- /dev/null
+++ b/src/test/modules/test_large_files/meson.build
@@ -0,0 +1,29 @@
+# src/test/modules/test_large_files/meson.build
+
+test_large_files_sources = files(
+  'test_large_files.c',
+)
+
+if host_system == 'windows'
+  test_large_files = shared_module('test_large_files',
+    test_large_files_sources,
+    kwargs: pg_test_mod_args,
+  )
+  test_install_libs += test_large_files
+
+  test_install_data += files(
+    'test_large_files.control',
+    'test_large_files--1.0.sql',
+  )
+
+  tests += {
+    'name': 'test_large_files',
+    'sd': meson.current_source_dir(),
+    'bd': meson.current_build_dir(),
+    'tap': {
+      'tests': [
+        't/001_windows_large_files.pl',
+      ],
+    },
+  }
+endif
diff --git a/src/test/modules/test_large_files/t/001_windows_large_files.pl b/src/test/modules/test_large_files/t/001_windows_large_files.pl
new file mode 100644
index 0000000000..2fb0ef5e36
--- /dev/null
+++ b/src/test/modules/test_large_files/t/001_windows_large_files.pl
@@ -0,0 +1,65 @@
+#!/usr/bin/perl
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+=pod
+
+=head1 NAME
+
+001_windows_large_files.pl - Test Windows support for files >4GB
+
+=head1 SYNOPSIS
+
+  prove src/test/modules/test_large_files/t/001_windows_large_files.pl
+
+=head1 DESCRIPTION
+
+This test verifies that PostgreSQL on Windows can correctly handle file
+operations at offsets beyond 4GB. This requires PostgreSQL to be
+built with a segment size greater than 2GB.
+
+The test uses sparse files to avoid actually writing gigabytes of data.
+
+=cut
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Spec;
+use File::Temp;
+
+if ($^O ne 'MSWin32')
+{
+	plan skip_all => 'test is Windows-specific';
+}
+
+plan tests => 4;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+$node->safe_psql('postgres', 'CREATE EXTENSION test_large_files;');
+pass("test_large_files extension loaded");
+
+my $tempdir = File::Temp->newdir();
+my $testfile = File::Spec->catfile($tempdir, 'large_file_test.dat');
+
+note "Test file: $testfile";
+
+my $create_result = $node->safe_psql('postgres',
+	"SELECT test_create_sparse_file('$testfile', 5);");
+is($create_result, 't', "Created 5GB sparse file");
+
+my $test_4_5gb = $node->safe_psql('postgres',
+	"SELECT test_sparse_write_read('$testfile', 4.5, 'TEST_DATA_AT_4.5GB');");
+is($test_4_5gb, 't', "Write/read successful at 4.5GB offset");
+
+my $verify_4_5gb = $node->safe_psql('postgres',
+	"SELECT test_verify_offset_native('$testfile', 4.5, 'TEST_DATA_AT_4.5GB');");
+is($verify_4_5gb, 't', "Native verification confirms data at correct 4.5GB offset");
+
+$node->stop;
+
+done_testing();
diff --git a/src/test/modules/test_large_files/test_large_files--1.0.sql b/src/test/modules/test_large_files/test_large_files--1.0.sql
new file mode 100644
index 0000000000..c4db84106c
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files--1.0.sql
@@ -0,0 +1,36 @@
+-- src/test/modules/test_large_files/test_large_files--1.0.sql
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_large_files" to load this file. \quit
+
+--
+-- test_create_sparse_file(filename text, size_gb int) returns boolean
+--
+-- Creates a sparse file for testing. Windows only.
+--
+CREATE FUNCTION test_create_sparse_file(filename text, size_gb int)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_create_sparse_file'
+LANGUAGE C STRICT;
+
+--
+-- test_sparse_write_read(filename text, offset_gb float8, test_data text) returns boolean
+--
+-- Writes data at a large offset and reads it back to verify correctness.
+-- Tests pg_pwrite/pg_pread with offsets beyond 2GB and 4GB. Windows only.
+--
+CREATE FUNCTION test_sparse_write_read(filename text, offset_gb float8, test_data text)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_sparse_write_read'
+LANGUAGE C STRICT;
+
+--
+-- test_verify_offset_native(filename text, offset_gb float8, expected_data text) returns boolean
+--
+-- Uses native Windows APIs to verify data is at the correct offset.
+-- This ensures PostgreSQL's I/O didn't write to a wrapped/incorrect offset.
+--
+CREATE FUNCTION test_verify_offset_native(filename text, offset_gb float8, expected_data text)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_verify_offset_native'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_large_files/test_large_files.c b/src/test/modules/test_large_files/test_large_files.c
new file mode 100644
index 0000000000..623d2d214c
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files.c
@@ -0,0 +1,270 @@
+/* src/test/modules/test_large_files/test_large_files.c */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/fd.h"
+#include "utils/builtins.h"
+
+#ifdef WIN32
+#include <windows.h>
+#include <winioctl.h>
+#endif
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_sparse_write_read);
+PG_FUNCTION_INFO_V1(test_create_sparse_file);
+PG_FUNCTION_INFO_V1(test_verify_offset_native);
+
+/*
+ * test_verify_offset_native(filename text, offset_gb float8, expected_data text) returns boolean
+ *
+ * Uses native Windows APIs to read data at the specified offset and verify it matches.
+ * This ensures PostgreSQL's I/O functions wrote to the CORRECT offset, not a wrapped one.
+ * Windows only.
+ */
+Datum
+test_verify_offset_native(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	float8		offset_gb = PG_GETARG_FLOAT8(1);
+	text	   *expected_text = PG_GETARG_TEXT_PP(2);
+	char	   *filename;
+	char	   *expected_data;
+	char	   *read_buffer;
+	int			expected_len;
+	int64		offset;
+	HANDLE		hFile;
+	OVERLAPPED	overlapped = {0};
+	DWORD		bytesRead;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+	expected_data = text_to_cstring(expected_text);
+	expected_len = strlen(expected_data) + 1;
+
+	/* Calculate offset in bytes */
+	offset = (int64) (offset_gb * 1024.0 * 1024.0 * 1024.0);
+
+	/* Open file with native Windows API */
+	hFile = CreateFile(filename,
+					   GENERIC_READ,
+					   FILE_SHARE_READ | FILE_SHARE_WRITE,
+					   NULL,
+					   OPEN_EXISTING,
+					   FILE_ATTRIBUTE_NORMAL,
+					   NULL);
+
+	if (hFile == INVALID_HANDLE_VALUE)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\" for verification: %lu",
+						filename, GetLastError())));
+
+	/* Set up OVERLAPPED structure with proper 64-bit offset */
+	overlapped.Offset = (DWORD)(offset & 0xFFFFFFFF);
+	overlapped.OffsetHigh = (DWORD)(offset >> 32);
+
+	/* Allocate read buffer */
+	read_buffer = palloc(expected_len);
+
+	/* Read using native Windows API */
+	if (!ReadFile(hFile, read_buffer, expected_len, &bytesRead, &overlapped))
+	{
+		DWORD error = GetLastError();
+		CloseHandle(hFile);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("native ReadFile failed at offset %lld: %lu",
+						offset, error)));
+	}
+
+	if (bytesRead != expected_len)
+	{
+		CloseHandle(hFile);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errmsg("native ReadFile read %lu bytes, expected %d",
+						bytesRead, expected_len)));
+	}
+
+	/* Verify data matches */
+	success = (memcmp(expected_data, read_buffer, expected_len) == 0);
+
+	pfree(read_buffer);
+	CloseHandle(hFile);
+
+	if (!success)
+		ereport(ERROR,
+				(errmsg("data mismatch at offset %lld: PostgreSQL wrote to wrong location",
+						offset)));
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("this test is only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
+
+/*
+ * test_create_sparse_file(filename text, size_gb int) returns boolean
+ *
+ * Creates a sparse file of the specified size in gigabytes.
+ * Windows only.
+ */
+Datum
+test_create_sparse_file(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	int32		size_gb = PG_GETARG_INT32(1);
+	char	   *filename;
+	HANDLE		hFile;
+	DWORD		bytesReturned;
+	LARGE_INTEGER fileSize;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+
+	/* Open/create the file */
+	hFile = CreateFile(filename,
+					   GENERIC_WRITE,
+					   0,
+					   NULL,
+					   CREATE_ALWAYS,
+					   FILE_ATTRIBUTE_NORMAL,
+					   NULL);
+
+	if (hFile == INVALID_HANDLE_VALUE)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %lu",
+						filename, GetLastError())));
+
+	/* Mark as sparse */
+	if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0,
+						 &bytesReturned, NULL))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set file sparse: %lu", GetLastError())));
+	}
+
+	/* Set file size */
+	fileSize.QuadPart = (int64) size_gb * 1024 * 1024 * 1024;
+	if (!SetFilePointerEx(hFile, fileSize, NULL, FILE_BEGIN))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set file pointer: %lu", GetLastError())));
+	}
+
+	if (!SetEndOfFile(hFile))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set end of file: %lu", GetLastError())));
+	}
+
+	success = true;
+	CloseHandle(hFile);
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("sparse file test only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
+
+/*
+ * test_sparse_write_read(filename text, offset_gb float8, test_data text) returns boolean
+ *
+ * Writes test data at the specified offset (in GB) and reads it back to verify.
+ * Tests that pg_pwrite and pg_pread work correctly with large offsets.
+ * Windows only.
+ */
+Datum
+test_sparse_write_read(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	float8		offset_gb = PG_GETARG_FLOAT8(1);
+	text	   *test_data_text = PG_GETARG_TEXT_PP(2);
+	char	   *filename;
+	char	   *test_data;
+	char	   *read_buffer;
+	int			test_data_len;
+	pgoff_t		offset;
+	int			fd;
+	ssize_t		written;
+	ssize_t		nread;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+	test_data = text_to_cstring(test_data_text);
+	test_data_len = strlen(test_data) + 1;	/* include null terminator */
+
+	/* Calculate offset in bytes */
+	offset = (pgoff_t) (offset_gb * 1024.0 * 1024.0 * 1024.0);
+
+	/* Open the file with BasicOpenFile(), which returns a raw descriptor */
+	fd = BasicOpenFile(filename, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", filename)));
+
+	/* Write test data at the specified offset using pg_pwrite */
+	written = pg_pwrite(fd, test_data, test_data_len, offset);
+	if (written != test_data_len)
+	{
+		close(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file at offset %lld: wrote %zd of %d bytes",
+						(long long) offset, written, test_data_len)));
+	}
+
+	/* Allocate buffer for reading */
+	read_buffer = palloc(test_data_len);
+
+	/* Read back the data using pg_pread */
+	nread = pg_pread(fd, read_buffer, test_data_len, offset);
+	if (nread != test_data_len)
+	{
+		close(fd);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from file at offset %lld: read %zd of %d bytes",
+						(long long) offset, nread, test_data_len)));
+	}
+
+	/* Verify data matches */
+	success = (memcmp(test_data, read_buffer, test_data_len) == 0);
+
+	pfree(read_buffer);
+	close(fd);
+
+	if (!success)
+		ereport(ERROR,
+				(errmsg("data mismatch: read data does not match written data")));
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("this test is only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
diff --git a/src/test/modules/test_large_files/test_large_files.control b/src/test/modules/test_large_files/test_large_files.control
new file mode 100644
index 0000000000..9b0a30974b
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files.control
@@ -0,0 +1,5 @@
+# test_large_files extension
+comment = 'Test module for large file I/O on Windows'
+default_version = '1.0'
+module_pathname = '$libdir/test_large_files'
+relocatable = true
-- 
2.46.0.windows.1

0001-Add-check-for-large-files-in-meson.build.patch (text/plain; charset=UTF-8)
From 88e1d4c56f74c604d3773da4c45c0c592724adc5 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sat, 8 Nov 2025 18:36:52 +0900
Subject: [PATCH] Add check for large files in meson.build

This check existed in the MSVC scripts that have been removed in v16,
and was missing from meson.

Backpatch-through: 16
---
 meson.build | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/meson.build b/meson.build
index 00c46d400714..c7bc2df2c82f 100644
--- a/meson.build
+++ b/meson.build
@@ -452,6 +452,14 @@ else
   segsize = (get_option('segsize') * 1024 * 1024 * 1024) / blocksize
 endif
 
+# If we don't have largefile support, can't handle segment size >= 2GB.
+if cc.sizeof('off_t', args: test_c_args) < 8
+  segsize_bytes = segsize * blocksize
+  if segsize_bytes >= (2 * 1024 * 1024 * 1024)
+    error('Large file support is not enabled. Segment size cannot be larger than 1GB.')
+  endif
+endif
+
 cdata.set('BLCKSZ', blocksize, description:
 '''Size of a disk block --- this also limits the size of a tuple. You can set
    it bigger if you need bigger tuples (although TOAST should reduce the need
-- 
2.51.0

#19Michael Paquier
michael@paquier.xyz
In reply to: Bryan Green (#18)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Tue, Nov 11, 2025 at 01:22:49PM -0600, Bryan Green wrote:

I have attached v4 of the patch after correcting some whitespace errors
and a struct that didn't need the pgoff_t modification as requested by
Thomas. I also have attached Michael's patch where both patches can be
found by cfbot for the commitfest.

Thanks. As the stamps have been pushed for the next minor release, I
have applied and backpatched the meson check for now. I'll look at
your patch next, for HEAD.
--
Michael

#20Dagfinn Ilmari Mannsåker
ilmari@ilmari.org
In reply to: Michael Paquier (#19)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

Michael Paquier <michael@paquier.xyz> writes:

On Tue, Nov 11, 2025 at 01:22:49PM -0600, Bryan Green wrote:

I have attached v4 of the patch after correcting some whitespace errors
and a struct that didn't need the pgoff_t modification as requested by
Thomas. I also have attached Michael's patch where both patches can be
found by cfbot for the commitfest.

Thanks. As the stamps have been pushed for the next minor release, I
have applied and backpatched the meson check for now. I'll look at
your patch next, for HEAD.

I just noticed that the check is for 2GB, but the error message says
1GB.

- ilmari

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Dagfinn Ilmari Mannsåker (#20)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Wed, Nov 12, 2025 at 11:55 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

I just noticed that the check is for 2GB, but the error message says
1GB.

-Dsegsize only accepts integers here and it's expressed in GB, so only
1 will actually work. Presumably you could set a size up to 2GB - 1
block using -Dsegsize_blocks instead and it would work, but that's
clearly documented as developer-only. I noticed that too and
scratched my head for a moment, but I think Michael used a defensible
cut-off and a defensible error message, they just disagree :-)
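
To make the arithmetic concrete, here is a small sketch. It is not the actual meson or configure logic: segment_fits_32bit_off_t and SKETCH_BLCKSZ are names made up for illustration, and the default 8kB block size is assumed. It shows why 1GB is the only whole-GB segment size that stays under the 2GB cut-off, while 2GB minus one block still fits when the size is given in blocks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the default 8kB block size. */
#define SKETCH_BLCKSZ 8192

/*
 * Hypothetical helper: does a segment of segsize_blocks blocks stay below
 * 2GB?  This mirrors the cut-off discussed above, where a segment of 2GB
 * or more is rejected when off_t is narrower than 8 bytes.
 */
static bool
segment_fits_32bit_off_t(int64_t segsize_blocks)
{
	int64_t		segsize_bytes;

	assert(segsize_blocks > 0);
	segsize_bytes = segsize_blocks * SKETCH_BLCKSZ;
	return segsize_bytes < (INT64_C(2) * 1024 * 1024 * 1024);
}
```

Under these assumptions, segment_fits_32bit_off_t(131072) (1GB) and segment_fits_32bit_off_t(262143) (2GB minus one block) hold, while segment_fits_32bit_off_t(262144) (exactly 2GB) does not.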

#22Andres Freund
andres@anarazel.de
In reply to: Bryan Green (#18)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

Hi,

On 2025-11-11 13:22:49 -0600, Bryan Green wrote:

diff --git a/src/test/modules/test_large_files/README b/src/test/modules/test_large_files/README
new file mode 100644
index 0000000000..d7caae49e6
--- /dev/null
+++ b/src/test/modules/test_large_files/README
@@ -0,0 +1,53 @@
+Test Module for Windows Large File I/O
+
+This test module provides functions to test PostgreSQL's ability to
+handle files larger than 4GB on Windows.
+
+Requirements
+
+- Windows platform
+- PostgreSQL built with segment size greater than 2GB
+- NTFS filesystem (for sparse file support)

I don't think this test should only run on Windows. That a) makes it harder to
maintain for non-Windows devs and b) assumes that only Windows could ever have
issues.
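
As a rough idea of what a platform-neutral variant of the check could look like, here is a sketch for POSIX systems. It is not part of the posted patch: sketch_large_offset_roundtrip is a hypothetical name, and plain pwrite()/pread() stand in for pg_pwrite()/pg_pread(). It writes at a large offset, reads the data back, and then checks the resulting file size as an independent confirmation that the write did not land at a wrapped offset:

```c
/* Assumptions: a POSIX system; request pread/pwrite and a 64-bit off_t. */
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical portable check: write data at a large offset, read it back,
 * and confirm that the file ends at offset + len.  The file is removed
 * before returning.  It stays sparse on filesystems that support holes.
 */
static bool
sketch_large_offset_roundtrip(const char *path, int64_t offset,
							  const char *data)
{
	size_t		len = strlen(data) + 1; /* include null terminator */
	char		buf[256];
	bool		ok = false;
	int			fd;

	assert(len <= sizeof(buf));

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return false;

	if (pwrite(fd, data, len, (off_t) offset) == (ssize_t) len &&
		pread(fd, buf, len, (off_t) offset) == (ssize_t) len &&
		memcmp(buf, data, len) == 0)
	{
		/* Independent check: a wrapped write would leave a smaller file. */
		ok = (lseek(fd, 0, SEEK_END) == (off_t) (offset + (int64_t) len));
	}

	close(fd);
	unlink(path);
	return ok;
}
```

A call such as sketch_large_offset_roundtrip("sketch_large_file.dat", INT64_C(4831838208), "TEST_DATA_AT_4.5GB") would exercise the same 4.5GB offset as the TAP test. The file-size comparison plays the role that the native ReadFile verification plays in the Windows-only module.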

Greetings,

Andres Freund

#23Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#21)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Thu, Nov 13, 2025 at 12:26:03AM +1300, Thomas Munro wrote:

On Wed, Nov 12, 2025 at 11:55 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

I just noticed that the check is for 2GB, but the error message says
1GB.

-Dsegsize only accepts integers here and it's expressed in GB, so only
1 will actually work. Presumably you could set a size up to 2GB - 1
block using -Dsegsize_blocks instead and it would work, but that's
clearly documented as developer-only. I noticed that too and
scratched my head for a moment, but I think Michael used a defensible
cut-off and a defensible error message, they just disagree :-)

Yep, I was also scratching my head a bit on this one for the meson
bit, dived into its history before sticking to this message last
weekend for consistency.

In summary, the current formula is the same as in d3b111e3205b, with a
wording much older than that: 3c6248a828af. The original option in
this commit could only be set with GB as the unit for the segment
size, so a 4-byte off_t could only work with 1GB back then, hence the
error message worded this way. I doubt that it's worth caring much
about fancier segment sizes, but if we do, we'd better change that for
both meson and configure.
--
Michael

#24Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#19)
1 attachment(s)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Wed, Nov 12, 2025 at 04:58:43PM +0900, Michael Paquier wrote:

Thanks. As the stamps have been pushed for the next minor release, I
have applied and backpatched the meson check for now. I'll look at
your patch next, for HEAD.

Moving on to the I/O routine changes. There was a little bit of
noise in the diffs, like one more comment removed that should still be
around. Indentation needed some adjustment as well, and there was one
funny diff with a cast to pgoff_t. I have done this part as a first
step, because that's already a nice cut.

Then, about the test module.

The update of src/test/modules/Makefile was missing, and once it was
added I noticed the extra REGRESS in the module's Makefile that made
the tests fail because we just have a TAP test. TAP_TESTS = 1 was also
missing from the Makefile. I've fixed these myself as per the
attached.

Anyway, I agree with the point about the restriction with WIN32: there
can be advantages in being able to run that in other places. I think
that we should add a new value for PG_TEST_EXTRA and execute the test
based on that; otherwise, on small machines with little disk space
(think small SD cards), this is going to blow up.

Also, is there a point in making that a TAP test? A SQL test should
work OK based on the set of SQL functions introduced for the file
creation step and the validation steps. We could also use alternate
outputs if required.
--
Michael

Attachments:

v5-0001-Add-test-module-test_large_files.patch (text/x-diff; charset=us-ascii)
From c46c56e7b86e26a63f9b0b638d44558f2af93b8d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 13 Nov 2025 12:59:56 +0900
Subject: [PATCH v5] Add test module test_large_files

---
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_large_files/Makefile    |  19 ++
 src/test/modules/test_large_files/README      |  53 ++++
 src/test/modules/test_large_files/meson.build |  29 ++
 .../t/001_windows_large_files.pl              |  65 +++++
 .../test_large_files--1.0.sql                 |  36 +++
 .../test_large_files/test_large_files.c       | 270 ++++++++++++++++++
 .../test_large_files/test_large_files.control |   5 +
 .../log/regress_log_001_windows_large_files   |   1 +
 10 files changed, 480 insertions(+)
 create mode 100644 src/test/modules/test_large_files/Makefile
 create mode 100644 src/test/modules/test_large_files/README
 create mode 100644 src/test/modules/test_large_files/meson.build
 create mode 100644 src/test/modules/test_large_files/t/001_windows_large_files.pl
 create mode 100644 src/test/modules/test_large_files/test_large_files--1.0.sql
 create mode 100644 src/test/modules/test_large_files/test_large_files.c
 create mode 100644 src/test/modules/test_large_files/test_large_files.control
 create mode 100644 src/test/modules/test_large_files/tmp_check/log/regress_log_001_windows_large_files

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a79541010..442713428791 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_large_files \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cfa..95af220a4d97 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_large_files')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_large_files/Makefile b/src/test/modules/test_large_files/Makefile
new file mode 100644
index 000000000000..f9fa977797d0
--- /dev/null
+++ b/src/test/modules/test_large_files/Makefile
@@ -0,0 +1,19 @@
+# src/test/modules/test_large_files/Makefile
+
+MODULES = test_large_files
+
+EXTENSION = test_large_files
+DATA = test_large_files--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_large_files
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_large_files/README b/src/test/modules/test_large_files/README
new file mode 100644
index 000000000000..d7caae49e6a3
--- /dev/null
+++ b/src/test/modules/test_large_files/README
@@ -0,0 +1,53 @@
+Test Module for Windows Large File I/O
+
+This test module provides functions to test PostgreSQL's ability to
+handle files larger than 4GB on Windows.
+
+Requirements
+
+- Windows platform
+- PostgreSQL built with segment size greater than 2GB
+- NTFS filesystem (for sparse file support)
+
+Functions
+
+test_create_sparse_file(filename text, size_gb int) RETURNS boolean
+
+Creates a sparse file of the specified size in gigabytes. This allows
+testing large offsets without actually writing gigabytes of data to
+disk.
+
+test_sparse_write_read(filename text, offset_gb float8, test_data text)
+RETURNS boolean
+
+Writes test data at the specified offset (in GB) using pg_pwrite() on a
+descriptor from BasicOpenFile(), then reads it back using pg_pread() to
+verify basic I/O functionality.
+
+test_verify_offset_native(filename text, offset_gb float8, expected_data
+text) RETURNS boolean
+
+Critical for validation: Uses native Windows APIs (ReadFile with proper
+OVERLAPPED structure) to verify that data written by PostgreSQL is
+actually at the correct offset. This catches bugs where both write and
+read might use the same incorrect offset calculation (making a broken
+test appear to pass).
+
+Without this verification, a test could pass even with broken offset
+handling if both pg_pwrite() and pg_pread() make the same mistake.
+
+What the Test Verifies
+
+1. Sparse file creation works on Windows
+2. PostgreSQL's pg_pwrite() can write at offsets > 4GB
+3. PostgreSQL's pg_pread() can read from offsets > 4GB
+4. Data is actually at the correct offset (verified with native Windows
+   APIs)
+
+The native verification step is critical because without it, a test
+could pass even with broken offset handling. For example, if both
+pg_pwrite() and pg_pread() truncate offsets to 32 bits, writing at 4.5GB
+would actually write at ~512MB, and reading at 4.5GB would read from
+~512MB - the test would find matching data but at the wrong location.
+The native verification catches this by independently checking the
+actual file offset.
diff --git a/src/test/modules/test_large_files/meson.build b/src/test/modules/test_large_files/meson.build
new file mode 100644
index 000000000000..c755e2cf16d0
--- /dev/null
+++ b/src/test/modules/test_large_files/meson.build
@@ -0,0 +1,29 @@
+# src/test/modules/test_large_files/meson.build
+
+test_large_files_sources = files(
+  'test_large_files.c',
+)
+
+if host_system == 'windows'
+  test_large_files = shared_module('test_large_files',
+    test_large_files_sources,
+    kwargs: pg_test_mod_args,
+  )
+  test_install_libs += test_large_files
+
+  test_install_data += files(
+    'test_large_files.control',
+    'test_large_files--1.0.sql',
+  )
+
+  tests += {
+    'name': 'test_large_files',
+    'sd': meson.current_source_dir(),
+    'bd': meson.current_build_dir(),
+    'tap': {
+      'tests': [
+        't/001_windows_large_files.pl',
+      ],
+    },
+  }
+endif
diff --git a/src/test/modules/test_large_files/t/001_windows_large_files.pl b/src/test/modules/test_large_files/t/001_windows_large_files.pl
new file mode 100644
index 000000000000..2fb0ef5e36bf
--- /dev/null
+++ b/src/test/modules/test_large_files/t/001_windows_large_files.pl
@@ -0,0 +1,65 @@
+#!/usr/bin/perl
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+=pod
+
+=head1 NAME
+
+001_windows_large_files.pl - Test Windows support for files >4GB
+
+=head1 SYNOPSIS
+
+  prove src/test/modules/test_large_files/t/001_windows_large_files.pl
+
+=head1 DESCRIPTION
+
+This test verifies that PostgreSQL on Windows can correctly handle file
+operations at offsets beyond 4GB. This requires PostgreSQL to be
+built with a segment size greater than 2GB.
+
+The test uses sparse files to avoid actually writing gigabytes of data.
+
+=cut
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Spec;
+use File::Temp;
+
+if ($^O ne 'MSWin32')
+{
+	plan skip_all => 'test is Windows-specific';
+}
+
+plan tests => 4;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+$node->safe_psql('postgres', 'CREATE EXTENSION test_large_files;');
+pass("test_large_files extension loaded");
+
+my $tempdir = File::Temp->newdir();
+my $testfile = File::Spec->catfile($tempdir, 'large_file_test.dat');
+
+note "Test file: $testfile";
+
+my $create_result = $node->safe_psql('postgres',
+	"SELECT test_create_sparse_file('$testfile', 5);");
+is($create_result, 't', "Created 5GB sparse file");
+
+my $test_4_5gb = $node->safe_psql('postgres',
+	"SELECT test_sparse_write_read('$testfile', 4.5, 'TEST_DATA_AT_4.5GB');");
+is($test_4_5gb, 't', "Write/read successful at 4.5GB offset");
+
+my $verify_4_5gb = $node->safe_psql('postgres',
+	"SELECT test_verify_offset_native('$testfile', 4.5, 'TEST_DATA_AT_4.5GB');");
+is($verify_4_5gb, 't', "Native verification confirms data at correct 4.5GB offset");
+
+$node->stop;
+
+done_testing();
diff --git a/src/test/modules/test_large_files/test_large_files--1.0.sql b/src/test/modules/test_large_files/test_large_files--1.0.sql
new file mode 100644
index 000000000000..c4db84106c8d
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files--1.0.sql
@@ -0,0 +1,36 @@
+-- src/test/modules/test_large_files/test_large_files--1.0.sql
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_large_files" to load this file. \quit
+
+--
+-- test_create_sparse_file(filename text, size_gb int) returns boolean
+--
+-- Creates a sparse file for testing. Windows only.
+--
+CREATE FUNCTION test_create_sparse_file(filename text, size_gb int)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_create_sparse_file'
+LANGUAGE C STRICT;
+
+--
+-- test_sparse_write_read(filename text, offset_gb float8, test_data text) returns boolean
+--
+-- Writes data at a large offset and reads it back to verify correctness.
+-- Tests pg_pwrite/pg_pread with offsets beyond 2GB and 4GB. Windows only.
+--
+CREATE FUNCTION test_sparse_write_read(filename text, offset_gb float8, test_data text)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_sparse_write_read'
+LANGUAGE C STRICT;
+
+--
+-- test_verify_offset_native(filename text, offset_gb float8, expected_data text) returns boolean
+--
+-- Uses native Windows APIs to verify data is at the correct offset.
+-- This ensures PostgreSQL's I/O didn't write to a wrapped/incorrect offset.
+--
+CREATE FUNCTION test_verify_offset_native(filename text, offset_gb float8, expected_data text)
+RETURNS boolean
+AS 'MODULE_PATHNAME', 'test_verify_offset_native'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_large_files/test_large_files.c b/src/test/modules/test_large_files/test_large_files.c
new file mode 100644
index 000000000000..623d2d214cde
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files.c
@@ -0,0 +1,270 @@
+/* src/test/modules/test_large_files/test_large_files.c */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/fd.h"
+#include "utils/builtins.h"
+
+#ifdef WIN32
+#include <windows.h>
+#include <winioctl.h>
+#endif
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_sparse_write_read);
+PG_FUNCTION_INFO_V1(test_create_sparse_file);
+PG_FUNCTION_INFO_V1(test_verify_offset_native);
+
+/*
+ * test_verify_offset_native(filename text, offset_gb float8, expected_data text) returns boolean
+ *
+ * Uses native Windows APIs to read data at the specified offset and verify it matches.
+ * This ensures PostgreSQL's I/O functions wrote to the CORRECT offset, not a wrapped one.
+ * Windows only.
+ */
+Datum
+test_verify_offset_native(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	float8		offset_gb = PG_GETARG_FLOAT8(1);
+	text	   *expected_text = PG_GETARG_TEXT_PP(2);
+	char	   *filename;
+	char	   *expected_data;
+	char	   *read_buffer;
+	int			expected_len;
+	int64		offset;
+	HANDLE		hFile;
+	OVERLAPPED	overlapped = {0};
+	DWORD		bytesRead;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+	expected_data = text_to_cstring(expected_text);
+	expected_len = strlen(expected_data) + 1;
+
+	/* Calculate offset in bytes */
+	offset = (int64) (offset_gb * 1024.0 * 1024.0 * 1024.0);
+
+	/* Open file with native Windows API */
+	hFile = CreateFile(filename,
+					   GENERIC_READ,
+					   FILE_SHARE_READ | FILE_SHARE_WRITE,
+					   NULL,
+					   OPEN_EXISTING,
+					   FILE_ATTRIBUTE_NORMAL,
+					   NULL);
+
+	if (hFile == INVALID_HANDLE_VALUE)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\" for verification: %lu",
+						filename, GetLastError())));
+
+	/* Set up OVERLAPPED structure with proper 64-bit offset */
+	overlapped.Offset = (DWORD)(offset & 0xFFFFFFFF);
+	overlapped.OffsetHigh = (DWORD)(offset >> 32);
+
+	/* Allocate read buffer */
+	read_buffer = palloc(expected_len);
+
+	/* Read using native Windows API */
+	if (!ReadFile(hFile, read_buffer, expected_len, &bytesRead, &overlapped))
+	{
+		DWORD error = GetLastError();
+		CloseHandle(hFile);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("native ReadFile failed at offset %lld: %lu",
+						offset, error)));
+	}
+
+	if (bytesRead != expected_len)
+	{
+		CloseHandle(hFile);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errmsg("native ReadFile read %lu bytes, expected %d",
+						bytesRead, expected_len)));
+	}
+
+	/* Verify data matches */
+	success = (memcmp(expected_data, read_buffer, expected_len) == 0);
+
+	pfree(read_buffer);
+	CloseHandle(hFile);
+
+	if (!success)
+		ereport(ERROR,
+				(errmsg("data mismatch at offset %lld: PostgreSQL wrote to wrong location",
+						offset)));
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("this test is only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
+
+/*
+ * test_create_sparse_file(filename text, size_gb int) returns boolean
+ *
+ * Creates a sparse file of the specified size in gigabytes.
+ * Windows only.
+ */
+Datum
+test_create_sparse_file(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	int32		size_gb = PG_GETARG_INT32(1);
+	char	   *filename;
+	HANDLE		hFile;
+	DWORD		bytesReturned;
+	LARGE_INTEGER fileSize;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+
+	/* Open/create the file */
+	hFile = CreateFile(filename,
+					   GENERIC_WRITE,
+					   0,
+					   NULL,
+					   CREATE_ALWAYS,
+					   FILE_ATTRIBUTE_NORMAL,
+					   NULL);
+
+	if (hFile == INVALID_HANDLE_VALUE)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %lu",
+						filename, GetLastError())));
+
+	/* Mark as sparse */
+	if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0,
+						 &bytesReturned, NULL))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set file sparse: %lu", GetLastError())));
+	}
+
+	/* Set file size */
+	fileSize.QuadPart = (int64) size_gb * 1024 * 1024 * 1024;
+	if (!SetFilePointerEx(hFile, fileSize, NULL, FILE_BEGIN))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set file pointer: %lu", GetLastError())));
+	}
+
+	if (!SetEndOfFile(hFile))
+	{
+		CloseHandle(hFile);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not set end of file: %lu", GetLastError())));
+	}
+
+	success = true;
+	CloseHandle(hFile);
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("sparse file test only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
+
+/*
+ * test_sparse_write_read(filename text, offset_gb numeric, test_data text) returns boolean
+ *
+ * Writes test data at the specified offset (in GB) and reads it back to verify.
+ * Tests that pg_pwrite and pg_pread work correctly with large offsets.
+ * Windows only.
+ */
+Datum
+test_sparse_write_read(PG_FUNCTION_ARGS)
+{
+#ifdef WIN32
+	text	   *filename_text = PG_GETARG_TEXT_PP(0);
+	float8		offset_gb = PG_GETARG_FLOAT8(1);
+	text	   *test_data_text = PG_GETARG_TEXT_PP(2);
+	char	   *filename;
+	char	   *test_data;
+	char	   *read_buffer;
+	int			test_data_len;
+	pgoff_t		offset;
+	int			fd;
+	ssize_t		written;
+	ssize_t		nread;
+	bool		success = false;
+
+	filename = text_to_cstring(filename_text);
+	test_data = text_to_cstring(test_data_text);
+	test_data_len = strlen(test_data) + 1;	/* include null terminator */
+
+	/* Calculate offset in bytes */
+	offset = (pgoff_t) (offset_gb * 1024.0 * 1024.0 * 1024.0);
+
+	/* Open the file using PostgreSQL's VFD layer */
+	/* Open the file with BasicOpenFile(), which returns a raw fd, not a VFD */
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", filename)));
+
+	/* Write test data at the specified offset using pg_pwrite */
+	written = pg_pwrite(fd, test_data, test_data_len, offset);
+	if (written != test_data_len)
+	{
+		close(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file at offset %lld: wrote %zd of %d bytes",
+						(long long) offset, written, test_data_len)));
+	}
+
+	/* Allocate buffer for reading */
+	read_buffer = palloc(test_data_len);
+
+	/* Read back the data using pg_pread */
+	nread = pg_pread(fd, read_buffer, test_data_len, offset);
+	if (nread != test_data_len)
+	{
+		close(fd);
+		pfree(read_buffer);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from file at offset %lld: read %zd of %d bytes",
+						(long long) offset, nread, test_data_len)));
+	}
+
+	/* Verify data matches */
+	success = (memcmp(test_data, read_buffer, test_data_len) == 0);
+
+	pfree(read_buffer);
+	close(fd);
+
+	if (!success)
+		ereport(ERROR,
+				(errmsg("data mismatch: read data does not match written data")));
+
+	PG_RETURN_BOOL(success);
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("this test is only supported on Windows")));
+	PG_RETURN_BOOL(false);
+#endif
+}
diff --git a/src/test/modules/test_large_files/test_large_files.control b/src/test/modules/test_large_files/test_large_files.control
new file mode 100644
index 000000000000..9b0a30974b95
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files.control
@@ -0,0 +1,5 @@
+# test_large_files extension
+comment = 'Test module for large file I/O on Windows'
+default_version = '1.0'
+module_pathname = '$libdir/test_large_files'
+relocatable = true
diff --git a/src/test/modules/test_large_files/tmp_check/log/regress_log_001_windows_large_files b/src/test/modules/test_large_files/tmp_check/log/regress_log_001_windows_large_files
new file mode 100644
index 000000000000..6d1526ee93d9
--- /dev/null
+++ b/src/test/modules/test_large_files/tmp_check/log/regress_log_001_windows_large_files
@@ -0,0 +1 @@
+[12:57:48.543](0.006s) 1..0 # SKIP test is Windows-specific
-- 
2.51.0

#25Bryan Green
dbryan.green@gmail.com
In reply to: Michael Paquier (#24)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On 11/12/2025 10:05 PM, Michael Paquier wrote:

On Wed, Nov 12, 2025 at 04:58:43PM +0900, Michael Paquier wrote:

Thanks. As the stamps have been pushed for the next minor release, I
have applied and backpatched the meson check for now. I'll look at
your patch next, for HEAD.

Moving on to the I/O routine changes. There was a little bit of
noise in the diffs, like one more comment removed that should still be
around. Indentation has needed some adjustment as well, there was one
funny diff with a cast to pgoff_t. And done this part as a first
step, because that's already a nice cut.

Apologies for the state of this and your loss of time. I was testing a
new workflow and attached the wrong revision. No excuse, just what
happened. I will be more careful and do a closer review of diffs going
forward.

Then, about the test module.

src/test/modules/Makefile was missing, and once updated I have noticed
the extra REGRESS in the module's Makefile that made the tests fail
because we just have a TAP test. This also meant that TAP_TESTS = 1
was also missing from the Makefile. I've fixed these myself as per
the attached.

I had started down the path of using SQL and doing regression testing
and decided instead that a TAP test would be better for my specific case
of testing on Windows.

Anyway, I agree with the point about the restriction with WIN32: there
can be advantages in being able to run that in other places. I think
that we should add a new value for PG_TEST_EXTRA and execute the test
based on that, or on small machines with little disk space (think
small SD cards), this is going to blow up.

I was focused on testing the overlapped I/O portion of this for Windows
and that is why I went with a TAP test.

Also, is there a point in making that a TAP test? A SQL test should
work OK based on the set of SQL functions introduced for the file
creation step and the validation steps. We could also use alternate
outputs if required.
--
Michael

Thanks for all the work, Michael. I owe you for the cleanup. I assume
you are suggesting that we shift from testing for Windows-specific bugs
to testing whether, on any platform that supports N-bit file offsets,
PG's I/O layer can actually use them? Basically we could check the size
of off_t or pgoff_t and then test at those offsets specifically.
I think we would still want to use sparse files though.

The argument for a TAP test in this case would be File::Temp handles
cleanup automatically for us (even on test failure). Also, no need for
alternate output files.

I agree we should go to a cross-platform test. I'm 51/49 in favor of
using TAP tests still, but if you, or others, feel strongly otherwise, I
can restructure it to work that way.

--
Bryan Green
EDB: https://www.enterprisedb.com

#26Michael Paquier
michael@paquier.xyz
In reply to: Bryan Green (#25)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Thu, Nov 13, 2025 at 10:58:54AM -0600, Bryan Green wrote:

On 11/12/2025 10:05 PM, Michael Paquier wrote:

Moving on to the I/O routine changes. There was a little bit of
noise in the diffs, like one more comment removed that should still be
around. Indentation has needed some adjustment as well, there was one
funny diff with a cast to pgoff_t. And done this part as a first
step, because that's already a nice cut.

Apologies for the state of this and your loss of time. I was testing a
new workflow and attached the wrong revision. No excuse, just what
happened. I will be more careful and do a closer review of diffs going
forward.

No worries. Thanks for all your efforts here.

I had started down the path of using SQL and doing regression testing
and decided instead that a TAP test would be better for my specific case
of testing on Windows.

How much do we really care about the case of FSCTL_SET_SPARSE? We
don't use it in the tree, and I doubt we will, but perhaps you have
some plans to use it for something I am unaware of, that would justify
its existence?

The argument for a TAP test in this case would be File::Temp handles
cleanup automatically for us (even on test failure). Also, no need for
alternate output files.

I agree we should go to a cross-platform test. I'm 51/49 in favor of
using TAP tests still, but if you, or others, feel strongly otherwise, I
can restructure it to work that way.

There are a couple of options here:
- Use NO_INSTALLCHECK so as the test would never be run on an existing
deployment, only check. We could use that on top of a PG_TEST_EXTRA
to check with a large offset if the writes cannot be cheap.
- For alternate output, the module could have a SQL function that
returns the size of off_t or equivalent, mixed with an \if to avoid
the test for a sizeof 4 bytes.

If others argue in favor of a TAP test as well, that's OK by me.
However, there is nothing in the current patch that justifies that:
the proposal only does direct SQL function calls and does not need a
specific configuration or any cluster manipulations (aka restarts,
etc).
--
Michael

#27Bryan Green
dbryan.green@gmail.com
In reply to: Michael Paquier (#26)
1 attachment(s)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On 11/14/2025 12:44 AM, Michael Paquier wrote:

On Thu, Nov 13, 2025 at 10:58:54AM -0600, Bryan Green wrote:

On 11/12/2025 10:05 PM, Michael Paquier wrote:

Moving on to the I/O routine changes. There was a little bit of
noise in the diffs, like one more comment removed that should still be
around. Indentation has needed some adjustment as well, there was one
funny diff with a cast to pgoff_t. And done this part as a first
step, because that's already a nice cut.

Apologies for the state of this and your loss of time. I was testing a
new workflow and attached the wrong revision. No excuse, just what
happened. I will be more careful and do a closer review of diffs going
forward.

No worries. Thanks for all your efforts here.

I had started down the path of using SQL and doing regression testing
and decided instead that a TAP test would be better for my specific case
of testing on Windows.

How much do we really care about the case of FSCTL_SET_SPARSE? We
don't use it in the tree, and I doubt we will, but perhaps you have
some plans to use it for something I am unaware of, that would justify
its existence?

No plans for it. Dropped.

The argument for a TAP test in this case would be File::Temp handles
cleanup automatically for us (even on test failure). Also, no need for
alternate output files.

I agree we should go to a cross-platform test. I'm 51/49 in favor of
using TAP tests still, but if you, or others, feel strongly otherwise, I
can restructure it to work that way.

There are a couple of options here:
- Use NO_INSTALLCHECK so as the test would never be run on an existing
deployment, only check. We could use that on top of a PG_TEST_EXTRA
to check with a large offset if the writes cannot be cheap.
- For alternate output, the module could have a SQL function that
returns the size of off_t or equivalent, mixed with an \if to avoid
the test for a sizeof 4 bytes.

If others argue in favor of a TAP test as well, that's OK by me.
However, there is nothing in the current patch that justifies that:

Agreed. I've reworked this as a SQL regression test per your suggestions.

The test now uses OpenTemporaryFile() via the VFD layer, which handles
cleanup automatically, so there's no need for TAP's File::Temp. A
test_large_files_offset_size() function returns sizeof(pgoff_t), and
the SQL uses \if to skip on platforms where that's less than 8 bytes.
NO_INSTALLCHECK is set.

One issue came up during testing: at 2GB+1, the OVERLAPPED.OffsetHigh
field is naturally zero, so commenting out the OffsetHigh fix didn't
cause the test to fail. I've changed the test offset to 4GB+1 where
OffsetHigh must be non-zero. The test now catches both bugs. FileSize()
provides independent verification that writes actually reached the
correct offset.
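
As an editorial illustration of the point above (not part of the patch),
the following portable C sketch splits a 64-bit offset the same way the
OVERLAPPED Offset/OffsetHigh pair does. The offset_parts type and the
split_offset() and check_boundaries() helpers are hypothetical names
chosen for this example; only the low/high split itself comes from the
thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the two 32-bit offset fields of a Windows
 * OVERLAPPED structure (Offset and OffsetHigh).
 */
typedef struct
{
	uint32_t	low;			/* corresponds to OVERLAPPED.Offset */
	uint32_t	high;			/* corresponds to OVERLAPPED.OffsetHigh */
} offset_parts;

/* Split a non-negative 64-bit file offset into its low and high halves. */
static offset_parts
split_offset(int64_t offset)
{
	offset_parts parts;

	parts.low = (uint32_t) ((uint64_t) offset & UINT64_C(0xFFFFFFFF));
	parts.high = (uint32_t) ((uint64_t) offset >> 32);
	return parts;
}

/* Compare the two candidate test offsets discussed in the thread. */
static void
check_boundaries(void)
{
	offset_parts at_2gb = split_offset((INT64_C(1) << 31) + 1);
	offset_parts at_4gb = split_offset((INT64_C(1) << 32) + 1);

	/* At 2GB+1 the high half is zero, so a missing OffsetHigh goes unnoticed. */
	assert(at_2gb.low == UINT32_C(0x80000001));
	assert(at_2gb.high == 0);

	/* At 4GB+1 the high half is non-zero, so leaving it at zero changes the offset. */
	assert(at_4gb.low == 1);
	assert(at_4gb.high == 1);
}
```

Calling check_boundaries() from any test harness exercises both cases.
If the high half were dropped at 4GB+1, the remaining low half would
address byte 1 of the file, which is why the test also re-reads offset 0.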

I have changed the name of the patch to reflect that it is not just
adding tests, but includes the change for the problem.

Updated patch attached.

--
Michael

--
Bryan Green
EDB: https://www.enterprisedb.com

Attachments:

v6-0001-Fix-Windows-file-I-O-for-offsets-beyond-2GB.patch (text/plain; charset=UTF-8)
From 4513158e1ecd912873628b88191b15d24846cca2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 13 Nov 2025 12:59:56 +0900
Subject: [PATCH v6] Fix Windows file I/O for offsets beyond 2GB

Two bugs prevented files from exceeding 2GB on Windows when built with
segment sizes larger than the default 1GB.

First, off_t is only 32 bits on Windows with MSVC, causing signed
overflow at 2GB. Change the file I/O layer to use pgoff_t consistently:
fd.c, md.c, pg_iovec.h, file_utils.c, and their associated headers.
This is safe on Unix where pgoff_t equals off_t.

Second, the Windows pg_pwrite() and pg_pread() implementations only set
the low 32 bits of the OVERLAPPED structure, leaving OffsetHigh at zero.
This happens to work below 4GB but wraps around above that. Set both
Offset and OffsetHigh properly.

Add a regression test that validates I/O at 4GB+1. Testing beyond 4GB is
necessary because OffsetHigh is naturally zero at smaller offsets and
the bug would pass unnoticed. The test uses FileSize() to independently
verify that writes reach the correct location.
---
 meson.build                                   |   8 --
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_large_files/Makefile    |  22 ++++
 src/test/modules/test_large_files/README      |  74 +++++++++++
 .../expected/test_large_files.out             |  19 +++
 src/test/modules/test_large_files/meson.build |  36 ++++++
 .../test_large_files/sql/test_large_files.sql |  14 +++
 .../test_large_files--1.0.sql                 |   9 ++
 .../test_large_files/test_large_files.c       | 117 ++++++++++++++++++
 .../test_large_files/test_large_files.control |   5 +
 11 files changed, 298 insertions(+), 8 deletions(-)
 create mode 100644 src/test/modules/test_large_files/Makefile
 create mode 100644 src/test/modules/test_large_files/README
 create mode 100644 src/test/modules/test_large_files/expected/test_large_files.out
 create mode 100644 src/test/modules/test_large_files/meson.build
 create mode 100644 src/test/modules/test_large_files/sql/test_large_files.sql
 create mode 100644 src/test/modules/test_large_files/test_large_files--1.0.sql
 create mode 100644 src/test/modules/test_large_files/test_large_files.c
 create mode 100644 src/test/modules/test_large_files/test_large_files.control

diff --git a/meson.build b/meson.build
index 6e7ddd7468..b8f64c176d 100644
--- a/meson.build
+++ b/meson.build
@@ -452,14 +452,6 @@ else
   segsize = (get_option('segsize') * 1024 * 1024 * 1024) / blocksize
 endif
 
-# If we don't have largefile support, can't handle segment size >= 2GB.
-if cc.sizeof('off_t', args: test_c_args) < 8
-  segsize_bytes = segsize * blocksize
-  if segsize_bytes >= (2 * 1024 * 1024 * 1024)
-    error('Large file support is not enabled. Segment size cannot be larger than 1GB.')
-  endif
-endif
-
 cdata.set('BLCKSZ', blocksize, description:
 '''Size of a disk block --- this also limits the size of a tuple. You can set
    it bigger if you need bigger tuples (although TOAST should reduce the need
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index d079b91b1a..a045065ad9 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -30,6 +30,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_large_files \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index f5114469b9..9888009720 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,6 +29,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_large_files')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_large_files/Makefile b/src/test/modules/test_large_files/Makefile
new file mode 100644
index 0000000000..1960e31e52
--- /dev/null
+++ b/src/test/modules/test_large_files/Makefile
@@ -0,0 +1,22 @@
+MODULE_big = test_large_files
+OBJS = \
+	$(WIN32RES) \
+	test_large_files.o
+
+EXTENSION = test_large_files
+DATA = test_large_files--1.0.sql
+
+REGRESS = test_large_files
+
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_large_files
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_large_files/README b/src/test/modules/test_large_files/README
new file mode 100644
index 0000000000..467ec629fd
--- /dev/null
+++ b/src/test/modules/test_large_files/README
@@ -0,0 +1,74 @@
+test_large_files
+================
+
+This test module validates PostgreSQL's handling of files larger than
+4GB, specifically testing that pgoff_t (64-bit file offset type) is
+used correctly throughout the file I/O layer.
+
+Background
+----------
+
+On Windows with MSVC, off_t is only 32 bits, causing signed integer
+overflow at 2GB (2^31 bytes). Additionally, Windows' OVERLAPPED structure
+requires both low and high 32-bit offset fields to be set for offsets
+beyond 4GB. PostgreSQL defines pgoff_t as a 64-bit type (__int64 on
+Windows, off_t on Unix where it's already 64-bit) to handle large files
+correctly.
+
+Two bugs were fixed:
+
+1. Pervasive use of off_t where pgoff_t should be used in fd.c, md.c,
+   and related file I/O functions. This caused failures at exactly 2GB.
+
+2. Windows-specific bug in pg_pwrite()/pg_pread() where the OVERLAPPED
+   structure only set the low 32 bits of the file offset (Offset field),
+   leaving OffsetHigh at zero, causing wrap-around at 4GB.
+
+Test Design
+-----------
+
+The test validates file I/O at 4GB + 1 byte:
+
+1. Writes "OFFSET_0" at byte 0
+2. Writes "TESTDATA" at byte 4GB+1
+3. Checks FileSize() reports ~4GB (not ~9 bytes from wrap-around)
+4. Reads offset 0 to verify it wasn't corrupted by wrap-around
+5. Reads offset 4GB+1 to verify data integrity
+
+This approach catches both bugs:
+- The off_t truncation bug (would fail at 2GB writes)
+- The OVERLAPPED OffsetHigh bug (only manifests at 4GB+ where high bits != 0)
+
+Testing at 4GB+1 is critical because at 2GB+1, OffsetHigh would naturally
+be zero, so bugs in setting OffsetHigh wouldn't be detected. At 4GB+1,
+OffsetHigh must be 1, so the test verifies it's set correctly.
+
+The test catches the bug even if both read and write have the same
+truncation issue, because FileSize() provides independent verification.
+
+The test only runs on platforms with 64-bit pgoff_t (checked via
+sizeof(pgoff_t) >= 8).
+
+Platform Support
+----------------
+
+- Linux/Unix: Automatically creates sparse files (fast, no disk space used)
+- Windows NTFS: Creates sparse file efficiently
+- 32-bit offset platforms: Test is skipped automatically
+
+Running the Test
+----------------
+
+The test only runs during 'make check' or 'meson test', not on
+'make installcheck'. This is intentional, as the test creates temporary
+files and is designed for development/CI testing rather than production
+validation.
+
+  make check
+
+or with meson:
+
+  meson test test_large_files
+
+The test completes in seconds on most platforms. On Windows, the test
+may take longer as the OS allocates the sparse file structure.
diff --git a/src/test/modules/test_large_files/expected/test_large_files.out b/src/test/modules/test_large_files/expected/test_large_files.out
new file mode 100644
index 0000000000..a2128cdd8d
--- /dev/null
+++ b/src/test/modules/test_large_files/expected/test_large_files.out
@@ -0,0 +1,19 @@
+CREATE EXTENSION test_large_files;
+SELECT test_large_files_offset_size() >= 8 AS has_large_file_support \gset
+-- Only run test on platforms with 64-bit offsets
+\if :has_large_file_support
+    -- Test file I/O at 4GB + 1 byte boundary
+    -- This validates that pgoff_t is used correctly throughout
+    -- the file I/O layer and catches both:
+    -- 1. off_t truncation bugs (affects all operations at 2GB+)
+    -- 2. Windows OVERLAPPED structure bugs (OffsetHigh must be set
+    --    correctly at 4GB+ where high 32 bits are non-zero)
+    SELECT test_large_files_test_4gb_boundary();
+ test_large_files_test_4gb_boundary 
+------------------------------------
+ 4GB boundary test passed
+(1 row)
+
+\else
+    SELECT 'Skipped - 32-bit offsets not supported'::text AS test_large_files_test_4gb_boundary;
+\endif
diff --git a/src/test/modules/test_large_files/meson.build b/src/test/modules/test_large_files/meson.build
new file mode 100644
index 0000000000..2110bcf23f
--- /dev/null
+++ b/src/test/modules/test_large_files/meson.build
@@ -0,0 +1,36 @@
+# src/test/modules/test_large_files/meson.build
+
+test_large_files_sources = files(
+  'test_large_files.c',
+)
+
+if host_system == 'windows'
+  test_large_files_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_large_files',
+    '--FILEDESC', 'test_large_files - test module for large file I/O',])
+endif
+
+test_large_files = shared_module('test_large_files',
+  test_large_files_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_large_files
+
+test_install_data += files(
+  'test_large_files.control',
+  'test_large_files--1.0.sql',
+)
+
+tests += {
+  'name': 'test_large_files',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'test_large_files',
+    ],
+    'regress_args': ['--no-locale'],
+    # Don't run on installcheck - only during regular check
+    'runningcheck': false,
+  },
+}
diff --git a/src/test/modules/test_large_files/sql/test_large_files.sql b/src/test/modules/test_large_files/sql/test_large_files.sql
new file mode 100644
index 0000000000..4543b357f6
--- /dev/null
+++ b/src/test/modules/test_large_files/sql/test_large_files.sql
@@ -0,0 +1,14 @@
+CREATE EXTENSION test_large_files;
+SELECT test_large_files_offset_size() >= 8 AS has_large_file_support \gset
+-- Only run test on platforms with 64-bit offsets
+\if :has_large_file_support
+    -- Test file I/O at 4GB + 1 byte boundary
+    -- This validates that pgoff_t is used correctly throughout
+    -- the file I/O layer and catches both:
+    -- 1. off_t truncation bugs (affects all operations at 2GB+)
+    -- 2. Windows OVERLAPPED structure bugs (OffsetHigh must be set
+    --    correctly at 4GB+ where high 32 bits are non-zero)
+    SELECT test_large_files_test_4gb_boundary();
+\else
+    SELECT 'Skipped - 32-bit offsets not supported'::text AS test_large_files_test_4gb_boundary;
+\endif
diff --git a/src/test/modules/test_large_files/test_large_files--1.0.sql b/src/test/modules/test_large_files/test_large_files--1.0.sql
new file mode 100644
index 0000000000..9b13c398a1
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files--1.0.sql
@@ -0,0 +1,9 @@
+CREATE FUNCTION test_large_files_offset_size()
+RETURNS integer
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION test_large_files_test_4gb_boundary()
+RETURNS text
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_large_files/test_large_files.c b/src/test/modules/test_large_files/test_large_files.c
new file mode 100644
index 0000000000..70e675d473
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_large_files.c
+ *		Test module for large file I/O operations
+ *
+ * This module tests PostgreSQL's ability to handle file offsets larger
+ * than 2GB (2^31 bytes), validating that pgoff_t is correctly used
+ * throughout the file I/O layer.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "utils/builtins.h"
+#include "utils/wait_event.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_large_files_offset_size);
+Datum
+test_large_files_offset_size(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT32(sizeof(pgoff_t));
+}
+
+PG_FUNCTION_INFO_V1(test_large_files_test_4gb_boundary);
+Datum
+test_large_files_test_4gb_boundary(PG_FUNCTION_ARGS)
+{
+	File		file;
+	pgoff_t		large_offset = ((pgoff_t) 4294967296LL) + 1;
+	pgoff_t		expected_size = large_offset + 8;
+	pgoff_t		actual_size;
+	char		write_buf_0[8] = "OFFSET_0";
+	char		write_buf_large[8] = "TESTDATA";
+	char		read_buf[8];
+	int			nbytes;
+
+	file = OpenTemporaryFile(false);
+	if (file < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create temporary file")));
+
+	nbytes = FileWrite(file, write_buf_0, 8, 0, WAIT_EVENT_DATA_FILE_WRITE);
+	if (nbytes != 8)
+	{
+		FileClose(file);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write at offset 0")));
+	}
+
+	nbytes = FileWrite(file, write_buf_large, 8, large_offset,
+					   WAIT_EVENT_DATA_FILE_WRITE);
+	if (nbytes != 8)
+	{
+		FileClose(file);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write at large offset")));
+	}
+
+	actual_size = FileSize(file);
+	if (actual_size < expected_size)
+	{
+		FileClose(file);
+		ereport(ERROR,
+				(errmsg("file size is %lld bytes, expected at least %lld bytes - offset truncated",
+						(long long) actual_size,
+						(long long) expected_size)));
+	}
+
+	memset(read_buf, 0, 8);
+	nbytes = FileRead(file, read_buf, 8, 0, WAIT_EVENT_DATA_FILE_READ);
+	if (nbytes != 8)
+	{
+		FileClose(file);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from offset 0")));
+	}
+
+	if (memcmp(read_buf, write_buf_0, 8) != 0)
+	{
+		FileClose(file);
+		ereport(ERROR,
+				(errmsg("data at offset 0 was corrupted - write wrapped around")));
+	}
+
+	memset(read_buf, 0, 8);
+	nbytes = FileRead(file, read_buf, 8, large_offset,
+					  WAIT_EVENT_DATA_FILE_READ);
+	if (nbytes != 8)
+	{
+		FileClose(file);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read at large offset")));
+	}
+
+	if (memcmp(write_buf_large, read_buf, 8) != 0)
+	{
+		FileClose(file);
+		ereport(ERROR,
+				(errmsg("data mismatch at large offset")));
+	}
+
+	FileClose(file);
+
+	PG_RETURN_TEXT_P(cstring_to_text("4GB boundary test passed"));
+}
diff --git a/src/test/modules/test_large_files/test_large_files.control b/src/test/modules/test_large_files/test_large_files.control
new file mode 100644
index 0000000000..b0bff5bd86
--- /dev/null
+++ b/src/test/modules/test_large_files/test_large_files.control
@@ -0,0 +1,5 @@
+# test_large_files extension
+comment = 'Test module for large file I/O operations'
+default_version = '1.0'
+module_pathname = '$libdir/test_large_files'
+relocatable = true
-- 
2.46.0.windows.1

#28Michael Paquier
michael@paquier.xyz
In reply to: Bryan Green (#27)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Thu, Nov 27, 2025 at 01:52:33AM -0600, Bryan Green wrote:

The test now uses OpenTemporaryFile() via the VFD layer, which handles
cleanup automatically, so there's no need for TAP's File::Temp. A
test_large_files_offset_size() function returns sizeof(pgoff_t), and
the SQL uses \if to skip on platforms where that's less than 8 bytes.
NO_INSTALLCHECK is set.

One issue came up during testing: at 2GB+1, the OVERLAPPED.OffsetHigh
field is naturally zero, so commenting out the OffsetHigh fix didn't
cause the test to fail. I've changed the test offset to 4GB+1 where
OffsetHigh must be non-zero. The test now catches both bugs. FileSize()
provides independent verification that writes actually reached the
correct offset.

I have changed the name of the patch to reflect that it is not just
adding tests, but includes the change for the problem.

Updated patch attached.

Pretty cool result, and the test fails in the Windows CI with
84fb27511dbe reverted:
- test_large_files_test_4gb_boundary
-------------------------------------
- 4GB boundary test passed
-(1 row)
-
+ERROR:  file size is 9 bytes, expected at least 4294967305 bytes - offset truncated
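
The two numbers in that error can be reproduced with simple arithmetic.
The sketch below is an editorial illustration, not code from the patch:
low_32_bits(), size_after_write() and check_reported_sizes() are
hypothetical helpers, and the model assumes that the truncation keeps
only the low 32 bits of the offset, which is the effect for 4GB+1.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 32 bits of an offset, modelling the truncation. */
static int64_t
low_32_bits(int64_t offset)
{
	return (int64_t) ((uint64_t) offset & UINT64_C(0xFFFFFFFF));
}

/* End position of a write of nbytes starting at offset. */
static int64_t
size_after_write(int64_t offset, int64_t nbytes)
{
	return offset + nbytes;
}

/* Reproduce the sizes reported in the failing regression output. */
static void
check_reported_sizes(void)
{
	const int64_t large_offset = INT64_C(4294967296) + 1;	/* 4GB + 1 */

	/* Truncated: the 8-byte write lands at byte 1, so the file ends at 9. */
	assert(low_32_bits(large_offset) == 1);
	assert(size_after_write(low_32_bits(large_offset), 8) == 9);

	/* Intact: the write ends at 4GB + 1 + 8, the size the test expects. */
	assert(size_after_write(large_offset, 8) == INT64_C(4294967305));
}
```

The truncated write also overlaps the 8 bytes written earlier at offset
0, which is what the module's later read of offset 0 is meant to catch.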

-# If we don't have largefile support, can't handle segment size >= 2GB.
-if cc.sizeof('off_t', args: test_c_args) < 8
- segsize_bytes = segsize * blocksize
- if segsize_bytes >= (2 * 1024 * 1024 * 1024)
- error('Large file support is not enabled. Segment size cannot be larger than 1GB.')
- endif
-endif

I doubt that we can drop this check yet. There are still a lot of
places in the tree that need to be switched from off_t to pgoff_t,
like the buffer APIs, etc.

+SELECT test_large_files_offset_size() >= 8 AS has_large_file_support \gset
+-- Only run test on platforms with 64-bit offsets
+\if :has_large_file_support

That would be sufficient to make the test conditional. However, we
already know that pgoff_t is forced at 8 bytes all the time in the
tree, so why do we need that? If that was based on off_t, with tests
around it, that would be adapted, of course.

+PG_FUNCTION_INFO_V1(test_large_files_test_4gb_boundary);
+Datum
+test_large_files_test_4gb_boundary(PG_FUNCTION_ARGS)

As of this patch, this includes one function that opens a temporary
file, writes some data into it twice, checks its size, reads a few
bytes, and closes the file. When designing such test modules, it is
important to make them modular, IMO, so that they can be extended at
will for more cases in the future. How about introducing a set of
thin SQL wrappers around all these File*() functions, each taking as
input what it needs? For the data written, this would be a bytea
input combined with an offset. For the data read, this would be a
chunk of data to return, with the offset where the data was read
from. Then the comparisons could be done on the two byteas, for
example. FileClose() is triggered at transaction commit through
CleanupTempFiles(), so we could return the File as an int4 and pass
it around to the SQL functions (we cannot rely on pg_read_file() as
the path is not fixed), giving it as input to the read, write and
close functions (close is optional as we could rely on an explicit
commit).
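As a rough illustration of that shape, here is a standalone analogue built on plain POSIX calls rather than the backend's File*() API. Every sketch_* name is hypothetical, and the int file descriptor stands in for the File handle that would be returned as an int4. It assumes a POSIX platform where off_t is 64 bits wide, and a filesystem that accepts a sparse write just past 4GB.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700		/* for mkstemp(), pread(), pwrite() */
#endif

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption of this sketch: offsets beyond 4GB fit in off_t. */
_Static_assert(sizeof(off_t) >= 8, "sketch assumes a 64-bit off_t");

/* Hypothetical wrapper: create a temporary file and return its handle. */
static int
sketch_file_create(void)
{
	char		path[] = "/tmp/sketch_large_XXXXXX";
	int			fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);			/* file goes away once the handle is closed */
	return fd;
}

/* Hypothetical wrapper: write len bytes at offset; 0 on success, -1 on failure. */
static int
sketch_file_write(int fd, int64_t offset, const void *data, size_t len)
{
	return pwrite(fd, data, len, (off_t) offset) == (ssize_t) len ? 0 : -1;
}

/* Hypothetical wrapper: read len bytes from offset; 0 on success, -1 on failure. */
static int
sketch_file_read(int fd, int64_t offset, void *buf, size_t len)
{
	return pread(fd, buf, len, (off_t) offset) == (ssize_t) len ? 0 : -1;
}

/* Hypothetical wrapper: current size of the file in bytes. */
static int64_t
sketch_file_size(int fd)
{
	return (int64_t) lseek(fd, 0, SEEK_END);
}

/* Hypothetical wrapper: explicit close, optional if cleanup is automatic. */
static void
sketch_file_close(int fd)
{
	close(fd);
}
```

A caller can then compose these freely: write a 9-byte payload at offset 4294967296, read it back from the same offset, compare the two buffers, and check that the reported size is 4294967305. New cases only need new combinations of calls, not new C functions.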

The new module is missing a .gitignore, leading to files present in
the tree when using configure/make after a test run. You could just
copy one from one of the other test modules.
--
Michael

#29Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#28)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

Hi,

On 2025-12-02 11:46:56 +0900, Michael Paquier wrote:

I doubt that we can drop this check yet. There are still a lot of
places in the tree that need to be switched from off_t to pgoff_t,
like the buffer APIs, etc.

Hm? What are you thinking about re buffer APIs?

Greetings,

Andres Freund

#30Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#29)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Mon, Dec 01, 2025 at 09:59:52PM -0500, Andres Freund wrote:

On 2025-12-02 11:46:56 +0900, Michael Paquier wrote:

I doubt that we can drop this check yet. There are still a lot of
places in the tree that need to be switched from off_t to pgoff_t,
like the buffer APIs, etc.

Hm? What are you thinking about re buffer APIs?

buffile.h and buffile.c still have traces of off_t.
--
Michael

#31Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#30)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On 2025-12-02 12:02:39 +0900, Michael Paquier wrote:

On Mon, Dec 01, 2025 at 09:59:52PM -0500, Andres Freund wrote:

On 2025-12-02 11:46:56 +0900, Michael Paquier wrote:

I doubt that we can drop this check yet. There are still a lot of
places in the tree that need to be switched from off_t to pgoff_t,
like the buffer APIs, etc.

Hm? What are you thinking about re buffer APIs?

buffile.h and buffile.c still have traces of off_t.

Oh, I was interpreting buffer as bufmgr.c...

#32Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#31)
Re: [Patch] Windows relation extension failure at 2GB and 4GB

On Mon, Dec 01, 2025 at 10:05:52PM -0500, Andres Freund wrote:

On 2025-12-02 12:02:39 +0900, Michael Paquier wrote:

buffile.h and buffile.c still have traces of off_t.

Oh, I was interpreting buffer as bufmgr.c...

Sorry for being rather unclear here ;D
--
Michael