[Patch] Windows relation extension failure at 2GB and 4GB
Hello,
I found two related bugs in PostgreSQL's Windows port that prevent files
from exceeding 2GB. While unlikely to affect most installations (default
1GB segments), the code is objectively wrong and worth fixing.
The first bug is a pervasive use of off_t where pgoff_t should be used.
On Windows, off_t is only 32-bit, causing signed integer overflow at
exactly 2GB (2^31 bytes). PostgreSQL already defined pgoff_t as __int64
for this purpose and some function declarations in headers already use
it, but the implementations weren't updated to match.
The problem shows up in multiple layers:
In fd.c and fd.h, the VfdCache structure's fileSize field uses off_t, as
do FileSize(), FileTruncate(), and all the File* functions. These are
the core file I/O abstraction layer that everything else builds on.
In md.c, _mdnblocks() uses off_t for its calculations. The actual
arithmetic for computing file offsets is fine - the casts to pgoff_t
work correctly - but passing these values through functions with off_t
parameters truncates them.
In pg_iovec.h, pg_preadv() and pg_pwritev() take off_t offset
parameters, truncating any 64-bit offsets passed from above.
In file_utils.c, pg_pwrite_zeros() takes an off_t offset parameter. This
function is called by FileZero() to extend files with zeros, so it's hit
during relation extension. This was the actual culprit in my testing -
mdzeroextend() would compute a correct 64-bit offset, but it got
truncated to 32-bit (and negative) when passed to pg_pwrite_zeros().
After fixing all those off_t issues, there's a second bug at 4GB in the
Windows implementations of pg_pwrite()/pg_pread() in win32pwrite.c and
win32pread.c. The current implementation uses an OVERLAPPED structure
for positioned I/O, but only sets the Offset field (low 32 bits),
leaving OffsetHigh at zero. This works up to 4GB by accident, but beyond
that, offsets wrap around.
I can reproduce both bugs reliably with --with-segsize=8. The file grows
to exactly 2GB and fails with "could not extend file: Invalid argument"
despite having 300GB free. After fixing the off_t issues, it grows to
exactly 4GB and hits the OVERLAPPED bug. Both are independently verifiable.
The fix touches nine files:
src/include/storage/fd.h - File* function declarations
src/backend/storage/file/fd.c - File* implementations and VfdCache
src/backend/storage/smgr/md.c - _mdnblocks and other functions
src/include/port/pg_iovec.h - pg_preadv/pg_pwritev signatures
src/include/common/file_utils.h - pg_pwrite_zeros declaration
src/common/file_utils.c - pg_pwrite_zeros implementation
src/include/port/win32_port.h - pg_pread/pg_pwrite declarations
src/port/win32pwrite.c - Windows pwrite implementation
src/port/win32pread.c - Windows pread implementation
It's safe for all platforms since pgoff_t equals off_t on Unix where
off_t is already 64-bit. Only Windows behavior changes.
That said, I'm finding off_t used in many other places throughout the
codebase - buffile.c, various other file utilities such as backup and
archive, probably more. This is likely causing latent bugs elsewhere on
Windows, though most are masked by the 1GB default segment size. I'm
investigating the full scope, but I think this needs to be broken up
into multiple patches. The core file I/O layer (fd.c, md.c,
pg_pwrite/pg_pread) should probably go first since that's what's
actively breaking file extension.
Not urgent since few people hit this in practice, but it's clearly wrong
code.
Someone building with larger segments would see failures at 2GB and
potential corruption at 4GB. Windows supports files up to 16 exabytes -
no good reason to limit PostgreSQL to 2GB.
I have attached the patch to fix the relation extension problems for
Windows to this email.
Can provide the other patches that changes off_t for pgoff_t in the rest
of the code if there's interest in fixing this.
To reproduce the bugs on Windows:
1) Build with large segment size: meson setup build
--prefix=C:\pgsql-test -Dsegsize=8.
2) Create a large table and insert data that will make it bigger than 2GB.
CREATE TABLE large_test (
id bigserial PRIMARY KEY,
data1 text,
data2 text,
data3 text
);
INSERT INTO large_test (data1, data2, data3)
SELECT
repeat('A', 300),
repeat('B', 300),
repeat('C', 300)
FROM generate_series(1, 5000000);
SELECT pg_size_pretty(pg_relation_size('large_test'));
You will notice at this point that the first bug surfaces.
3) If you want to reproduce the 2nd bug then you should apply the patch
and then comment out 'overlapped.OffsetHigh = (DWORD) (offset >> 32);'
is win32pwrite.c.
4) Assuming you did 3, do the test in 2 again. If you are watching the
data/base/N/xxxxx file growing you will notice that it gets past 2GB but
now fails at 4GB.
BG
Attachments:
0001-Fix-Windows-file-IO.patchtext/plain; charset=UTF-8; name=0001-Fix-Windows-file-IO.patchDownload+75-77
On Tue, Oct 28, 2025 at 09:42:11AM -0500, Bryan Green wrote:
I found two related bugs in PostgreSQL's Windows port that prevent files
from exceeding 2GB. While unlikely to affect most installations (default 1GB
segments), the code is objectively wrong and worth fixing.The first bug is a pervasive use of off_t where pgoff_t should be used. On
Windows, off_t is only 32-bit, causing signed integer overflow at exactly
2GB (2^31 bytes). PostgreSQL already defined pgoff_t as __int64 for this
purpose and some function declarations in headers already use it, but the
implementations weren't updated to match.
Ugh. That's the same problem as "long", which is 8 bytes wide
everywhere except WIN32. Removing traces of "long" from the code has
been a continuous effort over the years because of these silent
overflow issues.
After fixing all those off_t issues, there's a second bug at 4GB in the
Windows implementations of pg_pwrite()/pg_pread() in win32pwrite.c and
win32pread.c. The current implementation uses an OVERLAPPED structure for
positioned I/O, but only sets the Offset field (low 32 bits), leaving
OffsetHigh at zero. This works up to 4GB by accident, but beyond that,
offsets wrap around.I can reproduce both bugs reliably with --with-segsize=8. The file grows to
exactly 2GB and fails with "could not extend file: Invalid argument" despite
having 300GB free. After fixing the off_t issues, it grows to exactly 4GB
and hits the OVERLAPPED bug. Both are independently verifiable.
The most popular option in terms of installation on Windows is the EDB
installer, where I bet that a file segment size of 1GB is what's
embedded in the code compiled. This argument would not hold with WAL
segment sizes if we begin to support even higher sizes than 1GB at
some point, and we use pwrite() in the WAL insert code. That should
not be a problem even in the near future.
It's safe for all platforms since pgoff_t equals off_t on Unix where off_t
is already 64-bit. Only Windows behavior changes.
win32_port.h and port.h say so, yeah.
Not urgent since few people hit this in practice, but it's clearly wrong
code.
Someone building with larger segments would see failures at 2GB and
potential corruption at 4GB. Windows supports files up to 16 exabytes - no
good reason to limit PostgreSQL to 2GB.
The same kind of limitations with 4GB files existed with stat() and
fstat(), but it was much more complicated than what you are doing
here, where COPY was not able to work with files larger than 4GB on
WIN32. See the saga from bed90759fcbc.
I have attached the patch to fix the relation extension problems for Windows
to this email.Can provide the other patches that changes off_t for pgoff_t in the rest of
the code if there's interest in fixing this.
Yeah, I think that we should rip out these issues, and move to the
more portable pgoff_t across the board. I doubt that any of this
could be backpatched due to the potential ABI breakages these
signatures changes would cause. Implementing things in incremental
steps is more sensible when it comes to such changes, as a revert
blast can be reduced if a portion is incorrectly handled.
I'm seeing as well the things you are pointing in buffile.c. These
should be fixed as well. The WAL code is less annoying due to the 1GB
WAL segment size limit, still consistency across the board makes the
code easier to reason about, at least.
Among the files you have mentioned, there is also copydir.c.
pg_rewind seems also broken with files larger than 4GB, from what I
can see in libpq_source.c and filemap.c.. Worse. Oops.
- /* Note that this changes the file position, despite not using it. */
- overlapped.Offset = offset;
+ overlapped.Offset = (DWORD) offset;
+ overlapped.OffsetHigh = (DWORD) (offset >> 32);
Based on the docs at [1]https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped -- Michael, yes, this change makes sense.
It seems to me that a couple of extra code paths should be handled in
the first patch, and I have spotted three of them. None of them are
critical as they are related to WAL segments, just become masked and
inconsistent:
- xlogrecovery.c, pg_pread() called with a cast to off_t. WAL
segments have a max size of 1GB, meaning that we're OK.
- xlogreader.c, pg_pread() with a cast to off_t.
- walreceiver.c, pg_pwrite().
Except for these three spots, the first patch looks like a cut good
enough on its own.
Glad to see someone who takes time to spend time on this kind of
stuff with portability in mind, by the way.
[1]: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped -- Michael
--
Michael
On 11/5/2025 11:05 PM, Michael Paquier wrote:
On Tue, Oct 28, 2025 at 09:42:11AM -0500, Bryan Green wrote:
I found two related bugs in PostgreSQL's Windows port that prevent files
from exceeding 2GB. While unlikely to affect most installations (default 1GB
segments), the code is objectively wrong and worth fixing.The first bug is a pervasive use of off_t where pgoff_t should be used. On
Windows, off_t is only 32-bit, causing signed integer overflow at exactly
2GB (2^31 bytes). PostgreSQL already defined pgoff_t as __int64 for this
purpose and some function declarations in headers already use it, but the
implementations weren't updated to match.Ugh. That's the same problem as "long", which is 8 bytes wide
everywhere except WIN32. Removing traces of "long" from the code has
been a continuous effort over the years because of these silent
overflow issues.
Exactly - these silent overflows are particularly nasty since they only
manifest under specific conditions (large files on Windows) and can
cause data corruption rather than immediate crashes.
After fixing all those off_t issues, there's a second bug at 4GB in the
Windows implementations of pg_pwrite()/pg_pread() in win32pwrite.c and
win32pread.c. The current implementation uses an OVERLAPPED structure for
positioned I/O, but only sets the Offset field (low 32 bits), leaving
OffsetHigh at zero. This works up to 4GB by accident, but beyond that,
offsets wrap around.I can reproduce both bugs reliably with --with-segsize=8. The file grows to
exactly 2GB and fails with "could not extend file: Invalid argument" despite
having 300GB free. After fixing the off_t issues, it grows to exactly 4GB
and hits the OVERLAPPED bug. Both are independently verifiable.The most popular option in terms of installation on Windows is the EDB
installer, where I bet that a file segment size of 1GB is what's
embedded in the code compiled. This argument would not hold with WAL
Right, which is why this has gone unnoticed. The 1GB default masks both
bugs completely. It's only when someone uses --with-segsize > 2 that the
issues appear.
segment sizes if we begin to support even higher sizes than 1GB at
some point, and we use pwrite() in the WAL insert code. That should
not be a problem even in the near future.It's safe for all platforms since pgoff_t equals off_t on Unix where off_t
is already 64-bit. Only Windows behavior changes.win32_port.h and port.h say so, yeah.
Not urgent since few people hit this in practice, but it's clearly wrong
code.
Someone building with larger segments would see failures at 2GB and
potential corruption at 4GB. Windows supports files up to 16 exabytes - no
good reason to limit PostgreSQL to 2GB.The same kind of limitations with 4GB files existed with stat() and
fstat(), but it was much more complicated than what you are doing
here, where COPY was not able to work with files larger than 4GB on
WIN32. See the saga from bed90759fcbc.I have attached the patch to fix the relation extension problems for Windows
to this email.Can provide the other patches that changes off_t for pgoff_t in the rest of
the code if there's interest in fixing this.Yeah, I think that we should rip out these issues, and move to the
more portable pgoff_t across the board. I doubt that any of this
could be backpatched due to the potential ABI breakages these
signatures changes would cause. Implementing things in incremental
steps is more sensible when it comes to such changes, as a revert
blast can be reduced if a portion is incorrectly handled.I'm seeing as well the things you are pointing in buffile.c. These
should be fixed as well. The WAL code is less annoying due to the 1GB
WAL segment size limit, still consistency across the board makes the
code easier to reason about, at least.Among the files you have mentioned, there is also copydir.c.
pg_rewind seems also broken with files larger than 4GB, from what I
can see in libpq_source.c and filemap.c.. Worse. Oops.- /* Note that this changes the file position, despite not using it. */ - overlapped.Offset = offset; + overlapped.Offset = (DWORD) offset; + overlapped.OffsetHigh = (DWORD) (offset >> 32);Based on the docs at [1], yes, this change makes sense.
It seems to me that a couple of extra code paths should be handled in
the first patch, and I have spotted three of them. None of them are
critical as they are related to WAL segments, just become masked and
inconsistent:
- xlogrecovery.c, pg_pread() called with a cast to off_t. WAL
segments have a max size of 1GB, meaning that we're OK.
- xlogreader.c, pg_pread() with a cast to off_t.
- walreceiver.c, pg_pwrite().
I'll include these in the first patch for consistency even though
they're not currently problematic. Better to fix all the function call
sites together rather than leaving known inconsistencies.
Except for these three spots, the first patch looks like a cut good
enough on its own.Glad to see someone who takes time to spend time on this kind of
stuff with portability in mind, by the way.
Windows portability issues tend to hide in corners like this. I'll
prepare the updated patch series addressing your feedback and post v2
shortly.
[1]: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped
--
Michael
Thanks for taking the time to look over this patch.
--
Bryan Green
EDB: https://www.enterprisedb.com
On Wed, Oct 29, 2025 at 3:42 AM Bryan Green <dbryan.green@gmail.com> wrote:
That said, I'm finding off_t used in many other places throughout the
codebase - buffile.c, various other file utilities such as backup and
archive, probably more. This is likely causing latent bugs elsewhere on
Windows, though most are masked by the 1GB default segment size. I'm
investigating the full scope, but I think this needs to be broken up
into multiple patches. The core file I/O layer (fd.c, md.c,
pg_pwrite/pg_pread) should probably go first since that's what's
actively breaking file extension.
The way I understand this situation, there are two kinds of file I/O,
with respect to large files:
1. Some places *have* to deal with large files (eg navigating in a
potentially large tar file), and there we should already be using
pgoff_t and the relevant system call wrappers should be using the
int64_t stuff Windows provides. These are primarily frontend code.
2. Some places use segmentation *specifically because* there are
systems with 32 bit off_t. These are mostly backend code dealing with
relation data files. The only system left with narrow off_t is
Windows.
In reality the stuff in category 1 has been developed through a
process of bug reports and patches (970b97e and 970b97e^ springs to
mind as the most recent case I had something to with, but see also
stat()-related stuff, and see aa5518304 where we addressed the one
spot in buffile.c that had to consider multiple segments). But the
fact that Windows can't use segments > 2GB because the fd.c and
smgr.c/md.c layers work with off_t is certainly a well known
limitation, ie specifically that relation and temporary/buf files are
special in this way. I'm mostly baffled by the fact that --relsegsize
actually *lets* you set it higher than 2 on that platform. Perhaps we
should at least backpatch a configure check or static assertion to
block that? It's not good if it compiles but doesn't actually work.
For master I think it makes sense to clean this up, as you say,
because the fuzzy boundary between the two categories of file I/O is
bound to cause more problems, it's just unfinished business that has
been tackled piecemeal as required by bug reports... In fact, on a
thread[1]/messages/by-id/CA+hUKG+BGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0=m6dDiA@mail.gmail.com where I explored making the segment size a runtime option
specified at initdb time, I even posted patches much like yours in the
first version, spreading pgoff_t into more places, and then in a later
version it was suggested that it might be better to just block
settings that are too big for your off_t, so I did that. I probably
thought that we already did that somewhere for the current
compile-time constant...
Not urgent since few people hit this in practice, but it's clearly wrong
code.
Yeah. In my experience dealing with bug reports, the Windows users
community skews very heavily towards just consuming EDB's read-built
installer. We rarely hear about configuration-level problems, so I
suppose it's not surprising that no one has ever complained that it
lets you configure it in a way that we hackers all know is certainly
going to break.
[1]: /messages/by-id/CA+hUKG+BGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0=m6dDiA@mail.gmail.com
On Thu, Nov 6, 2025 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
The only system left with narrow off_t is Windows.
By the way, that's not always true: Meson + MinGW have a 64-bit off_t
on Windows, because meson decides to set -D_FIILE_OFFSET_BITS=64 for
all C compilers except msvc[1]https://github.com/mesonbuild/meson/blob/97a1c567c9813176e4bec40f6055f228b2121609/mesonbuild/compilers/compilers.py#L1144 (only other exclusion is macOS, but
that is 64-bit only these days; there are other systems like FreeBSD
where sizeof(off_t) is always 8 but it doesn't seem to know about that
or bother to check), and MinGW's headers react to that. I suspect
autoconf's AC_SYS_LARGEFILE would do that too with MinGW, IDK, but
configure.ac doesn't call it for win32 by special condition. That
creates a strange difference between meson and autoconf builds IMHO,
but if we resolve that in the only direction possible AFAICS we'd
still have a strange difference between MSVC and MinGW.
Observing that mess, I kinda wonder what would happen if we just used
a big hammer to redefine off_t to be __int64 ourselves. On the face
of it, it sounds like an inherently bad idea that could bite you when
interacting with libraries whose headers use off_t. On the other
hand, the world of open source libraries we care about might already
be resistant to that chaos, if libraries are being built with and
without -D_FIILE_OFFSET_BITS=64 willy-nilly, or they actually can't
deal with large files at all in which case that's something we'd have
to deal with whatever we do. I don't know, it's just a thought that
occurred to me while contemplating how unpleasant it is to splatter
pgoff_t all over our tree, and yet *still* have to tread very
carefully with the boundaries of external libraries that might be
using off_t, researching each case...
On 11/6/2025 3:20 AM, Thomas Munro wrote:
On Wed, Oct 29, 2025 at 3:42 AM Bryan Green <dbryan.green@gmail.com> wrote:
That said, I'm finding off_t used in many other places throughout the
codebase - buffile.c, various other file utilities such as backup and
archive, probably more. This is likely causing latent bugs elsewhere on
Windows, though most are masked by the 1GB default segment size. I'm
investigating the full scope, but I think this needs to be broken up
into multiple patches. The core file I/O layer (fd.c, md.c,
pg_pwrite/pg_pread) should probably go first since that's what's
actively breaking file extension.The way I understand this situation, there are two kinds of file I/O,
with respect to large files:1. Some places *have* to deal with large files (eg navigating in a
potentially large tar file), and there we should already be using
pgoff_t and the relevant system call wrappers should be using the
int64_t stuff Windows provides. These are primarily frontend code.
2. Some places use segmentation *specifically because* there are
systems with 32 bit off_t. These are mostly backend code dealing with
relation data files. The only system left with narrow off_t is
Windows.In reality the stuff in category 1 has been developed through a
process of bug reports and patches (970b97e and 970b97e^ springs to
mind as the most recent case I had something to with, but see also
stat()-related stuff, and see aa5518304 where we addressed the one
spot in buffile.c that had to consider multiple segments). But the
fact that Windows can't use segments > 2GB because the fd.c and
smgr.c/md.c layers work with off_t is certainly a well known
limitation, ie specifically that relation and temporary/buf files are
special in this way. I'm mostly baffled by the fact that --relsegsize
actually *lets* you set it higher than 2 on that platform. Perhaps we
should at least backpatch a configure check or static assertion to
block that? It's not good if it compiles but doesn't actually work.
I agree that the backpatch should just block setting -relsegsize > 2GB
on Windows.
For master I think it makes sense to clean this up, as you say,
because the fuzzy boundary between the two categories of file I/O is
bound to cause more problems, it's just unfinished business that has
been tackled piecemeal as required by bug reports... In fact, on a
thread[1] where I explored making the segment size a runtime option
specified at initdb time, I even posted patches much like yours in the
first version, spreading pgoff_t into more places, and then in a later
version it was suggested that it might be better to just block
settings that are too big for your off_t, so I did that. I probably
thought that we already did that somewhere for the current
compile-time constant...
For master, I'd like to proceed with the cleanup approach - spreading
pgoff_t into the core I/O layer (fd.c, md.c, pg_pread/pg_pwrite
wrappers, etc). That would let us eliminate the artificial 2GB ceiling
on Windows and clean up the file I/O category boundary.
Not urgent since few people hit this in practice, but it's clearly wrong
code.Yeah. In my experience dealing with bug reports, the Windows users
community skews very heavily towards just consuming EDB's read-built
installer. We rarely hear about configuration-level problems, so I
suppose it's not surprising that no one has ever complained that it
lets you configure it in a way that we hackers all know is certainly
going to break.[1] /messages/by-id/CA+hUKG+BGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0=m6dDiA@mail.gmail.com
Thanks for the feedback.
--
Bryan Green
EDB: https://www.enterprisedb.com
On 11/5/2025 11:05 PM, Michael Paquier wrote:
On Tue, Oct 28, 2025 at 09:42:11AM -0500, Bryan Green wrote:
I found two related bugs in PostgreSQL's Windows port that prevent files
from exceeding 2GB. While unlikely to affect most installations (default 1GB
segments), the code is objectively wrong and worth fixing.The first bug is a pervasive use of off_t where pgoff_t should be used. On
Windows, off_t is only 32-bit, causing signed integer overflow at exactly
2GB (2^31 bytes). PostgreSQL already defined pgoff_t as __int64 for this
purpose and some function declarations in headers already use it, but the
implementations weren't updated to match.Ugh. That's the same problem as "long", which is 8 bytes wide
everywhere except WIN32. Removing traces of "long" from the code has
been a continuous effort over the years because of these silent
overflow issues.After fixing all those off_t issues, there's a second bug at 4GB in the
Windows implementations of pg_pwrite()/pg_pread() in win32pwrite.c and
win32pread.c. The current implementation uses an OVERLAPPED structure for
positioned I/O, but only sets the Offset field (low 32 bits), leaving
OffsetHigh at zero. This works up to 4GB by accident, but beyond that,
offsets wrap around.I can reproduce both bugs reliably with --with-segsize=8. The file grows to
exactly 2GB and fails with "could not extend file: Invalid argument" despite
having 300GB free. After fixing the off_t issues, it grows to exactly 4GB
and hits the OVERLAPPED bug. Both are independently verifiable.The most popular option in terms of installation on Windows is the EDB
installer, where I bet that a file segment size of 1GB is what's
embedded in the code compiled. This argument would not hold with WAL
segment sizes if we begin to support even higher sizes than 1GB at
some point, and we use pwrite() in the WAL insert code. That should
not be a problem even in the near future.It's safe for all platforms since pgoff_t equals off_t on Unix where off_t
is already 64-bit. Only Windows behavior changes.win32_port.h and port.h say so, yeah.
Not urgent since few people hit this in practice, but it's clearly wrong
code.
Someone building with larger segments would see failures at 2GB and
potential corruption at 4GB. Windows supports files up to 16 exabytes - no
good reason to limit PostgreSQL to 2GB.The same kind of limitations with 4GB files existed with stat() and
fstat(), but it was much more complicated than what you are doing
here, where COPY was not able to work with files larger than 4GB on
WIN32. See the saga from bed90759fcbc.I have attached the patch to fix the relation extension problems for Windows
to this email.Can provide the other patches that changes off_t for pgoff_t in the rest of
the code if there's interest in fixing this.Yeah, I think that we should rip out these issues, and move to the
more portable pgoff_t across the board. I doubt that any of this
could be backpatched due to the potential ABI breakages these
signatures changes would cause. Implementing things in incremental
steps is more sensible when it comes to such changes, as a revert
blast can be reduced if a portion is incorrectly handled.I'm seeing as well the things you are pointing in buffile.c. These
should be fixed as well. The WAL code is less annoying due to the 1GB
WAL segment size limit, still consistency across the board makes the
code easier to reason about, at least.Among the files you have mentioned, there is also copydir.c.
pg_rewind seems also broken with files larger than 4GB, from what I
can see in libpq_source.c and filemap.c.. Worse. Oops.- /* Note that this changes the file position, despite not using it. */ - overlapped.Offset = offset; + overlapped.Offset = (DWORD) offset; + overlapped.OffsetHigh = (DWORD) (offset >> 32);Based on the docs at [1], yes, this change makes sense.
It seems to me that a couple of extra code paths should be handled in
the first patch, and I have spotted three of them. None of them are
critical as they are related to WAL segments, just become masked and
inconsistent:
- xlogrecovery.c, pg_pread() called with a cast to off_t. WAL
segments have a max size of 1GB, meaning that we're OK.
- xlogreader.c, pg_pread() with a cast to off_t.
- walreceiver.c, pg_pwrite().Except for these three spots, the first patch looks like a cut good
enough on its own.
Latest patch attached that includes these code paths.
Glad to see someone who takes time to spend time on this kind of
stuff with portability in mind, by the way.[1]: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped
--
Michael
Thanks for the quick reviewing.
--
Bryan Green
EDB: https://www.enterprisedb.com
Attachments:
v2-0001-Fix-Windows-file-IO.patchtext/plain; charset=UTF-8; name=v2-0001-Fix-Windows-file-IO.patchDownload+78-83
Hi,
On 2025-11-06 11:17:52 -0600, Bryan Green wrote:
From d3f7543a35b3b72a7069188302cbfc7e4de9120b Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 6 Nov 2025 10:56:02 -0600
Subject: [PATCH] Fix Windows file I/O to support files larger than 2GB
Could we add a testcase that actually exercises at least some of the
codepaths? We presumably wouldn't want to actually write that much data, but
it shouldn't be hard to write portable code to create a file with holes...
Greetings,
Andres Freund
On Thu, Nov 06, 2025 at 12:45:30PM -0500, Andres Freund wrote:
Could we add a testcase that actually exercises at least some of the
codepaths? We presumably wouldn't want to actually write that much data, but
it shouldn't be hard to write portable code to create a file with holes...
With something that relies on a pg_pwrite() and pg_pread(), that does
not sound like an issue to me.
FWIW, I have wanted a test module that does FS-level operations for
some time. Here, we could just have thin wrappers of the write and
read calls and a give way for the tests to pass directly arguments to
them via a SQL function call. That would be easier to extend
depending on what comes next. Not sure that this is absolutely
mandatory for the sake of the proposal, though, but long-term that's
something we should do more to stress the portability of the code.
--
Michael
On Thu, Nov 06, 2025 at 11:10:00PM +1300, Thomas Munro wrote:
Observing that mess, I kinda wonder what would happen if we just used
a big hammer to redefine off_t to be __int64 ourselves. On the face
of it, it sounds like an inherently bad idea that could bite you when
interacting with libraries whose headers use off_t. On the other
hand, the world of open source libraries we care about might already
be resistant to that chaos, if libraries are being built with and
without -D_FIILE_OFFSET_BITS=64 willy-nilly, or they actually can't
deal with large files at all in which case that's something we'd have
to deal with whatever we do. I don't know, it's just a thought that
occurred to me while contemplating how unpleasant it is to splatter
pgoff_t all over our tree, and yet *still* have to tread very
carefully with the boundaries of external libraries that might be
using off_t, researching each case...
Not sure about that. It's always been something we have tackled with
the various pg_ and PG_ structures and definitions. Not sure that
this has to apply here since we already have one pgoff_t for the
purpose of portability for quite a few years already (d00a3472cfc4).
--
Michael
On Thu, Nov 06, 2025 at 11:17:52AM -0600, Bryan Green wrote:
On 11/5/2025 11:05 PM, Michael Paquier wrote:
It seems to me that a couple of extra code paths should be handled in
the first patch, and I have spotted three of them. None of them are
critical as they are related to WAL segments, just become masked and
inconsistent:
- xlogrecovery.c, pg_pread() called with a cast to off_t. WAL
segments have a max size of 1GB, meaning that we're OK.
- xlogreader.c, pg_pread() with a cast to off_t.
- walreceiver.c, pg_pwrite().Except for these three spots, the first patch looks like a cut good
enough on its own.Latest patch attached that includes these code paths.
That feels OK for me. Thomas, do you have a different view on the
matter for HEAD? Like long, I would just switch to something that we
have in the tree that's fixed.
And +1 for the idea to restrict the segment size to never be more than
2GB based on a ./configure and meson check on the back branches. In
PG15 and older branches, we already enforced a check by the way. See
src/tools/msvc/Solution.pm which was the only way to compile the code
with visual studio so one would have never seen the limitations except
if they had the idea to edit the perl scripts (FWIW, I've done exactly
that in the past for a past project at $company, never touched the
segsize):
# only allow segsize 1 for now, as we can't do large files yet in windows
die "Bad segsize $options->{segsize}"
unless $options->{segsize} == 1;
So this is a meson issue that goes down to v16, when using a VS
compiler. Was there a different compiler where off_t is also 4 bytes?
MinGW is mentioned as clear by Thomas.
--
Michael
On 11/6/25 11:45, Andres Freund wrote:
Hi,
On 2025-11-06 11:17:52 -0600, Bryan Green wrote:
From d3f7543a35b3b72a7069188302cbfc7e4de9120b Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 6 Nov 2025 10:56:02 -0600
Subject: [PATCH] Fix Windows file I/O to support files larger than 2GBCould we add a testcase that actually exercises at least some of the
codepaths? We presumably wouldn't want to actually write that much data, but
it shouldn't be hard to write portable code to create a file with holes...Greetings,
Andres Freund
Agreed. Added a test module called test_large_files. There is a README.
--
Bryan Green
EDB: https://www.enterprisedb.com
Attachments:
v3-0001-Fix-Windows-file-IO.patchtext/plain; charset=UTF-8; name=v3-0001-Fix-Windows-file-IO.patchDownload+557-83
On Fri, Nov 7, 2025 at 11:29 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Nov 06, 2025 at 11:17:52AM -0600, Bryan Green wrote:
Latest patch attached that includes these code paths.
That feels OK for me. Thomas, do you have a different view on the
matter for HEAD? Like long, I would just switch to something that we
have in the tree that's fixed.
WFM.
- /* Note that this changes the file position, despite not using it. */
Why drop these comments? They still apply.
-
Accidental whitespace change?
struct radvisory
{
- off_t ra_offset; /* offset into the file */
+ pgoff_t ra_offset; /* offset into the file */
IIRC this is a struct definition from an Apple man page, maybe leave unchanged?
Looking at the v3 that arrived while I was typing this:
+ errmsg("sparse files are only supported on Windows")));
Nit: maybe sparse file test only supported on Windows? Also, nice test!
And +1 for the idea to restrict the segment size to never be more than
2GB based on a ./configure and meson check on the back branches. In
PG15 and older branches, we already enforced a check by the way. See
src/tools/msvc/Solution.pm which was the only way to compile the code
with visual studio so one would have never seen the limitations except
if they had the idea to edit the perl scripts (FWIW, I've done exactly
that in the past for a past project at $company, never touched the
segsize):
# only allow segsize 1 for now, as we can't do large files yet in windows
die "Bad segsize $options->{segsize}"
unless $options->{segsize} == 1;So this is a meson issue that goes down to v16, when using a VS
compiler. Was there a different compiler where off_t is also 4 bytes?
Ohh, this all makes more sense now. I wasn't wrong to think there was
already a check, it just didn't get ported to meson.
I wouldn't personally pitch the commit message as "Fix ...", which
sounds like a bug fix. There *is* a bug, but it's in the meson work.
Something more like "Allow large relation files on Windows" seems more
appropriate for this one, but YMMV.
MinGW is mentioned as clear by Thomas.
Only MinGW + meson. MinGW + configure has 32-bit off_t as far as I
can tell because we do:
if test "$PORTNAME" != "win32"; then
AC_SYS_LARGEFILE
...
I don't personally know of any current Unix without LFS, they just
vary on whether it's always on or you have to ask for it, as autoconf
and meson know. But I suppose the check for oversized segments should
use sizeof(off_t), not the OS's identity.
There are of course a few filesystems even today that don't let you
make a file as big as our maximum size, but that's another topic.
On Fri, Nov 07, 2025 at 02:45:32PM +1300, Thomas Munro wrote:
Only MinGW + meson. MinGW + configure has 32-bit off_t as far as I
can tell because we do:if test "$PORTNAME" != "win32"; then
AC_SYS_LARGEFILE
...I don't personally know of any current Unix without LFS, they just
vary on whether it's always on or you have to ask for it, as autoconf
and meson know. But I suppose the check for oversized segments should
use sizeof(off_t), not the OS's identity.
Yes, I was first wondering about the addition of a WIN32 check for
meson, but this is a much better idea for both ./configure and meson.
There is a cc.sizeof(), which I guess should be enough to report the
size of off_t, and fail if we try a size larger than 4GB for the
segment file when a 4-byte off_t is detected. It's something that I'd
rather backpatch first down to v16, before moving on with more pgoff_t
integration in the tree, mostly for history clarity. That's clearly
an oversight.
--
Michael
On Fri, Nov 7, 2025 at 3:13 PM Michael Paquier <michael@paquier.xyz> wrote:
There is a cc.sizeof(), which I guess should be enough to report the
size of off_t, and fail if we try a size larger than 4GB for the
segment file when a 4-byte off_t is detected.
(It's signed per POSIX and on Windows so I assume you meant to write 2GB here.)
On Fri, Nov 07, 2025 at 03:56:07PM +1300, Thomas Munro wrote:
On Fri, Nov 7, 2025 at 3:13 PM Michael Paquier <michael@paquier.xyz> wrote:
There is a cc.sizeof(), which I guess should be enough to report the
size of off_t, and fail if we try a size larger than 4GB for the
segment file when a 4-byte off_t is detected.(It's signed per POSIX and on Windows so I assume you meant to write 2GB here.)
Yes, right..
--
Michael
On Fri, Nov 07, 2025 at 12:32:21PM +0900, Michael Paquier wrote:
On Fri, Nov 07, 2025 at 03:56:07PM +1300, Thomas Munro wrote:
(It's signed per POSIX and on Windows so I assume you meant to write 2GB here.)
Yes, right..
So, please find attached a patch that adds a check for large files in
meson as we do now in ./configure, for a backpatch down to v16 (once
the branch freeze is lifted next week of course).
Thoughts?
--
Michael
Attachments:
0001-Add-check-for-large-files-in-meson.build.patchtext/x-diff; charset=us-asciiDownload+8-1
On 11/8/2025 3:40 AM, Michael Paquier wrote:
On Fri, Nov 07, 2025 at 12:32:21PM +0900, Michael Paquier wrote:
On Fri, Nov 07, 2025 at 03:56:07PM +1300, Thomas Munro wrote:
(It's signed per POSIX and on Windows so I assume you meant to write 2GB here.)
Yes, right..
So, please find attached a patch that adds a check for large files in
meson as we do now in ./configure, for a backpatch down to v16 (once
the branch freeze is lifted next week of course).Thoughts?
--
Michael
I have attached v4 of the patch after correcting some whitespace errors
and a struct that didn't need the pgoff_t modification as requested by
Thomas. I also have attached Michael's patch where both patches can be
found by cfbot for the commitfest.
--
Bryan Green
EDB: https://www.enterprisedb.com
Attachments:
v4-0001-Add-Windows-support-for-large-files.patchtext/plain; charset=UTF-8; name=v4-0001-Add-Windows-support-for-large-files.patchDownload+556-82
0001-Add-check-for-large-files-in-meson.build.patchtext/plain; charset=UTF-8; name=0001-Add-check-for-large-files-in-meson.build.patchDownload+8-1
On Tue, Nov 11, 2025 at 01:22:49PM -0600, Bryan Green wrote:
I have attached v4 of the patch after correcting some whitespace errors
and a struct that didn't need the pgoff_t modification as requested by
Thomas. I also have attached Michael's patch where both patches can be
found by cfbot for the commitfest.
Thanks. As the stamps have been pushed for the next minor release, I
have applied and backpatched the meson check for now. I'll look at
your patch next, for HEAD.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Tue, Nov 11, 2025 at 01:22:49PM -0600, Bryan Green wrote:
I have attached v4 of the patch after correcting some whitespace errors
and a struct that didn't need the pgoff_t modification as requested by
Thomas. I also have attached Michael's patch where both patches can be
found by cfbot for the commitfest.Thanks. As the stamps have been pushed for the next minor release, I
have applied and backpatched the meson check for now. I'll look at
your patch next, for HEAD.
I just noticed that the check is for 2GB, but the error message says
1GB.
- ilmari