[PING] fallocate() causes btrfs to never compress postgresql files

Started by Dimitrios Apostolou, 8 months ago (27 messages)

Hello, sorry for mass sending this, but I didn't get any response to my
first email [1] so I'm now CC'ing the commit's 4d330a6 [2] author and the
reviewers. I think it's an important issue, because I need to
custom-compile postgresql to have what I had before: a transparently
compressed database.

[1]: /messages/by-id/d0f4fc11-969d-7b3a-aacf-00f86450e738@gmx.net
[2]: https://github.com/postgres/postgres/commit/4d330a61bb1969df31f2cebfe1ba9d1d004346d8

My previous message follows:

Hi,

this is just a heads-up about files being generated by PostgreSQL 17 not
being compressed by Btrfs, even when mounted with the force-compress mount
option. I have this occurring aggressively when restoring a database via
pg_restore. I think this is caused by mdzeroextend() calling FileFallocate(),
which in turn invokes posix_fallocate().

I also verified that turning off the use of fallocate causes the database
to write compressed files again, like it did in older versions.
Unfortunately the only way I found was to configure with a "hack" so that
autoconf thinks the feature is not available:

./configure ac_cv_func_posix_fallocate=no

There have been discussions on the btrfs mailing list about why it does
that; the summary is that it is very difficult to guarantee that
compressed writes will not fail with ENOSPC on a CoW filesystem, thus
files with fallocate()d ranges are treated as being marked NOCOW,
effectively disabling compression.

Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
it the filesystem at fault for not returning EOPNOTSUPP, in which case
postgres would use its fallback code?

BTW even in the last case, PostgreSQL would not notice the lack of
fallocate() support as glibc implements a userspace fallback in
posix_fallocate(). That fallback has its own issues that hopefully will
not affect postgres (see CAVEATS in man 3 posix_fallocate).
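
For illustration, here is a minimal sketch of the kind of application-level fallback discussed above, assuming a POSIX system. The helper names extend_file(), write_zeroes() and file_size() are made up for this example and are not PostgreSQL functions. With glibc, the EOPNOTSUPP branch may never be reached, precisely because of the userspace emulation mentioned above:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write "len" zero bytes at "offset"; returns 0 or an errno value. */
static int
write_zeroes(int fd, off_t offset, off_t len)
{
	char		zeroes[8192];
	off_t		done = 0;

	memset(zeroes, 0, sizeof(zeroes));
	while (done < len)
	{
		size_t		chunk = sizeof(zeroes);
		ssize_t		n;

		if ((off_t) chunk > len - done)
			chunk = (size_t) (len - done);
		n = pwrite(fd, zeroes, chunk, offset + done);
		if (n < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted, retry the same chunk */
			return errno;
		}
		done += n;
	}
	return 0;
}

/*
 * Hypothetical helper: extend a file by "len" bytes starting at "offset".
 * Tries posix_fallocate() first and writes zeroes if the call reports that
 * the operation is unsupported.  Returns 0 or an errno value.
 */
static int
extend_file(int fd, off_t offset, off_t len)
{
	int			ret;

	assert(len > 0);

	/* posix_fallocate() returns the error number directly, not via errno. */
	do
	{
		ret = posix_fallocate(fd, offset, len);
	} while (ret == EINTR);

	if (ret == EOPNOTSUPP || ret == EINVAL)
		ret = write_zeroes(fd, offset, len);
	return ret;
}

/* Hypothetical helper: current file size in bytes, or -1 on error. */
static off_t
file_size(int fd)
{
	struct stat st;

	return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}
```

Either path leaves the file at the requested size; what differs is whether the filesystem sees a preallocation request or ordinary writes.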

Regards,
Dimitris

#2Tomas Vondra
tomas@vondra.me
In reply to: Dimitrios Apostolou (#1)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On 5/28/25 16:22, Dimitrios Apostolou wrote:

Hello, sorry for mass sending this, but I didn't get any response to my
first email [1] so I'm now CC'ing the commit's 4d330a6 [2] author and
the reviewers. I think it's an important issue, because I need to
custom-compile postgresql to have what I had before: a transparently
compressed database.

That message arrived a couple days before the feature freeze, so
everyone was busy with getting PG18 patches over the line. I assume
that's why no one responded to a message about an issue that already
affects PG17. We're in the quieter part of the dev cycle, people are
recovering etc. Hence the delay.

[1] /messages/by-id/d0f4fc11-969d-7b3a-aacf-00f86450e738@gmx.net
[2] https://github.com/postgres/postgres/commit/4d330a61bb1969df31f2cebfe1ba9d1d004346d8

My previous message follows:

Hi,

this is just a heads-up about files being generated by PostgreSQL 17 not
being compressed by Btrfs, even when mounted with the force-compress mount
option. I have this occurring aggressively when restoring a database via
pg_restore. I think this is caused by mdzeroextend() calling FileFallocate(),
which in turn invokes posix_fallocate().

Right, I don't think we're really using posix_fallocate() in other
places, or at least not in places that would matter. And this code comes
from commit 4d330a61bb in PG17:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4d330a61bb1969df31f2cebfe1ba9d1d004346d8

The commit message explains why we do that - it has advantages when
allocating a large number of blocks. FWIW it's general code, used whenever
we need to add space to a relation, not just for pg_restore.

I also verified that turning off the use of fallocate causes the database
to write compressed files again, like it did in older versions.
Unfortunately the only way I found was to configure with a "hack" so that
autoconf thinks the feature is not available:

   ./configure ac_cv_func_posix_fallocate=no

Unfortunately, that seems pretty heavy-handed, because it will affect
the whole build, no matter which filesystem it gets used with. And I
guess we don't want to disable posix_fallocate() just because one
filesystem does something ... strange.

There have been discussions on the btrfs mailing list about why it does
that; the summary is that it is very difficult to guarantee that
compressed writes will not fail with ENOSPC on a CoW filesystem, thus
files with fallocate()d ranges are treated as being marked NOCOW,
effectively disabling compression.

Isn't guaranteeing success of a write a general issue with compressed
filesystems? Why is posix_fallocate() special in this regard?
Shouldn't the filesystem be defensive and assume the data is not
compressible? Or maybe just return EOPNOTSUPP when in doubt.

Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
it the filesystem at fault for not returning EOPNOTSUPP, in which case
postgres would use its fallback code?

I don't have a clear opinion on whether it's a filesystem issue. Maybe
we should be handling this differently, not sure.

BTW even in the last case, PostgreSQL would not notice the lack of
fallocate() support as glibc implements a userspace fallback in
posix_fallocate(). That fallback has its own issues that hopefully will
not affect postgres (see CAVEATS in man 3 posix_fallocate).

Well, if btrfs starts returning EOPNOTSUPP, and glibc switches to the
userspace fallback, we wouldn't notice. But that's up to the btrfs to
decide if they want to support fallocate. We still need our fallback
anyway, because of other OSes.

regards

--
Tomas Vondra

#3Dimitrios Apostolou
jimis@gmx.net
In reply to: Tomas Vondra (#2)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Wed, 28 May 2025, Tomas Vondra wrote:

Isn't guaranteeing success of a write a general issue with compressed
filesystems? Why is posix_fallocate() special in this regard?
Shouldn't the filesystem be defensive and assume the data is not
compressible? Or maybe just return EOPNOTSUPP when in doubt.

It's not simple for CoW filesystems, including Btrfs and ZFS. As far as I
know, the current design is a compromise; the developers are not exactly
happy with it. I can point you to some discussion, with pointers to
further discussions if you are interested:

https://marc.info/?l=linux-btrfs&m=174310663519516&w=2

BTW even in the last case, PostgreSQL would not notice the lack of
fallocate() support as glibc implements a userspace fallback in
posix_fallocate(). That fallback has its own issues that hopefully will
not affect postgres (see CAVEATS in man 3 posix_fallocate).

Well, if btrfs starts returning EOPNOTSUPP, and glibc switches to the
userspace fallback, we wouldn't notice. But that's up to the btrfs to
decide if they want to support fallocate. We still need our fallback
anyway, because of other OSes.

Btrfs decided a few years back: they will "support" fallocate, but
because real support is very difficult, they disable compression (among
other things) for files with fallocate'd ranges. They can't change that and
return EOPNOTSUPP out of the blue now, but they are open to adding a mount
option to optionally do that:

https://marc.info/?l=linux-btrfs&m=174310663519516&w=2

Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
it the filesystem at fault for not returning EOPNOTSUPP, in which case
postgres would use its fallback code?

I don't have a clear opinion on whether it's a filesystem issue. Maybe
we should be handling this differently, not sure.

All I'm saying is that this is a regression for PostgreSQL users that keep
tablespaces on compressed Btrfs. What could be done on the postgres side is
to provide a runtime setting for avoiding fallocate(), going instead through
the old code path. Ideally this would be an option per tablespace, but even
a global one is better than nothing.

Thanks,
Dimitris

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Dimitrios Apostolou (#3)
2 attachment(s)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Fri, May 30, 2025 at 3:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

All I'm saying is that this is a regression for PostgreSQL users that keep
tablespaces on compressed Btrfs. What could be done on the postgres side is
to provide a runtime setting for avoiding fallocate(), going instead through
the old code path. Ideally this would be an option per tablespace, but even
a global one is better than nothing.

Here's an initial sketch of such a setting. Better name, design,
words welcome. Would need a bit more work to cover temp tables too.
It's slightly tricky to get smgr to behave differently because of the
contents of a system catalogue! I couldn't think of a better way than
exposing it as a flag that the buffer manager layer has to know about
and compute earlier, but that also seems a bit strange, as fallocate
is a highly md.c specific concern. Hmm.

I suppose something like the 0001 part could be back-patched if this
is considered a serious enough problem without other workarounds, so I
did this in two steps. I wonder if there are good reasons to want to
change the number on other file systems. I suppose it at least allows
experimentation.
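
The threshold rule in the attached 0001 patch can be sketched in isolation as follows. The function name should_use_fallocate() is hypothetical; only the condition itself is taken from the md.c hunk of the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-alone version of the check in the 0001 patch:
 * 0 means "never use posix_fallocate()", any other value is the minimum
 * extension size in blocks at which it is preferred over writing zeroes.
 */
static bool
should_use_fallocate(int io_min_fallocate, int numblocks)
{
	assert(io_min_fallocate >= 0);
	assert(numblocks > 0);

	return io_min_fallocate > 0 && numblocks >= io_min_fallocate;
}
```

Note that the patch compares with >=, whereas the code it replaces used numblocks > 8, so with the default of 8 an extension of exactly 8 blocks would now take the fallocate path.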

Attachments:

0001-Add-io_min_fallocate-setting.patch (text/x-patch; charset=US-ASCII)
From 8607189eb19302c509eed78a7a2db55b9a2d70b3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 31 May 2025 22:50:22 +1200
Subject: [PATCH 1/2] Add io_min_fallocate setting.

BTRFS's compression is reported to be disabled by posix_fallocate(), so
offer a way to turn it off.  The previous coding had a threshold of 8
blocks before using that instead of writing zeroes, so make that
configurable.  0 means never, and other numbers specify a threshold in
blocks, defaulting to 8 as before.

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Discussion: https://postgr.es/m/b1843124-fd22-e279-a31f-252dffb6fbf2%40gmx.net
---
 doc/src/sgml/config.sgml                      | 17 +++++++++++++++++
 src/backend/storage/file/fd.c                 |  3 +++
 src/backend/storage/smgr/md.c                 |  6 ++----
 src/backend/utils/misc/guc_tables.c           | 12 ++++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/storage/fd.h                      |  1 +
 6 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f4a0191c55b..7d476665f42 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2684,6 +2684,23 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-min-fallocate" xreflabel="io_min_fallocate">
+       <term><varname>io_min_fallocate</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_min_fallocate</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Threshold at which <function>posix_fallocate()</function> is used to
+         extend data files instead of writing zeroes.  <literal>0</literal>
+         means never (always write
+         zeroes), and other values indicate a number of blocks.
+         The default is <literal>8</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-io-max-combine-limit" xreflabel="io_max_combine_limit">
        <term><varname>io_max_combine_limit</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 0e8299dd556..ff16b5cc6bd 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -164,6 +164,9 @@ bool		data_sync_retry = false;
 /* How SyncDataDirectory() should do its job. */
 int			recovery_init_sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 
+/* At what size FileFallocate() should be preferred over FileZero(). */
+int			io_min_fallocate = 8;
+
 /* Which kinds of files should be opened with PG_O_DIRECT. */
 int			io_direct_flags;
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ccb0faceb5..6d1b9cb65b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -588,11 +588,9 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		 * to allocate page cache space for the extended pages.
 		 *
 		 * However, we don't use FileFallocate() for small extensions, as it
-		 * defeats delayed allocation on some filesystems. Not clear where
-		 * that decision should be made though? For now just use a cutoff of
-		 * 8, anything between 4 and 8 worked OK in some local testing.
+		 * defeats delayed allocation on some filesystems.
 		 */
-		if (numblocks > 8)
+		if (io_min_fallocate > 0 && numblocks >= io_min_fallocate)
 		{
 			int			ret;
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 2f8cbd86759..a75ff8623d9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3265,6 +3265,18 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_min_fallocate",
+			PGC_USERSET,
+			RESOURCES_IO,
+			gettext_noop("Threshold for preferring posix_fallocate() when extending data files."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_min_fallocate,
+		8, 0, INT_MAX
+	},
+
 	{
 		{"io_max_combine_limit",
 			PGC_POSTMASTER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 87ce76b18f4..8b712ef244f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -200,6 +200,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 16		# 1-1000; 0 disables issuing multiple simultaneous IO requests
 #maintenance_io_concurrency = 16	# 1-1000; same as effective_io_concurrency
+#io_min_fallocate = 8			# min size at which to prefer posix_fallocate, 0 = never
 #io_max_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 					# (change requires restart)
 #io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index b77d8e5e30e..b8714e5ceb8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -60,6 +60,7 @@ typedef int File;
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
 extern PGDLLIMPORT int recovery_init_sync_method;
+extern PGDLLIMPORT int io_min_fallocate;
 extern PGDLLIMPORT int io_direct_flags;
 
 /*
-- 
2.39.5

0002-Add-io_min_fallocate-tablespace-option.patch (text/x-patch; charset=US-ASCII)
From 33bb0c22769206a3a752e99a8987be44dd9bdd31 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 31 May 2025 23:26:28 +1200
Subject: [PATCH 2/2] Add io_min_fallocate tablespace option.

Allow io_min_fallocate to be set on a per-tablespace basis.  The
decision must be made by bufmgr.c because md.c can't access catalogs
once the extension lock is held.  The smgrzeroextend() function gains a
flags parameter, and smgrextend() too for consistency.

TODO: temp tables too

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Discussion: https://postgr.es/m/b1843124-fd22-e279-a31f-252dffb6fbf2%40gmx.net
---
 doc/src/sgml/config.sgml               |  7 +++++--
 doc/src/sgml/ref/alter_tablespace.sgml |  1 +
 src/backend/access/common/reloptions.c | 12 +++++++++++-
 src/backend/access/hash/hashpage.c     |  3 +--
 src/backend/storage/buffer/bufmgr.c    | 22 +++++++++++++++++++++-
 src/backend/storage/smgr/bulk_write.c  |  5 +++--
 src/backend/storage/smgr/md.c          | 12 ++++++++----
 src/backend/storage/smgr/smgr.c        | 12 ++++++------
 src/backend/utils/cache/spccache.c     | 13 +++++++++++++
 src/bin/psql/tab-complete.in.c         |  1 +
 src/include/commands/tablespace.h      |  1 +
 src/include/storage/md.h               |  4 ++--
 src/include/storage/smgr.h             |  7 +++++--
 src/include/utils/rel.h                |  1 +
 src/include/utils/spccache.h           |  1 +
 15 files changed, 80 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7d476665f42..7f956666c99 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2696,8 +2696,11 @@ include_dir 'conf.d'
          extend data files instead of writing zeroes.  <literal>0</literal>
          means never (always write
          zeroes), and other values indicate a number of blocks.
-         The default is <literal>8</literal>.
-        </para>
+         The default is <literal>8</literal>.  This value can be overridden
+         for tables in a particular tablespace by setting the tablespace
+         parameter of the same name (see <xref
+         linkend="sql-altertablespace"/>).
+         </para>
        </listitem>
       </varlistentry>
 
diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index d0e08089ddb..351ae62b24a 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -92,6 +92,7 @@ ALTER TABLESPACE <replaceable>name</replaceable> RESET ( <replaceable class="par
       by the configuration parameters of the
       same name (see <xref linkend="guc-seq-page-cost"/>,
       <xref linkend="guc-random-page-cost"/>,
+      <xref linkend="guc-io-min-fallocate"/>,
       <xref linkend="guc-effective-io-concurrency"/>,
       <xref linkend="guc-maintenance-io-concurrency"/>).  This may be useful if
       one tablespace is located on a disk which is faster or slower than the
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 46c1dce222d..703b395c302 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -372,6 +372,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 0, MAX_IO_CONCURRENCY
 	},
+	{
+		{
+			"io_min_fallocate",
+			"Threshold for preferring posix_fallocate() when extending data files.",
+			RELOPT_KIND_TABLESPACE,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"parallel_workers",
@@ -2116,7 +2125,8 @@ tablespace_reloptions(Datum reloptions, bool validate)
 		{"random_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, random_page_cost)},
 		{"seq_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, seq_page_cost)},
 		{"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
-		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)}
+		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)},
+		{"io_min_fallocate", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, io_min_fallocate)}
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index b8e5bd005e5..d8b5f06f435 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1030,8 +1030,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 					true);
 
 	PageSetChecksumInplace(page, lastblock);
-	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
-			   false);
+	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data, 0);
 
 	return true;
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f93131a645e..69dc70d7ad1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -65,6 +65,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/resowner.h"
+#include "utils/spccache.h"
 #include "utils/timestamp.h"
 
 
@@ -2616,6 +2617,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 						Buffer *buffers,
 						uint32 *extended_by)
 {
+	int			smgr_flags = 0;
 	BlockNumber first_block;
 	IOContext	io_context = IOContextForStrategy(strategy);
 	instr_time	io_start;
@@ -2643,6 +2645,24 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		MemSet(buf_block, 0, BLCKSZ);
 	}
 
+	/*
+	 * For multi-block extension, decide if smgr should use fallocate for an
+	 * extension of this size.  It can't decide for itself, because it can't
+	 * access the catalog with the extension lock held.  Likewise, initdb and
+	 * recovery can't access catalogs either.
+	 */
+	if (extend_by > 1 && IsUnderPostmaster)
+	{
+		int			threshold;
+
+		if (InRecovery)
+			threshold = io_min_fallocate;
+		else
+			threshold = get_tablespace_io_min_fallocate(bmr.smgr->smgr_rlocator.locator.spcOid);
+		if (threshold != 0 && extend_by >= threshold)
+			smgr_flags |= SMGR_FLAG_FALLOCATE;
+	}
+
 	/*
 	 * Lock relation against concurrent extensions, unless requested not to.
 	 *
@@ -2836,7 +2856,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	 *
 	 * We don't need to set checksum for all-zero pages.
 	 */
-	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);
+	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, smgr_flags);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index b958be15716..ef821300f38 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -296,11 +296,12 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 				smgrextend(bulkstate->smgr, bulkstate->forknum,
 						   bulkstate->relsize,
 						   &zero_buffer,
-						   true);
+						   SMGR_FLAG_SKIP_FSYNC);
 				bulkstate->relsize++;
 			}
 
-			smgrextend(bulkstate->smgr, bulkstate->forknum, blkno, page, true);
+			smgrextend(bulkstate->smgr, bulkstate->forknum, blkno, page,
+					   SMGR_FLAG_SKIP_FSYNC);
 			bulkstate->relsize++;
 		}
 		else
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6d1b9cb65b2..4f9909755c8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -475,11 +475,12 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
  */
 void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void *buffer, bool skipFsync)
+		 const void *buffer, int flags)
 {
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
+	bool		skipFsync = flags & SMGR_FLAG_SKIP_FSYNC;
 
 	/* If this build supports direct I/O, the buffer must be I/O aligned. */
 	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
@@ -540,11 +541,12 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  */
 void
 mdzeroextend(SMgrRelation reln, ForkNumber forknum,
-			 BlockNumber blocknum, int nblocks, bool skipFsync)
+			 BlockNumber blocknum, int nblocks, int flags)
 {
 	MdfdVec    *v;
 	BlockNumber curblocknum = blocknum;
 	int			remblocks = nblocks;
+	bool		skipFsync = flags & SMGR_FLAG_SKIP_FSYNC;
 
 	Assert(nblocks > 0);
 
@@ -588,9 +590,11 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		 * to allocate page cache space for the extended pages.
 		 *
 		 * However, we don't use FileFallocate() for small extensions, as it
-		 * defeats delayed allocation on some filesystems.
+		 * defeats delayed allocation on some filesystems.  bufmgr.c must make
+		 * that decision, because we can't access the tablespace catalog while
+		 * the extension lock is held.
 		 */
-		if (io_min_fallocate > 0 && numblocks >= io_min_fallocate)
+		if (flags & SMGR_FLAG_FALLOCATE)
 		{
 			int			ret;
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index bce37a36d51..dfe556ca5e2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -97,9 +97,9 @@ typedef struct f_smgr
 	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
 								bool isRedo);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum, const void *buffer, bool skipFsync);
+								BlockNumber blocknum, const void *buffer, int flags);
 	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, int nblocks, bool skipFsync);
+									BlockNumber blocknum, int nblocks, int flags);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum, int nblocks);
 	uint32		(*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
@@ -618,12 +618,12 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
  */
 void
 smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		   const void *buffer, bool skipFsync)
+		   const void *buffer, int flags)
 {
 	HOLD_INTERRUPTS();
 
 	smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
-										 buffer, skipFsync);
+										 buffer, flags);
 
 	/*
 	 * Normally we expect this to increase nblocks by one, but if the cached
@@ -647,12 +647,12 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  */
 void
 smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-			   int nblocks, bool skipFsync)
+			   int nblocks, int flags)
 {
 	HOLD_INTERRUPTS();
 
 	smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
-											 nblocks, skipFsync);
+											 nblocks, flags);
 
 	/*
 	 * Normally we expect this to increase the fork size by nblocks, but if
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index 23458599298..b2eb51d1bab 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -24,6 +24,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "utils/catcache.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -235,3 +236,15 @@ get_tablespace_maintenance_io_concurrency(Oid spcid)
 	else
 		return spc->opts->maintenance_io_concurrency;
 }
+
+int
+get_tablespace_io_min_fallocate(Oid spcid)
+{
+	TableSpaceCacheEntry *spc;
+
+	spc = get_tablespace(spcid);
+	if (!spc->opts || spc->opts->io_min_fallocate < 0)
+		return io_min_fallocate;
+	else
+		return spc->opts->io_min_fallocate;
+}
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index ec65ab79fec..489b391b437 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -3000,6 +3000,7 @@ match_previous_words(int pattern_id,
 	/* ALTER TABLESPACE <foo> SET|RESET ( */
 	else if (Matches("ALTER", "TABLESPACE", MatchAny, "SET|RESET", "("))
 		COMPLETE_WITH("seq_page_cost", "random_page_cost",
+					  "io_min_fallocate",
 					  "effective_io_concurrency", "maintenance_io_concurrency");
 
 	/* ALTER TEXT SEARCH */
diff --git a/src/include/commands/tablespace.h b/src/include/commands/tablespace.h
index 4e8bf4dc0de..986f9d623bf 100644
--- a/src/include/commands/tablespace.h
+++ b/src/include/commands/tablespace.h
@@ -45,6 +45,7 @@ typedef struct TableSpaceOpts
 	float8		seq_page_cost;
 	int			effective_io_concurrency;
 	int			maintenance_io_concurrency;
+	int			io_min_fallocate;
 } TableSpaceOpts;
 
 extern Oid	CreateTableSpace(CreateTableSpaceStmt *stmt);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b563c27abf0..608267599b1 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -30,9 +30,9 @@ extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, const void *buffer, bool skipFsync);
+					 BlockNumber blocknum, const void *buffer, int flags);
 extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum, int nblocks, bool skipFsync);
+						 BlockNumber blocknum, int nblocks, int flags);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum, int nblocks);
 extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 3964d9334b3..b8b4da0e90c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -19,6 +19,9 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 
+#define SMGR_FLAG_SKIP_FSYNC	(1 << 0)
+#define SMGR_FLAG_FALLOCATE 	(1 << 1)
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -90,9 +93,9 @@ extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
-					   BlockNumber blocknum, const void *buffer, bool skipFsync);
+					   BlockNumber blocknum, const void *buffer, int flags);
 extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
-						   BlockNumber blocknum, int nblocks, bool skipFsync);
+						   BlockNumber blocknum, int nblocks, int flags);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks);
 extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index b552359915f..a4592f2f184 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -342,6 +342,7 @@ typedef struct StdRdOptions
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	int			toast_tuple_target; /* target for tuple toasting */
+	int			io_min_fallocate;
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table; /* use as an additional catalog relation */
 	int			parallel_workers;	/* max number of parallel workers */
diff --git a/src/include/utils/spccache.h b/src/include/utils/spccache.h
index d7edd79b18d..497f3707e2a 100644
--- a/src/include/utils/spccache.h
+++ b/src/include/utils/spccache.h
@@ -17,5 +17,6 @@ extern void get_tablespace_page_costs(Oid spcid, float8 *spc_random_page_cost,
 									  float8 *spc_seq_page_cost);
 extern int	get_tablespace_io_concurrency(Oid spcid);
 extern int	get_tablespace_maintenance_io_concurrency(Oid spcid);
+extern int	get_tablespace_io_min_fallocate(Oid spcid);
 
 #endif							/* SPCCACHE_H */
-- 
2.39.5
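
As a reading aid for the 0002 patch, the decision it moves into bufmgr.c can be sketched without any PostgreSQL infrastructure. The two flag values are copied from the smgr.h hunk; effective_threshold() and compute_smgr_flags() are hypothetical names, and the IsUnderPostmaster and InRecovery checks from the patch are deliberately left out:

```c
#include <assert.h>

#define SMGR_FLAG_SKIP_FSYNC	(1 << 0)
#define SMGR_FLAG_FALLOCATE	(1 << 1)

/*
 * Hypothetical helper: a negative tablespace option means "not set", in
 * which case the global setting applies (as in the spccache.c hunk).
 */
static int
effective_threshold(int tablespace_opt, int global_setting)
{
	return tablespace_opt < 0 ? global_setting : tablespace_opt;
}

/*
 * Hypothetical helper: compute the flags passed down to smgrzeroextend()
 * for an extension of "extend_by" blocks.
 */
static int
compute_smgr_flags(int tablespace_opt, int global_setting, int extend_by)
{
	int			flags = 0;

	assert(extend_by > 0);

	/* Single-block extensions never request fallocate. */
	if (extend_by > 1)
	{
		int			threshold = effective_threshold(tablespace_opt,
													global_setting);

		if (threshold != 0 && extend_by >= threshold)
			flags |= SMGR_FLAG_FALLOCATE;
	}
	return flags;
}
```

Under this sketch, a tablespace on compressed Btrfs would set the option to 0 and never get SMGR_FLAG_FALLOCATE, while other tablespaces keep following the global setting.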

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#4)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

Or for a completely different approach: I wonder if ftruncate() would
be more efficient on COW systems anyway. The minimum thing we need is
for the file system to remember the new size, 'cause, erm, we don't.
All the rest is probably a waste of cycles, since they reserve real
space (or fail to) later in the checkpointer or whatever process
eventually writes the data out.
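
A minimal sketch of the ftruncate() idea, assuming a POSIX system and an 8 kB block size. SKETCH_BLCKSZ and both function names are made up for illustration, and this is not code from either patch. It only records the new file size and leaves space reservation to whoever writes the data later:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192		/* assumed block size for this sketch */

/*
 * Hypothetical helper: make sure the file is at least "nblocks" blocks
 * long.  Never shrinks the file.  Returns 0 on success, -1 on error.
 */
static int
extend_with_ftruncate(int fd, long nblocks)
{
	struct stat st;
	off_t		target = (off_t) nblocks * SKETCH_BLCKSZ;

	assert(nblocks > 0);

	if (fstat(fd, &st) != 0)
		return -1;
	if (st.st_size >= target)
		return 0;				/* already large enough */

	/* The extended range reads back as zeroes; no data is written here. */
	return ftruncate(fd, target);
}

/* Hypothetical helper: file size in whole blocks, or -1 on error. */
static long
size_in_blocks(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return -1;
	return (long) (st.st_size / SKETCH_BLCKSZ);
}
```

The trade-off is that a failure to allocate space would only surface at write time, which is the point being raised about CoW filesystems.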

#6Tomas Vondra
tomas@vondra.me
In reply to: Thomas Munro (#4)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On 5/31/25 16:00, Thomas Munro wrote:

On Fri, May 30, 2025 at 3:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

All I'm saying is that this is a regression for PostgreSQL users that keep
tablespaces on compressed Btrfs. What could be done from postgres, is to
provide a runtime setting for avoiding fallocate(), going instead through
the old code path. Ideally this would be an option per tablespace, but even
a global one is better than nothing.

Here's an initial sketch of such a setting. Better name, design,
words welcome. Would need a bit more work to cover temp tables too.
It's slightly tricky to get smgr to behave differently because of the
contents of a system catalogue! I couldn't think of a better way than
exposing it as a flag that the buffer manager layer has to know about
and compute earlier, but that also seems a bit strange, as fallocate
is a highly md.c specific concern. Hmm.

I find the definition of io_min_fallocate confusing, or rather that 0
means "never" instead of "always". It's described as a "threshold at
which to start using fallocate", so I'd expect 0 to mean "always"
because (len >= 0).

I suggest to use "-1" to mean never and "0" always, as for other similar
settings (e.g. log_min_duration_statement or log_lock_waits).
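
The suggested semantics can be sketched as a small predicate. This is only an illustration of the proposal above (-1 means never, 0 means always, otherwise a minimum number of blocks); the function name is hypothetical and this is not the code in the posted patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical predicate for the proposed semantics:
 *   io_min_fallocate == -1  -> never use fallocate
 *   io_min_fallocate ==  0  -> always use fallocate
 *   io_min_fallocate  >  0  -> use it once numblocks reaches the threshold
 */
static bool
sketch_should_fallocate(int numblocks, int io_min_fallocate)
{
    assert(numblocks > 0);
    if (io_min_fallocate < 0)
        return false;
    return numblocks >= io_min_fallocate;
}
```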

I suppose something like the 0001 part could be back-patched if this
is considered a serious enough problem without other workarounds, so I
did this in two steps. I wonder if there are good reasons to want to
change the number on other file systems. I suppose it at least allows
experimentation.

Maybe. It'd need to get some of the 0002 bits too, ofc.

I'm not sure we really want all these special GUCs tailored for different
filesystems. We already have a few such GUCs, it's getting tricky to
know which ones to set / not set, and it also changes with the
filesystem version ... I personally don't know which ones to set, a lot
of the knowledge is somewhat outdated I think.

Wouldn't it be better for btrfs to just start returning EOPNOTSUPP
(maybe with a mount option), in which case we already do the right thing
automatically? Sure, it means the admin needs to be aware of
this in both cases.

regards

--
Tomas Vondra

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#4)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

Thomas Munro <thomas.munro@gmail.com> writes:

It's slightly tricky to get smgr to behave differently because of the
contents of a system catalogue!

The mere thought makes me blanch. I'm okay with the GUC part,
but I do not think we should put in 0002 --- the odds of
causing serious problems greatly outweigh the value, IMO.
Fundamental layering violations tend to bite you on tender
parts of your anatomy.

regards, tom lane

#8Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#6)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Sat, May 31, 2025 at 4:33 PM Tomas Vondra <tomas@vondra.me> wrote:

On 5/31/25 16:00, Thomas Munro wrote:

On Fri, May 30, 2025 at 3:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

All I'm saying is that this is a regression for PostgreSQL users that keep
tablespaces on compressed Btrfs. What could be done from postgres, is to
provide a runtime setting for avoiding fallocate(), going instead through
the old code path. Ideally this would be an option per tablespace, but even
a global one is better than nothing.

Here's an initial sketch of such a setting. Better name, design,
words welcome. Would need a bit more work to cover temp tables too.
It's slightly tricky to get smgr to behave differently because of the
contents of a system catalogue! I couldn't think of a better way than
exposing it as a flag that the buffer manager layer has to know about
and compute earlier, but that also seems a bit strange, as fallocate
is a highly md.c specific concern. Hmm.

I find the definition of io_min_fallocate confusing, [..]

Thanks to Thomas for providing the patch, but (same here) my
take is that making this a GUC that takes a number, instead of
a simple on/off switch, makes it less understandable. I
think io_fallocate=on/off would be easier for the users.

I suppose something like the 0001 part could be back-patched if this
is considered a serious enough problem without other workarounds, so I
did this in two steps. I wonder if there are good reasons to want to
change the number on other file systems. I suppose it at least allows
experimentation.

Maybe. It'd need to get some of the 0002 bits too, ofc.

I'm not sure we really want all these special GUCs tailored for different
filesystems. We already have a few such GUCs, it's getting tricky to
know which ones to set / not set, and it also changes with the
filesystem version ... I personally don't know which ones to set, a lot
of the knowledge is somewhat outdated I think.

Well, XFS also got quite a few reports of regressions due to
fallocate() being used [1], but there you could at least try to
mitigate it. I don't think we'll be able to get away without it, and
the genie is already out of the bottle as the kernels are already
widely used (well, in theory we could add capability that would help
set some of those internal switches based on statfs(/path).fs_type,
but realistically we would still need to have the ability to override
anyway).

-J.

[1]: /messages/by-id/CADofcAV8xu3hCNHq7-7x56KrP9rD6=A04=qjTr3nETh-gptF8w@mail.gmail.com

#9Dimitrios Apostolou
jimis@gmx.net
In reply to: Thomas Munro (#5)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Sun, 1 Jun 2025, Thomas Munro wrote:

Or for a completely different approach: I wonder if ftruncate() would
be more efficient on COW systems anyway. The minimum thing we need is
for the file system to remember the new size, 'cause, erm, we don't.
All the rest is probably a waste of cycles, since they reserve real
space (or fail to) later in the checkpointer or whatever process
eventually writes the data out.

FWIW I asked the btrfs devs. From
https://github.com/kdave/btrfs-progs/pull/976
I quote Qu Wenruo:

Only for falloc(), not ftruncate().

The PREALLOC inode flag is added for any preallocated file extent,
meanwhile truncate only creates holes.

truncate is fast, but it's really different from fallocate in that
nothing is really allocated.

This means the later writes will need to allocate their own data
extents. This is fine and even preferred for btrfs, but may lead to
a performance drop for more traditional fses.

We're in an era where fs features are no longer that generic; fallocate
is just one example. In fact, fallocate will cause more problems
than just no compression.

It's really a deep rabbit hole, and not a simple true or
false question.

In other words, btrfs will not try to allocate anything with ftruncate(),
it will just mark the new space as a "hole". As such, the file is not
marked as "PREALLOC" which is what disables compression. Of course there
is no guarantee that further writes will succeed, and as quoted above,
other (non-COW) filesystems might be slower writing the
ftruncate()-allocated space.
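
One way to observe the difference is to extend a file with each call and compare st_blocks afterwards. The sketch below assumes a POSIX system; the helper name is hypothetical, and the numbers reported depend entirely on the file system, so no particular result is implied.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: extend fd to len bytes, either by preallocating
 * (posix_fallocate) or by only moving end-of-file (ftruncate), then report
 * how many 512-byte units the file system says are allocated.
 * Returns 0 on success, or an errno value on failure.
 */
static int
sketch_extend_and_measure(int fd, off_t len, int use_fallocate,
                          long long *allocated_units)
{
    struct stat st;
    int         rc;

    assert(len > 0 && allocated_units != NULL);

    if (use_fallocate)
        rc = posix_fallocate(fd, 0, len);   /* returns an error number, not -1 */
    else
        rc = (ftruncate(fd, len) == 0) ? 0 : errno;
    if (rc != 0)
        return rc;

    if (fstat(fd, &st) != 0)
        return errno;
    *allocated_units = (long long) st.st_blocks;
    return 0;
}
```

Running it once per method on two fresh files in the same directory shows whether the file system treats the ftruncate()d range as a hole.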

Regards,
Dimitris

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Dimitrios Apostolou (#9)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Mon, Jun 2, 2025 at 10:14 PM Dimitrios Apostolou <jimis@gmx.net> wrote:

On Sun, 1 Jun 2025, Thomas Munro wrote:

Or for a completely different approach: I wonder if ftruncate() would
be more efficient on COW systems anyway. The minimum thing we need is
for the file system to remember the new size, 'cause, erm, we don't.
All the rest is probably a waste of cycles, since they reserve real
space (or fail to) later in the checkpointer or whatever process
eventually writes the data out.

FWIW I asked the btrfs devs. From
https://github.com/kdave/btrfs-progs/pull/976
I quote Qu Wenruo:

Only for falloc(), not ftruncate().

The PREALLOC inode flag is added for any preallocated file extent,
meanwhile truncate only creates holes.

truncate is fast, but it's really different from fallocate in that
nothing is really allocated.

This means the later writes will need to allocate their own data
extents. This is fine and even preferred for btrfs, but may lead to
a performance drop for more traditional fses.

We're in an era where fs features are no longer that generic; fallocate
is just one example. In fact, fallocate will cause more problems
than just no compression.

It's really a deep rabbit hole, and not a simple true or
false question.

In other words, btrfs will not try to allocate anything with ftruncate(),
it will just mark the new space as a "hole". As such, the file is not
marked as "PREALLOC" which is what disables compression. Of course there
is no guarantee that further writes will succeed, and as quoted above,
other (non-COW) filesystems might be slower writing the
ftruncate()-allocated space.

Yeah, right, I know. But PostgreSQL has at least two different goals
when extending a relation:

1. Remember the new size of the relation somewhere*.
2. Reserve space now, so that we can report ENOSPC and roll back the
transaction that wants to extend the relation when the disk is full,
instead of causing a checkpoint or buffer eviction to fail later (see
https://wiki.postgresql.org/wiki/ENOSPC for longer version).

But the second thing just can't work on a COW system by definition, so
the whole notion is bogus, which is why I wondered if ftruncate() is
actually a reasonable option to have, even though it just creates
holes (on Unixen). I also know of another completely different reason
to want to use ftruncate(): NTFS, which *doesn't* create holes (NTFS
supports holes via other syscalls, but ftruncate() or rather
_chsize_s() as they spell it doesn't make them), making it more like
posix_fallocate() in this usage. So I was beginning to wonder if we
might want to experiment with a patch that adds
file_extend_method=fallocate,ftruncate,write. Perhaps accompanied by
a threshold setting below which it always writes. Then we could
experiment with various COW file systems (zfs, btrfs, apfs, refs, ???)
and NTFS to see how that speculation works out in reality.

Wild speculation: To actually achieve the second thing on a COW file
system, you'd probably need some totally new kind of interface,
because that POSIX interface has the wrong shape. I have wondered
about a new fcntl() or whatever that would let you reserve the right
to write N blocks (ie just once!) without ENOSPC on a given
descriptor, that a database could conceptually acquire when dirtying
buffers, since that's the point at which we know that a write must
eventually happen (then probably amortise that accounting a lot),
including but not limited to this relation-extension case, and that
way you could achieve goal #2, ie transferring ENOSPC errors to
transaction time. But that's just a daydream about vapourware. One
problem is that PostgreSQL has many processes with separate file
descriptors, so that'd make the bookkeeping trickier but not
impossible.

(*That has a few known issues...)

#11Bruce Momjian
bruce@momjian.us
In reply to: Thomas Munro (#4)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Sun, Jun 1, 2025 at 02:00:17AM +1200, Thomas Munro wrote:

I suppose something like the 0001 part could be back-patched if this
is considered a serious enough problem without other workarounds, so I
did this in two steps. I wonder if there are good reasons to want to
change the number on other file systems. I suppose it at least allows
experimentation.

Consider that postgresql.conf is installed by initdb, so backpatching
this is not going to add the setting to postgresql.conf unless we do
some magic. That will be confusing to users.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

#12Dimitrios Apostolou
jimis@gmx.net
In reply to: Thomas Munro (#10)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Tue, 3 Jun 2025, Thomas Munro wrote:

On Mon, Jun 2, 2025 at 10:14 PM Dimitrios Apostolou <jimis@gmx.net> wrote:

On Sun, 1 Jun 2025, Thomas Munro wrote:

Or for a completely different approach: I wonder if ftruncate() would
be more efficient on COW systems anyway. The minimum thing we need is
for the file system to remember the new size, 'cause, erm, we don't.
All the rest is probably a waste of cycles, since they reserve real
space (or fail to) later in the checkpointer or whatever process
eventually writes the data out.

FWIW I asked the btrfs devs. From
https://github.com/kdave/btrfs-progs/pull/976
I quote Qu Wenruo:

Only for falloc(), not ftruncate().

The PREALLOC inode flag is added for any preallocated file extent,
meanwhile truncate only creates holes.

truncate is fast, but it's really different from fallocate in that
nothing is really allocated.

This means the later writes will need to allocate their own data
extents. This is fine and even preferred for btrfs, but may lead to
a performance drop for more traditional fses.

We're in an era where fs features are no longer that generic; fallocate
is just one example. In fact, fallocate will cause more problems
than just no compression.

It's really a deep rabbit hole, and not a simple true or
false question.

In other words, btrfs will not try to allocate anything with ftruncate(),
it will just mark the new space as a "hole". As such, the file is not
marked as "PREALLOC" which is what disables compression. Of course there
is no guarantee that further writes will succeed, and as quoted above,
other (non-COW) filesystems might be slower writing the
ftruncate()-allocated space.

Yeah, right, I know. But PostgreSQL has at least two different goals
when extending a relation:

1. Remember the new size of the relation somewhere*.
2. Reserve space now, so that we can report ENOSPC and roll back the
transaction that wants to extend the relation when the disk is full,
instead of causing a checkpoint or buffer eviction to fail later (see
https://wiki.postgresql.org/wiki/ENOSPC for longer version).

But the second thing just can't work on a COW system by definition, so
the whole notion is bogus, which is why I wondered if ftruncate() is
actually a reasonable option to have, even though it just creates
holes (on Unixen). I also know of another completely different reason
to want to use ftruncate(): NTFS, which *doesn't* create holes (NTFS
supports holes via other syscalls, but ftruncate() or rather
_chsize_s() as they spell it doesn't make them), making it more like
posix_fallocate() in this usage. So I was beginning to wonder if we
might want to experiment with a patch that adds
file_extend_method=fallocate,ftruncate,write. Perhaps accompanied by
a threshold setting below which it always writes.

This sounds like the best solution IMO. People can then experiment with
different settings and filesystems, and that way we also learn in the
process. Thank you for the effort and patches so far.

Dimitris

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Dimitrios Apostolou (#12)
1 attachment(s)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Tue, Jun 3, 2025 at 1:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

This sounds like the best solution IMO. People can then experiment with
different settings and filesystems, and that way we also learn in the
process. Thank you for the effort and patches so far.

OK, here's a basic patch to experiment with. You can set:

file_extend_method = fallocate,ftruncate,write
file_extend_method_threshold = 8 # (below 8 always write, 0 means never write)
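
How the two settings interact can be sketched as a pure function that mirrors the condition in the mdzeroextend() hunk of the attached patch. The enum and function names here are hypothetical stand-ins for the patch's FILE_EXTEND_METHOD_* macros. Note that, as posted, the condition requires the threshold to be greater than zero, so a threshold of 0 falls back to plain writes for every extension.

```c
#include <assert.h>

/* Hypothetical stand-ins for the patch's FILE_EXTEND_METHOD_* values. */
typedef enum SketchExtendMethod
{
    SKETCH_EXTEND_WRITE = 1,
    SKETCH_EXTEND_FTRUNCATE = 2,
    SKETCH_EXTEND_FALLOCATE = 3
} SketchExtendMethod;

/*
 * Decide which method extends a relation by numblocks blocks, following
 * the same test as the patch: the configured method is used only when the
 * threshold is positive and the extension is at least that large.
 */
static SketchExtendMethod
sketch_choose_extend_method(int numblocks, SketchExtendMethod configured,
                            int threshold)
{
    assert(numblocks > 0 && threshold >= 0);
    if (threshold > 0 &&
        numblocks >= threshold &&
        configured != SKETCH_EXTEND_WRITE)
        return configured;
    return SKETCH_EXTEND_WRITE;     /* small extension, or method disabled */
}
```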

To really make COPY fly we also need to get write combining and AIO
going (we've had this working with various prototypes, but it all
missed the boat for v18 which can only do that stuff for reads). Then
you'll have concurrent 128kB or up to 1MB writes trundling along in
the background which I guess should work pretty nicely for stuff like
BTRFS/ZFS and compression and all that jazz.

Attachments:

0001-Add-file_extend_method-setting.patch (text/x-patch; charset=US-ASCII)
From 8513b2ec3d31cb5afed9ffc1952326905fd90732 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 31 May 2025 22:50:22 +1200
Subject: [PATCH] Add file_extend_method setting.

BTRFS's compression is reported to be disabled by posix_fallocate(), so
offer a way to turn it off by setting it to either write or ftruncate
instead.  May also be useful for Windows, which lacks fallocate but is
known to allocate space on ftruncate.

The previous coding had a threshold of 8 blocks before using a
bulk-extension system call instead of writing zeroes, so also make that
configurable, as file_extend_method_threshold.  0 means never, and other
numbers specify a threshold in blocks, defaulting to 8 as before.

XXX WIP

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Discussion: https://postgr.es/m/b1843124-fd22-e279-a31f-252dffb6fbf2%40gmx.net
---
 src/backend/storage/file/fd.c                 |  6 ++++
 src/backend/storage/smgr/md.c                 | 27 ++++++++++------
 src/backend/utils/misc/guc_tables.c           | 31 +++++++++++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/fd.h                      | 13 ++++++++
 5 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 0e8299dd556..046e285e84f 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -164,6 +164,12 @@ bool		data_sync_retry = false;
 /* How SyncDataDirectory() should do its job. */
 int			recovery_init_sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 
+/* How data files should be bulk-extended with zeroes. */
+int			file_extend_method = DEFAULT_FILE_EXTEND_METHOD;
+
+/* At what size file_extend_method is used instead of plain write. */
+int			file_extend_method_threshold = 8;
+
 /* Which kinds of files should be opened with PG_O_DIRECT. */
 int			io_direct_flags;
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ccb0faceb5..e1de6a26a67 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -588,23 +588,32 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		 * to allocate page cache space for the extended pages.
 		 *
 		 * However, we don't use FileFallocate() for small extensions, as it
-		 * defeats delayed allocation on some filesystems. Not clear where
-		 * that decision should be made though? For now just use a cutoff of
-		 * 8, anything between 4 and 8 worked OK in some local testing.
+		 * defeats delayed allocation on some filesystems.
 		 */
-		if (numblocks > 8)
+		if (file_extend_method_threshold > 0 &&
+			numblocks >= file_extend_method_threshold &&
+			file_extend_method != FILE_EXTEND_METHOD_WRITE)
 		{
 			int			ret;
 
-			ret = FileFallocate(v->mdfd_vfd,
-								seekpos, (off_t) BLCKSZ * numblocks,
-								WAIT_EVENT_DATA_FILE_EXTEND);
+			if (file_extend_method == FILE_EXTEND_METHOD_FTRUNCATE)
+				ret = FileTruncate(v->mdfd_vfd,
+								   seekpos + (off_t) BLCKSZ * numblocks,
+								   WAIT_EVENT_DATA_FILE_EXTEND);
+#ifdef FILE_EXTEND_METHOD_FALLOCATE
+			else
+				ret = FileFallocate(v->mdfd_vfd,
+									seekpos, (off_t) BLCKSZ * numblocks,
+									WAIT_EVENT_DATA_FILE_EXTEND);
+#endif
 			if (ret != 0)
 			{
 				ereport(ERROR,
 						errcode_for_file_access(),
-						errmsg("could not extend file \"%s\" with FileFallocate(): %m",
-							   FilePathName(v->mdfd_vfd)),
+						errmsg("could not extend file \"%s\" with %s(): %m",
+							   FilePathName(v->mdfd_vfd),
+							   file_extend_method == FILE_EXTEND_METHOD_FTRUNCATE ?
+							   "FileTruncate" : "FileFallocate"),
 						errhint("Check free disk space."));
 			}
 		}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f04bfedb2fd..3d779d3f4dc 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -491,6 +491,15 @@ static const struct config_enum_entry file_copy_method_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry file_extend_method_options[] = {
+	{"write", FILE_EXTEND_METHOD_WRITE, false},
+	{"ftruncate", FILE_EXTEND_METHOD_FTRUNCATE, false},
+#ifdef FILE_EXTEND_METHOD_FALLOCATE
+	{"fallocate", FILE_EXTEND_METHOD_FALLOCATE, false},
+#endif
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -3265,6 +3274,18 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"file_extend_method_threshold",
+			PGC_USERSET,
+			RESOURCES_DISK,
+			gettext_noop("Threshold for using methods other than write when extending data files."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&file_extend_method_threshold,
+		8, 0, INT_MAX
+	},
+
 	{
 		{"io_max_combine_limit",
 			PGC_POSTMASTER,
@@ -5264,6 +5285,16 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"file_extend_method", PGC_USERSET, RESOURCES_DISK,
+			gettext_noop("Selects the method used for extending data files."),
+			NULL
+		},
+		&file_extend_method,
+		DEFAULT_FILE_EXTEND_METHOD, file_extend_method_options,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_sync_method", PGC_SIGHUP, WAL_SETTINGS,
 			gettext_noop("Selects the method used for forcing WAL updates to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 341f88adc87..4dbad4400c8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -179,6 +179,8 @@
 					# in kilobytes, or -1 for no limit
 
 #file_copy_method = copy		# copy, clone (if supported by OS)
+#file_extend_method = fallocate		# fallocate, ftruncate, write
+#file_extend_method_threshold = 8	# min to prefer selected method, 0 = never
 
 #max_notify_queue_pages = 1048576	# limits the number of SLRU pages allocated
 					# for NOTIFY / LISTEN queue
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index b77d8e5e30e..25a39e6d539 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -55,11 +55,24 @@ typedef int File;
 #define IO_DIRECT_WAL			0x02
 #define IO_DIRECT_WAL_INIT		0x04
 
+#define FILE_EXTEND_METHOD_WRITE 1
+#define FILE_EXTEND_METHOD_FTRUNCATE 2
+#ifdef HAVE_POSIX_FALLOCATE
+#define	FILE_EXTEND_METHOD_FALLOCATE 3
+#endif
+
+#ifdef FILE_EXTEND_METHOD_FALLOCATE
+#define DEFAULT_FILE_EXTEND_METHOD FILE_EXTEND_METHOD_FALLOCATE
+#else
+#define DEFAULT_FILE_EXTEND_METHOD FILE_EXTEND_METHOD_WRITE
+#endif
 
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
 extern PGDLLIMPORT int recovery_init_sync_method;
+extern PGDLLIMPORT int file_extend_method_threshold;
+extern PGDLLIMPORT int file_extend_method;
 extern PGDLLIMPORT int io_direct_flags;
 
 /*
-- 
2.47.2

#14Dimitrios Apostolou
jimis@gmx.net
In reply to: Thomas Munro (#13)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Mon, 9 Jun 2025, Thomas Munro wrote:

On Tue, Jun 3, 2025 at 1:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

This sounds like the best solution IMO. People can then experiment with
different settings and filesystems, and that way we also learn in the
process. Thank you for the effort and patches so far.

OK, here's a basic patch to experiment with. You can set:

file_extend_method = fallocate,ftruncate,write
file_extend_method_threshold = 8 # (below 8 always write, 0 means never write)

I applied the patch on PostgreSQL v17 and am testing it now. I chose
ftruncate method and I see ftruncate in action using strace while doing
pg_restore of a big database. Nothing unexpected has happened so far. I
also verified that files are being compressed, obeying Btrfs's mount
option compress=zstd.

Thanks for the patch! What are the odds of committing it to v17?

Dimitris

#15Dimitrios Apostolou
jimis@gmx.net
In reply to: Dimitrios Apostolou (#14)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Thu, 12 Jun 2025, Dimitrios Apostolou wrote:

On Mon, 9 Jun 2025, Thomas Munro wrote:

On Tue, Jun 3, 2025 at 1:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

This sounds like the best solution IMO. People can then experiment with
different settings and filesystems, and that way we also learn in the
process. Thank you for the effort and patches so far.

OK, here's a basic patch to experiment with. You can set:

file_extend_method = fallocate,ftruncate,write
file_extend_method_threshold = 8 # (below 8 always write, 0 means never
write)

I applied the patch on PostgreSQL v17 and am testing it now. I chose
ftruncate method and I see ftruncate in action using strace while doing
pg_restore of a big database. Nothing unexpected has happened so far. I also
verified that files are being compressed, obeying Btrfs's mount option
compress=zstd.

Thanks for the patch! What are the odds of committing it to v17?

Ping. :-)
The patch behaves well for me. Any chance of applying it and backporting it?

Dimitris

#16Thomas Munro
thomas.munro@gmail.com
In reply to: Dimitrios Apostolou (#15)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Fri, Jul 11, 2025 at 5:39 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

I applied the patch on PostgreSQL v17 and am testing it now. I chose
ftruncate method and I see ftruncate in action using strace while doing
pg_restore of a big database. Nothing unexpected has happened so far. I also
verified that files are being compressed, obeying Btrfs's mount option
compress=zstd.

Thanks for the patch! What are the odds of committing it to v17?

Ping. :-)
The patch behaves well for me. Any chance of applying it and backporting it?

Yeah, this seems to make sense, as it is a pretty bad regression for
people who are counting on BTRFS compression for their large database.
Not so sure about the threshold bit -- I'd probably leave that out of
the backport in the interest of stable branch-minimalism. Anyone have
any better ideas, better naming, or objections?

#17Dimitrios Apostolou
jimis@gmx.net
In reply to: Thomas Munro (#16)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Friday 2025-07-11 00:45, Thomas Munro wrote:

On Fri, Jul 11, 2025 at 5:39 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

I applied the patch on PostgreSQL v17 and am testing it now. I chose
ftruncate method and I see ftruncate in action using strace while doing
pg_restore of a big database. Nothing unexpected has happened so far. I also
verified that files are being compressed, obeying Btrfs's mount option
compress=zstd.

Thanks for the patch! What are the odds of committing it to v17?

Ping. :-)
The patch behaves well for me. Any chance of applying it and backporting it?

Yeah, this seems to make sense, as it is a pretty bad regression for
people who are counting on BTRFS compression for their large database.
Not so sure about the threshold bit -- I'd probably leave that out of
the backport in the interest of stable branch-minimalism. Anyone have
any better ideas, better naming, or objections?

What is the right process to not lose track of this? Should I create a
commitfest entry? Should I keep pinging every couple of weeks? Or is the
patch queued somewhere and I have to wait patiently? If July commitfest
passes, could it miss the next release?

Please forgive my ignorance, but I'm lost with respect to the postgresql
development process. I also have some patches or suggestions of my own
that struggle to get feedback, so I'd appreciate any tips regarding the
development process.

Thank you,
Dimitris

#18Magnus Hagander
magnus@hagander.net
In reply to: Thomas Munro (#16)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Fri, Jul 11, 2025 at 12:45 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

On Fri, Jul 11, 2025 at 5:39 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

I applied the patch on PostgreSQL v17 and am testing it now. I chose
ftruncate method and I see ftruncate in action using strace while doing
pg_restore of a big database. Nothing unexpected has happened so far. I also
verified that files are being compressed, obeying Btrfs's mount option
compress=zstd.

Thanks for the patch! What are the odds of committing it to v17?

Ping. :-)
The patch behaves well for me. Any chance of applying it and backporting it?

Yeah, this seems to make sense, as it is a pretty bad regression for
people who are counting on BTRFS compression for their large database.
Not so sure about the threshold bit -- I'd probably leave that out of
the backport in the interest of stable branch-minimalism. Anyone have
any better ideas, better naming, or objections?

Not just to throw a wrench in there, but... Should this perhaps be a
tablespace option? ISTM having different filesystems for them is a good
reason to use tablespaces in the first place, and then being able to pick
different options...

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

#19Thomas Munro
thomas.munro@gmail.com
In reply to: Magnus Hagander (#18)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Tue, Jul 29, 2025 at 6:52 PM Magnus Hagander <magnus@hagander.net> wrote:

Not just to throw a wrench in there, but... Should this perhaps be a tablespace option? ISTM having different filesystems for them is a good reason to use tablespaces in the first place, and then being able to pick different options...

We discussed that a bit earlier in the thread. Some problems about
layering violations and general weirdness, I recall trying it even.
On the flip side, is it right to declare very local
filesystem-specific choices in a system catalogue that is replicated
and affects replicas?
What about a fancier GUC that can reference tablespaces?

#20Magnus Hagander
magnus@hagander.net
In reply to: Thomas Munro (#19)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Tue, Aug 5, 2025 at 3:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Jul 29, 2025 at 6:52 PM Magnus Hagander <magnus@hagander.net>
wrote:

Not just to throw a wrench in there, but... Should this perhaps be a
tablespace option? ISTM having different filesystems for them is a good
reason to use tablespaces in the first place, and then being able to pick
different options...

We discussed that a bit earlier in the thread. Some problems about
layering violations and general weirdness, I recall trying it even.
On the flip side, is it right to declare very local
filesystem-specific choices in a system catalogue that is replicated
and affects replicas?
What about a fancier GUC that can reference tablespaces?

Wouldn't that be something that applies to *all* the tablespace configs
then, that is, a proper movement of the goalposts? :) Such as being able to
set random_page_cost per tablespace to different values on different
machines. I agree that it would be useful though. But it seems like a
different patch, if useful, and one that should be generic?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Magnus Hagander (#20)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Fri, Aug 8, 2025 at 1:38 AM Magnus Hagander <magnus@hagander.net> wrote:

On Tue, Aug 5, 2025 at 3:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

We discussed that a bit earlier in the thread. Some problems about
layering violations and general weirdness, I recall trying it even.
On the flip side, is it right to declare very local
filesystem-specific choices in a system catalogue that is replicated
and affects replicas?
What about a fancier GUC that can reference tablespaces?

Wouldn't that be something that applies to *all* the tablespace configs then, that is, a proper movement of the goalposts? :) Such as being able to set random_page_cost per tablespace to different values on different machines. I agree that it would be useful though. But it seems like a different patch, if useful, and one that should be generic?

Yeah. And while we're talking pie-in-the-sky future features,
full_page_writes is also describing a property of a particular
server's file system and/or hardware for a given tablespace. Can't do
much about that today, as it can only be decided by the primary node
that must log full pages or not, but its potential replacement
"atomic_double_write" (as I call it) *can* be chosen on a per-server
basis in a replication chain. We could probably have done that
independently, but it gets easier with new infrastructure for
streaming large asynchronous combined writes...

To solve Dimitrios's real production issue, I am planning to proceed
with the simple whole-system GUC(s) already posted, after I've done
some light testing on ZFS (which has similar design constraints though
makes different choices) and thought a bit harder about the
Windows/NTFS situation. I'll post a new version before pushing
anything. My plan is to have this in the next minor release, unless
the upcoming 18 release forces me to delay it until the one after.

Another thing I noticed is that macOS has its own funky way[1] of
preallocating disk space that looks plausibly relevant. Not
investigated and not planning to work on that myself necessarily but
it might be worth thinking for a moment about the GUC future-proofing
implications.

[1]: https://github.com/libgit2/libgit2/commit/bd132046b04875f928e52d16363fb73f8e85dded

#22Dimitrios Apostolou
jimis@gmx.net
In reply to: Thomas Munro (#21)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

Sorry to ping again, but was there a conclusion reached regarding adding the new file_extend_method setting?

#23Thomas Munro
thomas.munro@gmail.com
In reply to: Dimitrios Apostolou (#22)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Wed, Oct 29, 2025 at 4:31 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

Sorry to ping again, but was there a conclusion reached regarding adding the new file_extend_method setting?

No objections appeared, so the conclusion I am drawing is that we
should do this, and back-patch it into 17 for the upcoming release.
It is working as expected on my ZFS system in light testing. Rebasing
and figuring out where to add the missing documentation for last
chance review...

#24Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Thomas Munro (#23)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Thu, Oct 30, 2025 at 4:56 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Oct 29, 2025 at 4:31 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

Sorry to ping again, but was there a conclusion reached regarding adding the new file_extend_method setting?

No objections appeared, so the conclusion I am drawing is that we
should do this,

Hi Thomas,

+1 to these GUCs, as this would also help the nearby thread with XFS
mysteries which are not fully solved [1]. Since the latest message in
that discussion, I'm aware of at least one additional report of XFS
failing at fallocate() with free space too, but without any details
from the OS support vendor why that happened, so this $patch could
also be used to work around that problem.

Just nitpicking:

and back-patch it into 17 for the upcoming release.
It is working as expected on my ZFS system in light testing. Rebasing
and figuring out where to add the missing documentation for last
chance review...

Why just 17? (wasn't fallocate() introduced in 16? 4d330a61bb19 and
31966b151e6ab are from Apr 2023, while 16 was released on Sep 2023)

From other things, I was wondering about this:

PGC_USERSET

QQ: Do we really want to have those two GUCs to be alterable like that
by anyone? The alternative would be like let's say PGC_SIGHUP? (on one
end it's flexible, but are there any downsides to this as it stands
out in 0001?). I've checked others and io_workers is PGC_SIGHUP
(understandable), but we also have io_combine_limit &&
effective_io_concurrency with PGC_USERSET. I'm just wondering if it
would be sane to have one backend doing I/O with fallocate() and other
just writing using pwrite(). One could argue you could be writing to
two different filesystems with two different users...

-J.

[1]: /messages/by-id/CADofcAV8xu3hCNHq7-7x56KrP9rD6=A04=qjTr3nETh-gptF8w@mail.gmail.com

#25Bruce Momjian
bruce@momjian.us
In reply to: Jakub Wartak (#24)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Thu, Oct 30, 2025 at 11:14:07AM +0100, Jakub Wartak wrote:

On Thu, Oct 30, 2025 at 4:56 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Oct 29, 2025 at 4:31 AM Dimitrios Apostolou <jimis@gmx.net> wrote:

Sorry to ping again, but was there a conclusion reached regarding adding the new file_extend_method setting?

No objections appeared, so the conclusion I am drawing is that we
should do this,

Hi Thomas,

+1 to these GUCs, as this would also help the nearby thread with XFS
mysteries which are not fully solved [1]. Since the latest message in
that discussion, I'm aware of at least one additional report of XFS
failing at fallocate() with free space too, but without any details
from the OS support vendor why that happened, so this $patch could
also be used to work around that problem.

Uh, the problem with backpatching new GUCs is that the GUC variable will
_not_ appear in any postgresql.conf file until a new initdb is run.
This can be quite confusing for people. The minor release notes have to
explain this.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

#26Thomas Munro
thomas.munro@gmail.com
In reply to: Bruce Momjian (#25)
3 attachment(s)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

Here's a new version with some cleanup and documentation. I tried to
pare it down to the minimum change for the back-branches, keeping
unnecessary changes for master. In the process, I also thought a bit
about how to de-confuse matters on Windows, where the function we
call as ftruncate() behaves differently in a crucial respect. See
attached.

I'm proposing to back-patch 0001. 0002 and 0003 are proposals for master only.

See below for replies to separate messages from Jakub and Bruce.

On Thu, Oct 30, 2025 at 11:14 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

+1 to these GUCs, as this would also help the nearby thread with XFS
mysteries which are not fully solved [1]. Since the latest message in
that discussion, I'm aware of at least one additional report of XFS
failing at fallocate() with free space too, but without any details
from the OS support vendor why that happened, so this $patch could
also be used to work around that problem.

Yeah, that seems quite important, and the new report in pgsql-bugs
#19348 sounds like another case.

Just nitpicking:

and back-patch it into 17 for the upcoming release.
It is working as expected on my ZFS system in light testing. Rebasing
and figuring out where to add the missing documentation for last
chance review...

Why just 17? (wasn't fallocate() introduced in 16? 4d330a61bb19 and
31966b151e6ab are from Apr 2023, while 16 was released on Sep 2023)

Right, fixed.

From other things, I was wondering about this:

PGC_USERSET

QQ: Do we really want to have those two GUCs to be alterable like that
by anyone? The alternative would be like let's say PGC_SIGHUP? (on one
end it's flexible, but are there any downsides to this as it stands
out in 0001?). I've checked others and io_workers is PGC_SIGHUP
(understandable), but we also have io_combine_limit &&
effective_io_concurrency with PGC_USERSET. I'm just wondering if it
would be sane to have one backend doing I/O with fallocate() and other
just writing using pwrite(). One could argue you could be writing to
two different filesystems with two different users...

Yeah. Let's go with PGC_SIGHUP. Let's worry about multiple
filesystems when we've figured out how to do per-tablespace settings.

This is vapourware for later, but I've been wondering if we could
invent a sysctl-style hierarchy as a scoping mechanism, something
like:

tablespace.foo.random_page_cost=1
tablespace.foo.file_extend_method=ftruncate
tablespace.foo.io_combine_limit=1MB

Obviously there are some name resolution problems with that. I also
thought about allowing a new kind of configuration file inside
tablespace directories, but that doesn't work for PGC_USERSET stuff
like random_page_cost. If the hierarchy idea goes somewhere, it might
also allow a reorganisation like [tablespace.foo.]io.combine_limit,
with legacy long names like io_combine_limit still supported, but
that's getting quite far off topic...
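
To make the scoping idea above a little more concrete, here is a minimal sketch of how a name such as tablespace.foo.random_page_cost might be split into a scope and a setting name. Everything in it is hypothetical: the struct, the function name, the 64-byte limits and the rule of splitting at the first dot after the prefix are assumptions made for illustration, not anything proposed in the patches. It also sidesteps the name resolution problems mentioned above (for example, tablespace names that themselves contain dots).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type: an empty tablespace means "global scope". */
struct scoped_setting
{
	char		tablespace[64];
	char		name[64];
};

/*
 * Split "tablespace.<ts>.<setting>" into its parts; plain names such as
 * "io_combine_limit" are returned with an empty tablespace.  Returns false
 * for malformed or over-long input.  Assumption: the tablespace name ends
 * at the first dot after the prefix.
 */
static bool
parse_scoped_name(const char *key, struct scoped_setting *out)
{
	static const char prefix[] = "tablespace.";
	const size_t prefix_len = sizeof(prefix) - 1;
	const char *ts;
	const char *dot;
	size_t		ts_len;

	assert(key != NULL && out != NULL);
	out->tablespace[0] = '\0';
	out->name[0] = '\0';

	if (strncmp(key, prefix, prefix_len) != 0)
	{
		/* Unscoped legacy name. */
		if (key[0] == '\0' || strlen(key) >= sizeof(out->name))
			return false;
		strcpy(out->name, key);
		return true;
	}

	ts = key + prefix_len;
	dot = strchr(ts, '.');
	if (dot == NULL || dot == ts || dot[1] == '\0')
		return false;			/* missing tablespace or setting name */

	ts_len = (size_t) (dot - ts);
	if (ts_len >= sizeof(out->tablespace) ||
		strlen(dot + 1) >= sizeof(out->name))
		return false;

	memcpy(out->tablespace, ts, ts_len);
	out->tablespace[ts_len] = '\0';
	strcpy(out->name, dot + 1);
	return true;
}
```

With this sketch, "tablespace.foo.file_extend_method" yields tablespace "foo" and name "file_extend_method", while "tablespace.foo" on its own is rejected because no setting name follows.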

On Fri, Oct 31, 2025 at 5:59 AM Bruce Momjian <bruce@momjian.us> wrote:

Uh, the problem with backpatching new GUCs is that the GUC variable will
_not_ appear in any postgresql.conf file until a new initdb is run.
This can be quite confusing for people. The minor release notes have to
explain this.

Yeah. Fortunately the vast majority of users won't ever need to know
about this. Those who run into a problem should hopefully find their
way to the docs, release notes, settings view, these threads, or write
to us? Any other way of controlling this that we invent to avoid
back-patching a GUC would surely only be harder to find than a new
GUC, I think? And I don't think we're anywhere near the level of
needing to revert the posix_fallocate() feature: both reported
problems are rare. (Though there is a lesson here in terms of
off-switch planning.)

Here's my attempt at a release note:

"The new setting file_extend_method can be set to write_zeros to
disable the use of the posix_fallocate() system call when extending
relation files. This is a workaround for users of BTRFS compression,
reported to be disabled by posix_fallocate(), and some versions of
XFS, reported to fail with spurious ENOSPC errors under some
workloads."
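
For readers who want to see what the two modes named in the release note boil down to at the system call level, here is a rough standalone sketch. It is not the code from the patch: the function names and the BLOCK_SIZE constant are made up for illustration (8192 is assumed to match the default BLCKSZ), and it writes one block per pwrite() call, whereas the patch's write_zeros mode writes zeros for multiple blocks at a time.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192			/* assumption: PostgreSQL's default BLCKSZ */

/*
 * Extend by asking the file system to reserve space.  posix_fallocate()
 * returns 0 on success or an error number directly (it does not set errno).
 */
static int
extend_with_fallocate(int fd, off_t offset, int nblocks)
{
	assert(nblocks > 0);
	return posix_fallocate(fd, offset, (off_t) BLOCK_SIZE * nblocks);
}

/*
 * Extend by writing explicit blocks of zeros, one block per pwrite() call.
 * Returns 0 on success or an error number.  Reporting a short write as
 * ENOSPC is a simplification made for this sketch.
 */
static int
extend_with_zeros(int fd, off_t offset, int nblocks)
{
	static const char zeros[BLOCK_SIZE];	/* zero-initialized */

	assert(nblocks > 0);
	for (int i = 0; i < nblocks; i++)
	{
		ssize_t		written = pwrite(fd, zeros, BLOCK_SIZE,
									 offset + (off_t) i * BLOCK_SIZE);

		if (written < 0)
			return errno;
		if (written != BLOCK_SIZE)
			return ENOSPC;
	}
	return 0;
}
```

Both helpers leave the file the same length. The difference reported in this thread is in how the file system treats the new range afterwards: per the reports above, Btrfs stops compressing ranges that were preallocated, while ranges that were simply written are compressed as usual.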

Attachments:

v2-0001-Add-file_extend_method-posix_fallocate-write_zero.patch (text/x-patch; charset=US-ASCII)
From 58ec33550147e324e5a6a8793c8e502b9e7065f2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 31 May 2025 22:50:22 +1200
Subject: [PATCH v2 1/3] Add file_extend_method=posix_fallocate,write_zeros.

Provide a way to disable the use of posix_fallocate() for relation
files.  It was introduced by commit 4d330a61bb1.  The new setting
file_extend_method=write_zeros can be used as a workaround for problems
reported from the field:

 * BTRFS compression is disabled by the use of posix_fallocate()
 * XFS users have reported a few cases of spurious ENOSPC that haven't
   been explained yet

The default is file_extend_method=posix_fallocate as before.  The new
mode is similar to PostgreSQL < 16, except that bulk extension writes
zeros for multiple blocks at a time.

Backpatch-through: 16
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Discussion: https://postgr.es/m/b1843124-fd22-e279-a31f-252dffb6fbf2%40gmx.net
---
 doc/src/sgml/config.sgml                      | 37 +++++++++++++++++++
 src/backend/storage/file/fd.c                 |  3 ++
 src/backend/storage/smgr/md.c                 | 21 ++++++++---
 src/backend/utils/misc/guc_parameters.dat     |  7 ++++
 src/backend/utils/misc/guc_tables.c           |  9 +++++
 src/backend/utils/misc/postgresql.conf.sample |  4 ++
 src/include/storage/fd.h                      | 11 ++++++
 7 files changed, 87 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 405c9689bd0..0b4922b35c4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2410,6 +2410,43 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-file-extend-method" xreflabel="file_extend_method">
+      <term><varname>file_extend_method</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>file_extend_method</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the method used to extend data files during bulk operations
+        such as <command>COPY</command>.  The first available option is used as
+        the default, depending on the operating system:
+        <itemizedlist>
+         <listitem>
+          <para>
+           <literal>posix_fallocate</literal> (Unix) uses the standard POSIX
+            interface for allocating disk space, but is missing on some systems.
+            If it is present but the underlying file system doesn't support it,
+            this option silently falls back to <literal>write_zeros</literal>.
+            Current versions of BTRFS are known to disable compression when
+            this option is used.
+            This is the default on systems that have the function.
+           </para>
+         </listitem>
+         <listitem>
+          <para>
+           <literal>write_zeros</literal> extends files by writing out blocks
+            of zero bytes.  This is the default on systems that don't have the
+            function <function>posix_fallocate</function>.
+          </para>
+         </listitem>
+        </itemizedlist>
+        The <literal>write_zeros</literal> method is always used when data
+        files are extended by 8 blocks or fewer.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-notify-queue-pages" xreflabel="max_notify_queue_pages">
       <term><varname>max_notify_queue_pages</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9670e809b72..a2fd55cc408 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -164,6 +164,9 @@ bool		data_sync_retry = false;
 /* How SyncDataDirectory() should do its job. */
 int			recovery_init_sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 
+/* How data files should be bulk-extended with zeros. */
+int			file_extend_method = DEFAULT_FILE_EXTEND_METHOD;
+
 /* Which kinds of files should be opened with PG_O_DIRECT. */
 int			io_direct_flags;
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 71bcdeb6601..df0aa20708d 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -602,13 +602,24 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		 * that decision should be made though? For now just use a cutoff of
 		 * 8, anything between 4 and 8 worked OK in some local testing.
 		 */
-		if (numblocks > 8)
+		if (numblocks > 8 &&
+			file_extend_method != FILE_EXTEND_METHOD_WRITE_ZEROS)
 		{
-			int			ret;
+			int			ret = 0;
 
-			ret = FileFallocate(v->mdfd_vfd,
-								seekpos, (pgoff_t) BLCKSZ * numblocks,
-								WAIT_EVENT_DATA_FILE_EXTEND);
+#ifdef HAVE_POSIX_FALLOCATE
+			if (file_extend_method == FILE_EXTEND_METHOD_POSIX_FALLOCATE)
+			{
+				ret = FileFallocate(v->mdfd_vfd,
+									seekpos, (pgoff_t) BLCKSZ * numblocks,
+									WAIT_EVENT_DATA_FILE_EXTEND);
+			}
+			else
+#endif
+			{
+				elog(ERROR, "unsupported file_extend_method: %d",
+					 file_extend_method);
+			}
 			if (ret != 0)
 			{
 				ereport(ERROR,
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3b9d8349078..220a092ef52 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1039,6 +1039,13 @@
   options => 'file_copy_method_options',
 },
 
+{ name => 'file_extend_method', type => 'enum', context => 'PGC_SIGHUP', group => 'RESOURCES_DISK',
+  short_desc => 'Selects the method used for extending data files.',
+  variable => 'file_extend_method',
+  boot_val => 'DEFAULT_FILE_EXTEND_METHOD',
+  options => 'file_extend_method_options',
+},
+
 { name => 'from_collapse_limit', type => 'int', context => 'PGC_USERSET', group => 'QUERY_TUNING_OTHER',
   short_desc => 'Sets the FROM-list size beyond which subqueries are not collapsed.',
   long_desc => 'The planner will merge subqueries into upper queries if the resulting FROM list would have no more than this many items.',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f87b558c2c6..6c65a47a88d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -80,6 +80,7 @@
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
 #include "storage/copydir.h"
+#include "storage/fd.h"
 #include "storage/io_worker.h"
 #include "storage/large_object.h"
 #include "storage/pg_shmem.h"
@@ -491,6 +492,14 @@ static const struct config_enum_entry file_copy_method_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry file_extend_method_options[] = {
+#ifdef HAVE_POSIX_FALLOCATE
+	{"posix_fallocate", FILE_EXTEND_METHOD_POSIX_FALLOCATE, false},
+#endif
+	{"write_zeros", FILE_EXTEND_METHOD_WRITE_ZEROS, false},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f8a..753a42e8ca5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -179,6 +179,10 @@
                                         # in kilobytes, or -1 for no limit
 
 #file_copy_method = copy                # copy, clone (if supported by OS)
+#file_extend_method = posix_fallocate   # the default is the first option supported
+                                        # by the operating system:
+                                        #   posix_fallocate (most Unix-like systems)
+                                        #   write_zeros
 
 #max_notify_queue_pages = 1048576       # limits the number of SLRU pages allocated
                                         # for NOTIFY / LISTEN queue
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index a8b0c9b3997..f21ac4545a8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -55,12 +55,23 @@ typedef int File;
 #define IO_DIRECT_WAL			0x02
 #define IO_DIRECT_WAL_INIT		0x04
 
+enum FileExtendMethod
+{
+#ifdef HAVE_POSIX_FALLOCATE
+	FILE_EXTEND_METHOD_POSIX_FALLOCATE,
+#endif
+	FILE_EXTEND_METHOD_WRITE_ZEROS,
+};
+
+/* Default to the first available file_extend_method. */
+#define DEFAULT_FILE_EXTEND_METHOD 0
 
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
 extern PGDLLIMPORT int recovery_init_sync_method;
 extern PGDLLIMPORT int io_direct_flags;
+extern PGDLLIMPORT int file_extend_method;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
-- 
2.51.2

v2-0002-Add-file_extend_method_threshold-setting.patch (text/x-patch; charset=US-ASCII)
From c5b1fd2fdcf41de11d2701602d3e243df9bbb049 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 15 Dec 2025 16:16:23 +1300
Subject: [PATCH v2 2/3] Add file_extend_method_threshold setting.

Previously, write_zeros behavior was used at or below a hard-coded
extension size of 8, based on tests with common Linux file systems.
Make it user-adjustable, to allow testing on other systems.

Discussion: https://postgr.es/m/b1843124-fd22-e279-a31f-252dffb6fbf2%40gmx.net
---
 doc/src/sgml/config.sgml                      | 21 ++++++++++++++++++-
 src/backend/storage/file/fd.c                 |  3 +++
 src/backend/storage/smgr/md.c                 |  6 ++----
 src/backend/utils/misc/guc_parameters.dat     |  8 +++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/storage/fd.h                      |  8 +++++++
 6 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0b4922b35c4..5a298646100 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2442,7 +2442,26 @@ include_dir 'conf.d'
          </listitem>
         </itemizedlist>
         The <literal>write_zeros</literal> method is always used when data
-        files are extended by 8 blocks or fewer.
+        files are extended by <literal>file_extend_method_threshold</literal>
+        or fewer blocks.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-file-extend-method-threshold" xreflabel="file_extend_method_threshold">
+      <term><varname>file_extend_method_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>file_extend_method_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <literal>posix_fallocate</literal> is known to interfere with
+        delayed allocation heuristics on some file systems, when the extension
+        size is small.  This setting specifies the size up to which
+        <literal>write_zeros</literal> is used, overriding the
+        <literal>file_extend_method</literal> setting.  The default is 8
+        blocks.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a2fd55cc408..7eb537ab15e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -167,6 +167,9 @@ int			recovery_init_sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 /* How data files should be bulk-extended with zeros. */
 int			file_extend_method = DEFAULT_FILE_EXTEND_METHOD;
 
+/* At what size file_extend_method is used instead of write_zeros. */
+int			file_extend_method_threshold = DEFAULT_FILE_EXTEND_METHOD_THRESHOLD;
+
 /* Which kinds of files should be opened with PG_O_DIRECT. */
 int			io_direct_flags;
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index df0aa20708d..f893687814b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -598,11 +598,9 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		 * to allocate page cache space for the extended pages.
 		 *
 		 * However, we don't use FileFallocate() for small extensions, as it
-		 * defeats delayed allocation on some filesystems. Not clear where
-		 * that decision should be made though? For now just use a cutoff of
-		 * 8, anything between 4 and 8 worked OK in some local testing.
+		 * defeats delayed allocation on some filesystems.
 		 */
-		if (numblocks > 8 &&
+		if (numblocks > file_extend_method_threshold &&
 			file_extend_method != FILE_EXTEND_METHOD_WRITE_ZEROS)
 		{
 			int			ret = 0;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 220a092ef52..964e107c7a5 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1046,6 +1046,14 @@
   options => 'file_extend_method_options',
 },
 
+{ name => 'file_extend_method_threshold', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_DISK',
+  short_desc => 'Specifies the extension size above which file_extend_method is used.',
+  variable => 'file_extend_method_threshold',
+  boot_val => 'DEFAULT_FILE_EXTEND_METHOD_THRESHOLD',
+  min => '1',
+  max => 'INT_MAX',
+},
+
 { name => 'from_collapse_limit', type => 'int', context => 'PGC_USERSET', group => 'QUERY_TUNING_OTHER',
   short_desc => 'Sets the FROM-list size beyond which subqueries are not collapsed.',
   long_desc => 'The planner will merge subqueries into upper queries if the resulting FROM list would have no more than this many items.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 753a42e8ca5..b745e31a38d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -179,6 +179,7 @@
                                         # in kilobytes, or -1 for no limit
 
 #file_copy_method = copy                # copy, clone (if supported by OS)
+#file_extend_method_threshold = 8       # size up to which write_zeros is used
 #file_extend_method = posix_fallocate   # the default is the first option supported
                                         # by the operating system:
                                         #   posix_fallocate (most Unix-like systems)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index f21ac4545a8..7074c3f118b 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -66,12 +66,20 @@ enum FileExtendMethod
 /* Default to the first available file_extend_method. */
 #define DEFAULT_FILE_EXTEND_METHOD 0
 
+/*
+ * Values 4-8 were experimentally determined to avoid interference between
+ * posix_fallocate() and delayed allocation on common Linux file systems, but
+ * other systems might vary.
+ */
+#define DEFAULT_FILE_EXTEND_METHOD_THRESHOLD 8
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
 extern PGDLLIMPORT int recovery_init_sync_method;
 extern PGDLLIMPORT int io_direct_flags;
 extern PGDLLIMPORT int file_extend_method;
+extern PGDLLIMPORT int file_extend_method_threshold;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
-- 
2.51.2
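
The decision that 0001 and 0002 together put into mdzeroextend() can be restated as a small standalone predicate. This is only a sketch for discussion: the enum and function names are hypothetical stand-ins for the patch's FILE_EXTEND_METHOD_* values and the inline condition, and it leaves out the error path for unsupported methods.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the patch's FILE_EXTEND_METHOD_* values. */
enum extend_method
{
	EXTEND_POSIX_FALLOCATE,
	EXTEND_WRITE_ZEROS
};

/*
 * Returns true when an extension of numblocks should go through the
 * configured bulk method, and false when it should be written out as
 * zeros.  Extensions at or below the threshold always use zeros.
 */
static bool
use_configured_method(int numblocks, int threshold, enum extend_method method)
{
	assert(threshold >= 1);		/* the proposed GUC has min => 1 */
	return numblocks > threshold && method != EXTEND_WRITE_ZEROS;
}
```

With the default threshold of 8, an extension of exactly 8 blocks is still written as zeros, and only 9 blocks or more reach posix_fallocate().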

v2-0003-Add-file_extend_method-ftruncate-chsize-options.patch (text/x-patch; charset=US-ASCII)
From cfeffb032d96e96016e5b840b07bcf8c04262860 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 15 Dec 2025 16:39:56 +1300
Subject: [PATCH v2 3/3] Add file_extend_method=ftruncate,chsize options.

Since COW file systems can't reserve space for future writes by any
means, provide an alternative that should at least be more efficient.
At least it delays kernel buffer allocation and skips copying zeros
around, like posix_fallocate.

"ftruncate" isn't a concept on Windows, so provide a different
surface-level option "chsize".  It actually differs in a crucially
relevant way on the most common file system NTFS: it reserves disk
blocks immediately rather than creating a sparse file.  On the other
hand, it surely can't do that on ReFS, so it seems inappropriate to
pretend that Windows has "posix_fallocate".  Exposing the true
operation's name makes it the user's problem to figure out what the
filesystem does when we call it.

Tested-by: Dimitrios Apostolou <jimis@gmx.net>
Discussion: https://postgr.es/m/b1843124-fd22-e279-a31f-252dffb6fbf2%40gmx.net
---
 doc/src/sgml/config.sgml                      | 20 ++++++++++++++
 src/backend/storage/smgr/md.c                 | 26 ++++++++++++-------
 src/backend/utils/misc/guc_tables.c           |  1 +
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/fd.h                      | 25 ++++++++++++++++++
 5 files changed, 65 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5a298646100..ff8b66f52cf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2440,6 +2440,26 @@ include_dir 'conf.d'
             function <function>posix_fallocate</function>.
           </para>
          </listitem>
+         <listitem>
+          <para>
+           <literal>ftruncate</literal> (Unix) extends files without
+           allocating space.  Out-of-space errors are deferred until PostgreSQL
+           writes data out later, potentially preventing checkpoints from
+           completing, so it is not recommended for traditional "overwrite"
+           file systems.  It is provided as an option for copy-on-write file
+           systems where <literal>posix_fallocate</literal> and
+           <literal>write_zeros</literal> can't reserve space eagerly, and
+           <literal>ftruncate</literal> might be more efficient.
+          </para>
+         </listitem>
+         <listitem>
+          <para>
+           <literal>chsize</literal> (Windows) allocates space and reports
+           out-of-space errors immediately on NTFS (like
+           <literal>posix_fallocate</literal>), but defers allocation on
+           ReFS (like <literal>ftruncate</literal>).
+          </para>
+         </listitem>
         </itemizedlist>
         The <literal>write_zeros</literal> method is always used when data
         files are extended by <literal>file_extend_method_threshold</literal>
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f893687814b..b65cd308fd3 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -595,7 +595,12 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		 * If available and useful, use posix_fallocate() (via
 		 * FileFallocate()) to extend the relation. That's often more
 		 * efficient than using write(), as it commonly won't cause the kernel
-		 * to allocate page cache space for the extended pages.
+		 * to allocate page cache space for the extended pages. COW
+		 * filesystems can't really reserve disk space for future writeback
+		 * (possibly moving the ENOSPC error into the checkpointer), but
+		 * ftruncate() can still be used to defer the kernel cache
+		 * overheads until then.  Note that on Windows, ftruncate() is really
+		 * _chsize_s(), which *does* allocate blocks, at least on NTFS.
 		 *
 		 * However, we don't use FileFallocate() for small extensions, as it
 		 * defeats delayed allocation on some filesystems.
@@ -605,25 +610,28 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		{
 			int			ret = 0;
 
+			if (file_extend_method == FILE_EXTEND_METHOD_FTRUNCATE)
+				ret = FileTruncate(v->mdfd_vfd,
+								   seekpos + (pgoff_t) BLCKSZ * numblocks,
+								   WAIT_EVENT_DATA_FILE_EXTEND);
 #ifdef HAVE_POSIX_FALLOCATE
-			if (file_extend_method == FILE_EXTEND_METHOD_POSIX_FALLOCATE)
-			{
+			else if (file_extend_method == FILE_EXTEND_METHOD_POSIX_FALLOCATE)
 				ret = FileFallocate(v->mdfd_vfd,
 									seekpos, (pgoff_t) BLCKSZ * numblocks,
 									WAIT_EVENT_DATA_FILE_EXTEND);
-			}
-			else
 #endif
-			{
+			else
 				elog(ERROR, "unsupported file_extend_method: %d",
 					 file_extend_method);
-			}
+
 			if (ret != 0)
 			{
 				ereport(ERROR,
 						errcode_for_file_access(),
-						errmsg("could not extend file \"%s\" with FileFallocate(): %m",
-							   FilePathName(v->mdfd_vfd)),
+						errmsg("could not extend file \"%s\" with %s(): %m",
+							   FilePathName(v->mdfd_vfd),
+							   file_extend_method == FILE_EXTEND_METHOD_FTRUNCATE ?
+							   "FileTruncate" : "FileFallocate"),
 						errhint("Check free disk space."));
 			}
 		}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6c65a47a88d..63712c9e465 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -497,6 +497,7 @@ static const struct config_enum_entry file_extend_method_options[] = {
 	{"posix_fallocate", FILE_EXTEND_METHOD_POSIX_FALLOCATE, false},
 #endif
 	{"write_zeros", FILE_EXTEND_METHOD_WRITE_ZEROS, false},
+	{FILE_EXTEND_METHOD_FTRUNCATE_NAME, FILE_EXTEND_METHOD_FTRUNCATE, false},
 	{NULL, 0, false}
 };
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b745e31a38d..18ed8a6a549 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -184,6 +184,8 @@
                                         # by the operating system:
                                         #   posix_fallocate (most Unix-like systems)
                                         #   write_zeros
+                                        #   ftruncate (Unix)
+                                        #   chsize (Windows)
 
 #max_notify_queue_pages = 1048576       # limits the number of SLRU pages allocated
                                         # for NOTIFY / LISTEN queue
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7074c3f118b..bb1729a41d1 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -61,11 +61,36 @@ enum FileExtendMethod
 	FILE_EXTEND_METHOD_POSIX_FALLOCATE,
 #endif
 	FILE_EXTEND_METHOD_WRITE_ZEROS,
+	FILE_EXTEND_METHOD_FTRUNCATE,
 };
 
 /* Default to the first available file_extend_method. */
 #define DEFAULT_FILE_EXTEND_METHOD 0
 
+#ifdef WIN32
+
+ /*
+  * Even though file_extend_method=chsize uses the same code path as
+  * file_extend_method=ftruncate, our ftruncate() macro for Windows expands to
+  * _chsize_s(), whose filesystem-dependent behavior might not match
+  * ftruncate() in a relevant way:
+  *
+  * 1.  NTFS allocates physical blocks so that overwriting them later can't
+  * fail with ENOSPC.  It would be confusing and misleading to label it
+  * "ftruncate", as it sounds like a recipe for sparse files.
+  *
+  * 2.  ReFS doesn't, being a COW system, and nor is allocation in the
+  * function's contract, so it would also be misleading to label it
+  * "posix_fallocate".
+  *
+  * We don't know what the file system does, and Unix terminology would only
+  * obfuscate matters, so we expose the name of the real OS function.
+  */
+#define FILE_EXTEND_METHOD_FTRUNCATE_NAME "chsize"
+#else
+#define FILE_EXTEND_METHOD_FTRUNCATE_NAME "ftruncate"
+#endif
+
 /*
  * Values 4-8 were experimentally determined to avoid interference between
  * posix_fallocate() and delayed allocation on common Linux file systems, but
-- 
2.51.2

#27 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Thomas Munro (#26)
Re: [PING] fallocate() causes btrfs to never compress postgresql files

On Mon, Dec 15, 2025 at 7:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Here's a new version with some cleanup and documentation. I tried to
pare it down to the minimum change for the back-branches, keeping
unnecessary changes for master. In the process, I also thought a bit
about how to de-confuse matters on Windows, where the function we
call as ftruncate() behaves differently in a crucial respect. See
attached.

I'm proposing to back-patch 0001. 0002 and 0003 are proposals for master only.

Hi Thomas,

Thanks for working on this. I have reviewed and played a little with
them and they are in very good shape, so +1 from my side. Just a couple
of minor things:

1. 0001: I would add another Discussion link to the commit
message (/messages/by-id/CADofcAV8xu3hCNHq7-7x56KrP9rD6=A04=qjTr3nETh-gptF8w@mail.gmail.com
- the XFS thread).
2. I've tested those lightly and they pass my local build and tests. Just a
non-actionable observation from my side: I'm not sure how useful
v2-0002 (the new file_extend_method_threshold) is going to be in
real life. To me it sounds like it could be debug_file_extend*,
although that would break the convention of using just file_extend.
3. I haven't tested 0003 as it is for Windows. We could probably add
it to cfbot, so that it would tell us something more there.

See below for replies to separate messages from Jakub and Bruce.

On Thu, Oct 30, 2025 at 11:14 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

+1 to these GUCs, as this would also help the nearby thread with XFS
mysteries which are not fully solved [1]. Since the latest message in
that discussion, I'm aware of at least one additional report of XFS
failing at fallocate() with free space too, but without any details
from the OS support vendor about why that happened, so this $patch could
also be used to work around that problem.

Yeah, that seems quite important, and the new report in pgsql-bugs
#19348 sounds like another case.

Right, I think we've got another report internally too since we last
talked, but contact went silent after being redirected to the OS
vendor (after some recommended workarounds did not work for them, but
those worked for others).

Why just 17? (wasn't fallocate() introduced in 16? 4d330a61bb19 and
31966b151e6ab are from Apr 2023, while 16 was released in Sep 2023)

Right, fixed.

Cool, thanks.

Yeah. Let's go with PGC_SIGHUP. Let's worry about multiple
filesystems when we've figured out how to do per-tablespace settings.

Cool, thanks.

This is vapourware for later, but I've been wondering if we could
invent a sysctl-style hierarchy as a scoping mechanism, something
like:

tablespace.foo.random_page_cost=1
tablespace.foo.file_extend_method=ftruncate
tablespace.foo.io_combine_limit=1MB

This looks more like sysfs than sysctl (as foo is the tablespace name?)
:^). Anyway, I think that 0001 should go in, and then a new thread could
be started for this if you want (as this would conflict a little with
stuff we already have, e.g. alter tablespace pg_default set
(maintenance_io_concurrency=XXX), but it is highly unlikely that anybody
uses '\db+' in psql to see those options there).

-J.
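
For readers following the file_extend_method discussion, here is a minimal
sketch of the two extension strategies being compared. It is not code from
the patch: demo_extend() and the DEMO_EXTEND_* names are hypothetical, the
/tmp path is an assumption, and it only shows the plain POSIX calls
ftruncate() and posix_fallocate() on a scratch file.

```c
#define _POSIX_C_SOURCE 200809L	/* expose ftruncate(), posix_fallocate(), mkstemp() */

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names for this sketch; not PostgreSQL's FileExtendMethod. */
enum demo_extend_method
{
	DEMO_EXTEND_FTRUNCATE,
	DEMO_EXTEND_POSIX_FALLOCATE
};

/*
 * Create a scratch file, extend it to "size" bytes with the chosen method,
 * and return its fstat() result in *st.  Returns 0 on success, -1 on failure.
 */
static int
demo_extend(enum demo_extend_method method, off_t size, struct stat *st)
{
	char		path[] = "/tmp/extend_demo_XXXXXX";	/* assumed scratch location */
	int			fd = mkstemp(path);
	int			ret;

	if (fd < 0)
		return -1;
	unlink(path);				/* the file goes away once fd is closed */

	if (method == DEMO_EXTEND_FTRUNCATE)
		ret = ftruncate(fd, size);	/* only changes the file size; -1 on error */
	else
		ret = posix_fallocate(fd, 0, size);	/* reserves space; returns an
											 * error number, not -1 */

	if (ret == 0 && fstat(fd, st) != 0)
		ret = -1;
	close(fd);
	return ret == 0 ? 0 : -1;
}
```

Calling demo_extend() once per method and comparing st_blocks (512-byte
units on Linux) with st_size shows the difference: ftruncate() typically
leaves a sparse file, while posix_fallocate() typically reports blocks
covering the whole range. The exact numbers depend on the filesystem.
Per the start of this thread, it is those preallocated ranges that btrfs
declines to compress.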