pgsql: Add smgrzeroextend(), FileZero(), FileFallocate()

Started by Andres Freund · almost 3 years ago · 20 messages
#1 Andres Freund <andres@anarazel.de>

Add smgrzeroextend(), FileZero(), FileFallocate()

smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple
blocks. When extending by a small number of blocks, use FileZero() instead, as
using posix_fallocate() for small numbers of blocks is inefficient for some
file systems / operating systems. FileZero() is also used as the fallback for
FileFallocate() on platforms / filesystems that don't support fallocate.

A big advantage of using posix_fallocate() is that it typically won't cause
dirty buffers in the kernel pagecache. So far the most common pattern in our
code is that we smgrextend() a page full of zeroes and put the corresponding
page into shared buffers, from where we later write out the actual contents of
the page. If the kernel, e.g. due to memory pressure or elapsed time, already
wrote back the all-zeroes page, this can lead to doubling the amount of writes
reaching storage.

There are no users of smgrzeroextend() as of this commit. That will follow in
future commits.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: /messages/by-id/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/4d330a61bb1969df31f2cebfe1ba9d1d004346d8

Modified Files
--------------
src/backend/storage/file/fd.c | 88 ++++++++++++++++++++++++++++++++
src/backend/storage/smgr/md.c | 108 ++++++++++++++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 28 +++++++++++
src/include/storage/fd.h | 3 ++
src/include/storage/md.h | 2 +
src/include/storage/smgr.h | 2 +
6 files changed, 231 insertions(+)
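
To make the strategy described in the commit message concrete, here is a
minimal standalone sketch (an editorial illustration only, not the actual
md.c/fd.c code; the helper names, the 8-block threshold, and the error
handling are simplified assumptions): use posix_fallocate() when adding many
blocks at once, and fall back to writing zeroes for small extensions or on
filesystems that don't support fallocate.

#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Extend by writing out zero-filled blocks with a single pwrite(). */
static int
zero_extend_write(int fd, off_t offset, int nblocks)
{
	char	   *buf = calloc(nblocks, BLCKSZ);
	ssize_t		written;

	if (buf == NULL)
		return -1;
	written = pwrite(fd, buf, (size_t) nblocks * BLCKSZ, offset);
	free(buf);
	return (written == (ssize_t) nblocks * BLCKSZ) ? 0 : -1;
}

/*
 * Extend a file by nblocks blocks of zeroes starting at offset.  For larger
 * extensions prefer posix_fallocate(), which typically does not dirty the
 * kernel page cache; for small ones, or when fallocate is unsupported,
 * write zeroes instead.
 */
static int
zero_extend(int fd, off_t offset, int nblocks)
{
	if (nblocks > 8)			/* illustrative threshold */
	{
		int			rc = posix_fallocate(fd, offset, (off_t) nblocks * BLCKSZ);

		if (rc == 0)
			return 0;
		if (rc != EINVAL && rc != EOPNOTSUPP)
		{
			errno = rc;			/* posix_fallocate() returns the error code */
			return -1;
		}
		/* fall through: filesystem does not support fallocate */
	}
	return zero_extend_write(fd, offset, nblocks);
}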

#2 Christoph Berg <myon@debian.org>
In reply to: Andres Freund (#1)
could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Re: Andres Freund

Add smgrzeroextend(), FileZero(), FileFallocate()

Hi,

I'm often seeing PG16 builds erroring out in the pgbench tests:

00:33:12 make[2]: Entering directory '/<<PKGBUILDDIR>>/build/src/bin/pgbench'
00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && /bin/mkdir -p '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && cd /<<PKGBUILDDIR>>/build/../src/bin/pgbench && TESTLOGDIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/log' TESTDATADIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check' PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/postgresql/16/bin:/<<PKGBUILDDIR>>/build/src/bin/pgbench:$PATH" LD_LIBRARY_PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/aarch64-linux-gnu" PGPORT='65432' top_builddir='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../..' PG_REGRESS='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../../src/test/regress/pg_regress' /usr/bin/prove -I /<<PKGBUILDDIR>>/build/../src/test/perl/ -I /<<PKGBUILDDIR>>/build/../src/bin/pgbench --verbose t/*.pl
00:33:12 # +++ tap check in src/bin/pgbench +++
00:33:14 # Failed test 'concurrent OID generation status (got 2 vs expected 0)'
00:33:14 # at t/001_pgbench_with_server.pl line 31.
00:33:14 # Failed test 'concurrent OID generation stdout /(?^:processed: 125/125)/'
00:33:14 # at t/001_pgbench_with_server.pl line 31.
00:33:14 # 'pgbench (16devel (Debian 16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc))
00:33:14 # transaction type: /<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert
00:33:14 # scaling factor: 1
00:33:14 # query mode: prepared
00:33:14 # number of clients: 5
00:33:14 # number of threads: 1
00:33:14 # maximum number of tries: 1
00:33:14 # number of transactions per client: 25
00:33:14 # number of transactions actually processed: 118/125
00:33:14 # number of failed transactions: 0 (0.000%)
00:33:14 # latency average = 26.470 ms
00:33:14 # initial connection time = 66.583 ms
00:33:14 # tps = 188.889760 (without initial connection time)
00:33:14 # '
00:33:14 # doesn't match '(?^:processed: 125/125)'
00:33:14 # Failed test 'concurrent OID generation stderr /(?^:^$)/'
00:33:14 # at t/001_pgbench_with_server.pl line 31.
00:33:14 # 'pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
00:33:14 # HINT: Check free disk space.
00:33:14 # pgbench: error: Run was aborted; the above results are incomplete.
00:33:14 # '
00:33:14 # doesn't match '(?^:^$)'
00:33:26 # Looks like you failed 3 tests of 428.
00:33:26 t/001_pgbench_with_server.pl ..
00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0)

I don't think the disk is full since it's always hitting that same
spot, on some of the builds:

https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/

This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
that test works though, and the FS seems to support posix_fallocate:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	int f;
	int err;

	if ((f = open("moo", O_CREAT | O_RDWR, 0666)) < 0)
		perror("open");

	/* posix_fallocate() returns the error code rather than setting errno */
	err = posix_fallocate(f, 0, 10);
	errno = err;
	perror("posix_fallocate");

	return 0;
}

$ ./a.out
posix_fallocate: Success

The problem has been there for some weeks - I didn't report it earlier
as I was on vacation, in the free time trying to bootstrap s390x
support for apt.pg.o, and there was this other direct IO problem
making all the builds fail for some time.

Christoph

#3 Melanie Plageman <melanieplageman@gmail.com>
In reply to: Christoph Berg (#2)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

On Mon, Apr 24, 2023 at 10:53:35AM +0200, Christoph Berg wrote:

Re: Andres Freund

Add smgrzeroextend(), FileZero(), FileFallocate()

Hi,

I'm often seeing PG16 builds erroring out in the pgbench tests:

00:33:12 make[2]: Entering directory '/<<PKGBUILDDIR>>/build/src/bin/pgbench'
00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && /bin/mkdir -p '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && cd /<<PKGBUILDDIR>>/build/../src/bin/pgbench && TESTLOGDIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/log' TESTDATADIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check' PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/postgresql/16/bin:/<<PKGBUILDDIR>>/build/src/bin/pgbench:$PATH" LD_LIBRARY_PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/aarch64-linux-gnu" PGPORT='65432' top_builddir='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../..' PG_REGRESS='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../../src/test/regress/pg_regress' /usr/bin/prove -I /<<PKGBUILDDIR>>/build/../src/test/perl/ -I /<<PKGBUILDDIR>>/build/../src/bin/pgbench --verbose t/*.pl
00:33:12 # +++ tap check in src/bin/pgbench +++
00:33:14 # Failed test 'concurrent OID generation status (got 2 vs expected 0)'
00:33:14 # at t/001_pgbench_with_server.pl line 31.
00:33:14 # Failed test 'concurrent OID generation stdout /(?^:processed: 125/125)/'
00:33:14 # at t/001_pgbench_with_server.pl line 31.
00:33:14 # 'pgbench (16devel (Debian 16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc))
00:33:14 # transaction type: /<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert
00:33:14 # scaling factor: 1
00:33:14 # query mode: prepared
00:33:14 # number of clients: 5
00:33:14 # number of threads: 1
00:33:14 # maximum number of tries: 1
00:33:14 # number of transactions per client: 25
00:33:14 # number of transactions actually processed: 118/125
00:33:14 # number of failed transactions: 0 (0.000%)
00:33:14 # latency average = 26.470 ms
00:33:14 # initial connection time = 66.583 ms
00:33:14 # tps = 188.889760 (without initial connection time)
00:33:14 # '
00:33:14 # doesn't match '(?^:processed: 125/125)'
00:33:14 # Failed test 'concurrent OID generation stderr /(?^:^$)/'
00:33:14 # at t/001_pgbench_with_server.pl line 31.
00:33:14 # 'pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
00:33:14 # HINT: Check free disk space.
00:33:14 # pgbench: error: Run was aborted; the above results are incomplete.
00:33:14 # '
00:33:14 # doesn't match '(?^:^$)'
00:33:26 # Looks like you failed 3 tests of 428.
00:33:26 t/001_pgbench_with_server.pl ..
00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0)

I don't think the disk is full since it's always hitting that same
spot, on some of the builds:

https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/

This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
that test works though, and the FS seems to support posix_fallocate:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	int f;
	int err;

	if ((f = open("moo", O_CREAT | O_RDWR, 0666)) < 0)
		perror("open");

	/* posix_fallocate() returns the error code rather than setting errno */
	err = posix_fallocate(f, 0, 10);
	errno = err;
	perror("posix_fallocate");

	return 0;
}

$ ./a.out
posix_fallocate: Success

The problem has been there for some weeks - I didn't report it earlier
as I was on vacation, in the free time trying to bootstrap s390x
support for apt.pg.o, and there was this other direct IO problem
making all the builds fail for some time.

I noticed that dsm_impl_posix_resize() retries in a do ... while (rc == EINTR)
loop and FileFallocate() doesn't. From what the comment in
dsm_impl_posix_resize() says and some cursory googling, posix_fallocate()
isn't restarted automatically on most systems, so such a retry loop is
commonly used. Is there a reason it isn't used in FileFallocate(), I wonder?
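
For illustration, the kind of loop being described might look like this
standalone sketch (an assumption about the shape, not the actual dsm_impl.c
or fd.c code):

#include <errno.h>
#include <fcntl.h>

/* Reissue posix_fallocate() while it reports EINTR. */
static int
fallocate_retry_eintr(int fd, off_t offset, off_t len)
{
	int			rc;

	do
	{
		rc = posix_fallocate(fd, offset, len);
	} while (rc == EINTR);

	/* posix_fallocate() returns the error code instead of setting errno */
	if (rc != 0)
		errno = rc;
	return rc == 0 ? 0 : -1;
}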

- Melanie

#4 Andres Freund <andres@anarazel.de>
In reply to: Christoph Berg (#2)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Hi,

On 2023-04-24 10:53:35 +0200, Christoph Berg wrote:

I'm often seeing PG16 builds erroring out in the pgbench tests:

Interesting!

I don't think the disk is full since it's always hitting that same
spot, on some of the builds:

Yea, the EINTR pretty clearly indicates that it's not really out-of-space.

https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/

This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
that test works though, and the FS seems to support posix_fallocate:

I guess it requires a bunch of memory (?) pressure for this to happen
(triggering blocking during fallocate, opening the window for a signal to
arrive), which likely only happens when running things concurrently.

We obviously can add a retry loop to FileFallocate(), similar to what's
already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
further, and do it for all the fd.c routines where it's remotely plausible
EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
the functions.

The following are documented to potentially return EINTR, without fd.c having
code to retry:

- FileWriteback() / pg_flush_data()
- FileSync() / pg_fsync()
- FileFallocate()
- FileTruncate()

With the first two there's the added complication that it's not entirely
obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
latter is a bit more sensible?

Greetings,

Andres Freund

#5 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#4)
1 attachment(s)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Hi,

On 2023-04-24 15:32:25 -0700, Andres Freund wrote:

On 2023-04-24 10:53:35 +0200, Christoph Berg wrote:

I'm often seeing PG16 builds erroring out in the pgbench tests:
I don't think the disk is full since it's always hitting that same
spot, on some of the builds:

Yea, the EINTR pretty clearly indicates that it's not really out-of-space.

FWIW, I tried to reproduce this, without success - not too surprising, I
assume it's rather timing dependent.

We obviously can add a retry loop to FileFallocate(), similar to what's
already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
further, and do it for all the fd.c routines where it's remotely plausible
EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
the functions.

The following are documented to potentially return EINTR, without fd.c having
code to retry:

- FileWriteback() / pg_flush_data()
- FileSync() / pg_fsync()
- FileFallocate()
- FileTruncate()

With the first two there's the added complication that it's not entirely
obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
latter is a bit more sensible?

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

Christoph, could you verify this fixes your issue?

Greetings,

Andres Freund

Attachments:

v1-0001-fd.c-Retry-after-EINTR-in-more-places.patch (text/x-diff; charset=us-ascii)
From d8f5b0d4765044a0f35d01054c9d2720c6045b4c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Apr 2023 16:53:52 -0700
Subject: [PATCH v1] fd.c: Retry after EINTR in more places

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 71 +++++++++++++++++++++++++++++------
 1 file changed, 59 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 277a28fc137..2b232a80489 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -415,10 +415,18 @@ pg_fsync(int fd)
 int
 pg_fsync_no_writethrough(int fd)
 {
-	if (enableFsync)
-		return fsync(fd);
-	else
+	int		rc;
+
+	if (!enableFsync)
 		return 0;
+
+retry:
+	rc = fsync(fd);
+
+	if (rc == -1 && errno == EINTR)
+		goto retry;
+
+	return rc;
 }
 
 /*
@@ -448,10 +456,18 @@ pg_fsync_writethrough(int fd)
 int
 pg_fdatasync(int fd)
 {
-	if (enableFsync)
-		return fdatasync(fd);
-	else
+	int		rc;
+
+	if (!enableFsync)
 		return 0;
+
+retry:
+	rc = fdatasync(fd);
+
+	if (rc == -1 && errno == EINTR)
+		goto retry;
+
+	return rc;
 }
 
 /*
@@ -483,6 +499,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		if (not_implemented_by_kernel)
 			return;
 
+retry:
 		/*
 		 * sync_file_range(SYNC_FILE_RANGE_WRITE), currently linux specific,
 		 * tells the OS that writeback for the specified blocks should be
@@ -498,6 +515,9 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		{
 			int			elevel;
 
+			if (rc == EINTR)
+				goto retry;
+
 			/*
 			 * For systems that don't have an implementation of
 			 * sync_file_range() such as Windows WSL, generate only one
@@ -629,32 +649,51 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 #endif
 }
 
+static int
+pg_ftruncate(int fd, off_t length)
+{
+	int			ret;
+
+retry:
+	ret = ftruncate(fd, length);
+
+	if (ret == -1 && errno == EINTR)
+		goto retry;
+
+	return ret;
+}
+
 /*
  * Truncate a file to a given length by name.
  */
 int
 pg_truncate(const char *path, off_t length)
 {
+	int			ret;
 #ifdef WIN32
 	int			save_errno;
-	int			ret;
 	int			fd;
 
 	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
 	if (fd >= 0)
 	{
-		ret = ftruncate(fd, length);
+		ret = pg_ftruncate(fd, length);
 		save_errno = errno;
 		CloseTransientFile(fd);
 		errno = save_errno;
 	}
 	else
 		ret = -1;
+#else
+
+retry:
+	ret = truncate(path, length);
+
+	if (ret == -1 && errno == EINTR)
+		goto retry;
+#endif
 
 	return ret;
-#else
-	return truncate(path, length);
-#endif
 }
 
 /*
@@ -2001,11 +2040,15 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return returnCode;
 
+retry:
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
 							   POSIX_FADV_WILLNEED);
 	pgstat_report_wait_end();
 
+	if (returnCode == EINTR)
+		goto retry;
+
 	return returnCode;
 #else
 	Assert(FileIsValid(file));
@@ -2281,12 +2324,16 @@ FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return -1;
 
+retry:
+
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
 	pgstat_report_wait_end();
 
 	if (returnCode == 0)
 		return 0;
+	else if (returnCode == EINTR)
+		goto retry;
 
 	/* for compatibility with %m printing etc */
 	errno = returnCode;
@@ -2334,7 +2381,7 @@ FileTruncate(File file, off_t offset, uint32 wait_event_info)
 		return returnCode;
 
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = ftruncate(VfdCache[file].fd, offset);
+	returnCode = pg_ftruncate(VfdCache[file].fd, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode == 0 && VfdCache[file].fileSize > offset)
-- 
2.38.0

#6 Christoph Berg <myon@debian.org>
In reply to: Andres Freund (#5)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Re: Andres Freund

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

Christoph, could you verify this fixes your issue?

Everything green with the patch applied. Thanks!

https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/839/

Christoph

#7 Thomas Munro <thomas.munro@gmail.com>
In reply to: Andres Freund (#5)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-04-24 15:32:25 -0700, Andres Freund wrote:

We obviously can add a retry loop to FileFallocate(), similar to what's
already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
further, and do it for all the fd.c routines where it's remotely plausible
EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
the functions.

The following are documented to potentially return EINTR, without fd.c having
code to retry:

- FileWriteback() / pg_flush_data()
- FileSync() / pg_fsync()
- FileFallocate()
- FileTruncate()

With the first two there's the added complication that it's not entirely
obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
latter is a bit more sensible?

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

One problem we ran into with the shm_open() case (which is nearly
identical under the covers, since shm_open() just opens a file in a
tmpfs on Linux) was that a simple retry loop like this could never
terminate if the process was receiving a lot of signals from the
recovery process, which is why we went with the idea of masking
signals instead. Eventually we should probably grow the file in
smaller chunks with a CFI in between so that we both guarantee that we
make progress (by masking for smaller size increases) and service
interrupts in a timely fashion (by unmasking between loops). I don't
think that applies here because we're not trying to fallocate
humongous size increases in one go, but I just want to note that we're
making a different choice. I think this looks reasonable for the use
cases we actually have here.
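
To illustrate the chunked-growth idea in plain C (a sketch only, not
PostgreSQL code: the chunk size, the block-everything signal mask, and the
process_interrupts() stub are assumptions standing in for what a more
targeted mask and CHECK_FOR_INTERRUPTS() would be in the real tree):

#include <errno.h>
#include <fcntl.h>
#include <signal.h>

#define CHUNK_BYTES (8 * 1024 * 1024)

static void
process_interrupts(void)
{
	/* In PostgreSQL this would be CHECK_FOR_INTERRUPTS(); no-op here. */
}

static int
fallocate_chunked(int fd, off_t offset, off_t len)
{
	sigset_t	block_all,
				saved;

	sigfillset(&block_all);

	while (len > 0)
	{
		off_t		chunk = len > CHUNK_BYTES ? CHUNK_BYTES : len;
		int			rc;

		/*
		 * Mask signals around one bounded allocation, so each step is
		 * guaranteed to make progress and cannot be starved by EINTR.
		 */
		sigprocmask(SIG_BLOCK, &block_all, &saved);
		rc = posix_fallocate(fd, offset, chunk);
		sigprocmask(SIG_SETMASK, &saved, NULL);

		if (rc != 0)
		{
			errno = rc;
			return -1;
		}
		offset += chunk;
		len -= chunk;

		/* Signals are unmasked here; give pending interrupts a chance. */
		process_interrupts();
	}
	return 0;
}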

#8 Christoph Berg <myon@debian.org>
In reply to: Andres Freund (#5)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Re: Andres Freund

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

Christoph, could you verify this fixes your issue?

Hi,

I believe this issue is still open for PG16.

Christoph

#9 Michael Paquier <michael@paquier.xyz>
In reply to: Christoph Berg (#8)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

On Tue, May 23, 2023 at 04:25:59PM +0200, Christoph Berg wrote:

I believe this issue is still open for PG16.

Right. I've added an item to the list, to not forget.
--
Michael

#10 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Thomas Munro (#7)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-04-24 15:32:25 -0700, Andres Freund wrote:

We obviously can add a retry loop to FileFallocate(), similar to what's
already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
further, and do it for all the fd.c routines where it's remotely plausible
EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
the functions.

The following are documented to potentially return EINTR, without fd.c having
code to retry:

- FileWriteback() / pg_flush_data()
- FileSync() / pg_fsync()
- FileFallocate()
- FileTruncate()

With the first two there's the added complication that it's not entirely
obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
latter is a bit more sensible?

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

One problem we ran into with the shm_open() case (which is nearly
identical under the covers, since shm_open() just opens a file in a
tmpfs on Linux) was that a simple retry loop like this could never
terminate if the process was receiving a lot of signals from the
recovery process, which is why we went with the idea of masking
signals instead. Eventually we should probably grow the file in
smaller chunks with a CFI in between so that we both guarantee that we
make progress (by masking for smaller size increases) and service
interrupts in a timely fashion (by unmasking between loops). I don't
think that applies here because we're not trying to fallocate
humongous size increases in one go, but I just want to note that we're
making a different choice. I think this looks reasonable for the use
cases we actually have here.

FWIW I share the same feeling about looping by EINTR without signals
being blocked. If we just retry the same operation without processing
signals after getting EINTR, I think blocking signals is better. We
could block signals more gracefully, but I'm not sure it's worth the
complexity.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11 Andres Freund <andres@anarazel.de>
In reply to: Kyotaro Horiguchi (#10)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Hi,

On 2023-05-24 10:56:28 +0900, Kyotaro Horiguchi wrote:

At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-04-24 15:32:25 -0700, Andres Freund wrote:

We obviously can add a retry loop to FileFallocate(), similar to what's
already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
further, and do it for all the fd.c routines where it's remotely plausible
EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
the functions.

The following are documented to potentially return EINTR, without fd.c having
code to retry:

- FileWriteback() / pg_flush_data()
- FileSync() / pg_fsync()
- FileFallocate()
- FileTruncate()

With the first two there's the added complication that it's not entirely
obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
latter is a bit more sensible?

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

One problem we ran into with the shm_open() case (which is nearly
identical under the covers, since shm_open() just opens a file in a
tmpfs on Linux) was that a simple retry loop like this could never
terminate if the process was receiving a lot of signals from the
recovery process, which is why we went with the idea of masking
signals instead. Eventually we should probably grow the file in
smaller chunks with a CFI in between so that we both guarantee that we
make progress (by masking for smaller size increases) and service
interrupts in a timely fashion (by unmasking between loops). I don't
think that applies here because we're not trying to fallocate
humongous size increases in one go, but I just want to note that we're
making a different choice. I think this looks reasonable for the use
cases we actually have here.

FWIW I share the same feeling about looping by EINTR without signals
being blocked. If we just retry the same operation without processing
signals after getting EINTR, I think blocking signals is better. We
could block signals more gracefully, but I'm not sure it's worth the
complexity.

I seriously doubt it's a good path to go down in this case. As Thomas
mentioned, this case isn't really comparable to the shm_open() one, due to the
bounded vs unbounded amount of memory we're dealing with.

What would be the benefit?

Greetings,

Andres Freund

#12 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Andres Freund (#11)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

At Tue, 23 May 2023 19:28:45 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2023-05-24 10:56:28 +0900, Kyotaro Horiguchi wrote:

At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-04-24 15:32:25 -0700, Andres Freund wrote:

We obviously can add a retry loop to FileFallocate(), similar to what's
already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
further, and do it for all the fd.c routines where it's remotely plausible
EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
the functions.

The following are documented to potentially return EINTR, without fd.c having
code to retry:

- FileWriteback() / pg_flush_data()
- FileSync() / pg_fsync()
- FileFallocate()
- FileTruncate()

With the first two there's the added complication that it's not entirely
obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
latter is a bit more sensible?

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

One problem we ran into with the shm_open() case (which is nearly
identical under the covers, since shm_open() just opens a file in a
tmpfs on Linux) was that a simple retry loop like this could never
terminate if the process was receiving a lot of signals from the
recovery process, which is why we went with the idea of masking
signals instead. Eventually we should probably grow the file in
smaller chunks with a CFI in between so that we both guarantee that we
make progress (by masking for smaller size increases) and service
interrupts in a timely fashion (by unmasking between loops). I don't
think that applies here because we're not trying to fallocate
humongous size increases in one go, but I just want to note that we're
making a different choice. I think this looks reasonable for the use
cases we actually have here.

FWIW I share the same feeling about looping by EINTR without signals
being blocked. If we just retry the same operation without processing
signals after getting EINTR, I think blocking signals is better. We
could block signals more gracefully, but I'm not sure it's worth the
complexity.

I seriously doubt it's a good path to go down in this case. As Thomas
mentioned, this case isn't really comparable to the shm_open() one, due to the
bounded vs unbounded amount of memory we're dealing with.

What would be the benefit?

I'm not certain what you mean by "it" here. Regarding signal blocking,
the benefit would be a lower chance of being constantly interrupted by
a string of frequent signals, which can't be avoided just by looping.
From what I gathered, Thomas meant that we don't need to use chunking
to prevent long periods of ignoring interrupts because we're only
extending a file by a few blocks. However, I might have misunderstood.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13 Alvaro Herrera <alvherre@alvh.no-ip.org>
In reply to: Andres Freund (#5)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

On 2023-Apr-24, Andres Freund wrote:

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

Christoph, could you verify this fixes your issue?

So, is anyone making progress on this? I don't see anything in the
thread.

On adding the missing pg_* wrappers: I think if we don't (and we leave
the retry loops at the File* layer), then the risk is that some external
code would add calls to the underlying File* routines trusting them to
do the retrying, which would then become broken when we move the retry
loops to the pg_* wrappers when we add them. That doesn't seem terribly
serious to me.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte"
(Ijon Tichy en Viajes, Stanislaw Lem)

#14 Christoph Berg <myon@debian.org>
In reply to: Alvaro Herrera (#13)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Re: Alvaro Herrera

Christoph, could you verify this fixes your issue?

So, is anyone making progress on this? I don't see anything in the
thread.

Well, I had reported that I haven't been seeing any problems since I
applied the patch to the postgresql-16.deb package. So for me, the
patch looks like it solves the problem.

Christoph

#15 Andres Freund <andres@anarazel.de>
In reply to: Alvaro Herrera (#13)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Hi,

On 2023-06-06 21:53:00 +0200, Alvaro Herrera wrote:

On 2023-Apr-24, Andres Freund wrote:

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable. I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

Christoph, could you verify this fixes your issue?

So, is anyone making progress on this? I don't see anything in the
thread.

Thanks for bringing it up again, I had lost track. I now added an open items
entry.

My gut feeling is that we should go with something quite minimal at this
stage.

On adding the missing pg_* wrappers: I think if we don't (and we leave
the retry loops at the File* layer), then the risk is that some external
code would add calls to the underlying File* routines trusting them to
do the retrying, which would then become broken when we move the retry
loops to the pg_* wrappers when we add them. That doesn't seem terribly
serious to me.

I'm not too worried about that either.

Unless somebody strongly advocates a different path, I plan to push something
along the lines of the prototype I had posted. After reading it over a bunch
more times, I think some of the code is a bit finicky.

I wish we had some hack that made syscalls return EINTR at random intervals,
just to make it realistic to test these paths...
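
One conceivable shape for such a hack, sketched here as an assumption (nothing
like this exists in the tree): install a SIGALRM handler without SA_RESTART
and arm a repeating interval timer, so that slow syscalls in the process are
periodically interrupted with EINTR while a test workload runs.

#include <signal.h>
#include <string.h>
#include <sys/time.h>

static void
eintr_tick(int signo)
{
	(void) signo;				/* the handler only needs to exist */
}

/* Start firing SIGALRM every interval_usec microseconds, without SA_RESTART. */
static int
enable_eintr_injection(long interval_usec)
{
	struct sigaction sa;
	struct itimerval it;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = eintr_tick;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;			/* deliberately no SA_RESTART */
	if (sigaction(SIGALRM, &sa, NULL) != 0)
		return -1;

	it.it_interval.tv_sec = 0;
	it.it_interval.tv_usec = interval_usec;
	it.it_value = it.it_interval;
	return setitimer(ITIMER_REAL, &it, NULL);
}

(In a backend this would collide with PostgreSQL's own use of SIGALRM for
timeouts, so it is only meant as a standalone illustration.)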

Greetings,

Andres Freund

#16 Andres Freund <andres@anarazel.de>
In reply to: Andres Freund (#15)
1 attachment(s)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Hi Tom,

Unfortunately, due to some personal life business, it took until now for me to
feel comfortable pushing the fix for
/messages/by-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
(FileFallocate() erroring out with EINTR due to running on tmpfs).

Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
with the patch and patch + test infra, and they all passed.

Greetings,

Andres Freund

Attachments:

v2-0001-fd.c-Retry-after-EINTR-in-more-places.patch (text/x-diff; charset=us-ascii)
From 6e67692231de22b6e347d330aacad72a6fdbd89a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 19 Jun 2023 10:26:30 -0700
Subject: [PATCH v2] fd.c: Retry after EINTR in more places

Starting with 4d330a61bb1 we can use posix_fallocate() to extend
files. Unfortunately in some situations, e.g. on tmpfs filesystems, EINTR may
be returned. See also 4518c798b2b.

To fix, add a retry path to FileFallocate(). In contrast to 4518c798b2b the
amount we extend by is limited and the extending may happen at a high
frequency, so disabling signals does not appear to be the correct path here.

Also add retry paths to other file operations currently lacking them (around
fdatasync(), fsync(), ftruncate(), posix_fadvise(), sync_file_range(),
truncate()) - they are all documented or have been observed to return EINTR.

Even though most of these functions are used in the back branches, it does not
seem worth the risk to backpatch - outside of the new-to-16 case of
posix_fallocate() I am not aware of problem reports due to the lack of
retries.

Reported-by: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/ZEZDj1H61ryrmY9o@msg.df7cb.de
Backpatch: -
---
 src/backend/storage/file/fd.c | 73 +++++++++++++++++++++++++++++------
 1 file changed, 61 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 173476789c7..db39186f058 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -415,10 +415,18 @@ pg_fsync(int fd)
 int
 pg_fsync_no_writethrough(int fd)
 {
-	if (enableFsync)
-		return fsync(fd);
-	else
+	int		rc;
+
+	if (!enableFsync)
 		return 0;
+
+retry:
+	rc = fsync(fd);
+
+	if (rc == -1 && errno == EINTR)
+		goto retry;
+
+	return rc;
 }
 
 /*
@@ -448,10 +456,18 @@ pg_fsync_writethrough(int fd)
 int
 pg_fdatasync(int fd)
 {
-	if (enableFsync)
-		return fdatasync(fd);
-	else
+	int		rc;
+
+	if (!enableFsync)
 		return 0;
+
+retry:
+	rc = fdatasync(fd);
+
+	if (rc == -1 && errno == EINTR)
+		goto retry;
+
+	return rc;
 }
 
 /*
@@ -483,6 +499,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		if (not_implemented_by_kernel)
 			return;
 
+retry:
 		/*
 		 * sync_file_range(SYNC_FILE_RANGE_WRITE), currently linux specific,
 		 * tells the OS that writeback for the specified blocks should be
@@ -498,6 +515,9 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		{
 			int			elevel;
 
+			if (rc == EINTR)
+				goto retry;
+
 			/*
 			 * For systems that don't have an implementation of
 			 * sync_file_range() such as Windows WSL, generate only one
@@ -629,32 +649,54 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 #endif
 }
 
+/*
+ * Truncate an open file to a given length.
+ */
+static int
+pg_ftruncate(int fd, off_t length)
+{
+	int			ret;
+
+retry:
+	ret = ftruncate(fd, length);
+
+	if (ret == -1 && errno == EINTR)
+		goto retry;
+
+	return ret;
+}
+
 /*
  * Truncate a file to a given length by name.
  */
 int
 pg_truncate(const char *path, off_t length)
 {
+	int			ret;
 #ifdef WIN32
 	int			save_errno;
-	int			ret;
 	int			fd;
 
 	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
 	if (fd >= 0)
 	{
-		ret = ftruncate(fd, length);
+		ret = pg_ftruncate(fd, length);
 		save_errno = errno;
 		CloseTransientFile(fd);
 		errno = save_errno;
 	}
 	else
 		ret = -1;
+#else
+
+retry:
+	ret = truncate(path, length);
+
+	if (ret == -1 && errno == EINTR)
+		goto retry;
+#endif
 
 	return ret;
-#else
-	return truncate(path, length);
-#endif
 }
 
 /*
@@ -2001,11 +2043,15 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return returnCode;
 
+retry:
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
 							   POSIX_FADV_WILLNEED);
 	pgstat_report_wait_end();
 
+	if (returnCode == EINTR)
+		goto retry;
+
 	return returnCode;
 #else
 	Assert(FileIsValid(file));
@@ -2281,12 +2327,15 @@ FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return -1;
 
+retry:
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
 	pgstat_report_wait_end();
 
 	if (returnCode == 0)
 		return 0;
+	else if (returnCode == EINTR)
+		goto retry;
 
 	/* for compatibility with %m printing etc */
 	errno = returnCode;
@@ -2334,7 +2383,7 @@ FileTruncate(File file, off_t offset, uint32 wait_event_info)
 		return returnCode;
 
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = ftruncate(VfdCache[file].fd, offset);
+	returnCode = pg_ftruncate(VfdCache[file].fd, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode == 0 && VfdCache[file].fileSize > offset)
-- 
2.38.0

#17 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#16)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Andres Freund <andres@anarazel.de> writes:

Unfortunately, due to some personal life business, it took until now for me to
feel comfortable pushing the fix for
/messages/by-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
(FileFallocate() erroring out with EINTR due to running on tmpfs).
Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
with the patch and patch + test infra, and they all passed.

We still have a week till beta2 wrap, so I'd say push. If the buildfarm
gets unhappy you can revert.

regards, tom lane

#18 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#17)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Hi,

On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

Unfortunately, due to some personal life business, it took until now for me to
feel comfortable pushing the fix for
/messages/by-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
(FileFallocate() erroring out with EINTR due to running on tmpfs).
Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
with the patch and patch + test infra, and they all passed.

We still have a week till beta2 wrap, so I'd say push. If the buildfarm
gets unhappy you can revert.

Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#19 Amit Kapila <amit.kapila16@gmail.com>
In reply to: Andres Freund (#18)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

On Mon, Jun 19, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:

On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

Unfortunately, due to some personal life business, it took until now for me to
feel comfortable pushing the fix for
/messages/by-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
(FileFallocate() erroring out with EINTR due to running on tmpfs).
Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
with the patch and patch + test infra, and they all passed.

We still have a week till beta2 wrap, so I'd say push. If the buildfarm
gets unhappy you can revert.

Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3

Can we close the open item corresponding to this after your commit 0d369ac650?

--
With Regards,
Amit Kapila.

#20 Andres Freund <andres@anarazel.de>
In reply to: Amit Kapila (#19)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

Hi,

On 2023-06-21 08:54:48 +0530, Amit Kapila wrote:

On Mon, Jun 19, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:

On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

Unfortunately, due to some personal life business, it took until now for me to
feel comfortable pushing the fix for
/messages/by-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
(FileFallocate() erroring out with EINTR due to running on tmpfs).
Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
with the patch and patch + test infra, and they all passed.

We still have a week till beta2 wrap, so I'd say push. If the buildfarm
gets unhappy you can revert.

Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3

Can we close the open item corresponding to this after your commit 0d369ac650?

Yes, sorry for forgetting that. Done now.

Greetings,

Andres Freund