Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Started by Jelte Fennema-Nio · 11 months ago · 29 messages
#1Jelte Fennema-Nio
postgres@jeltef.nl
1 attachment(s)

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason this hasn't changed is
that doing so would break legacy programs that use the select(2) system
call in hard-to-debug ways. So instead, programs that want to opt in to
a higher open file limit are expected to bump their soft limit to their
hard limit on startup. Details on this are very well explained in a
blogpost by the systemd author[1]. There's also a similar change done
by the Go language[2].

So this starts bumping the soft open file limit of postmaster and
pgbench to the hard open file limit. Doing so is especially useful for
the AIO work that Andres is doing, because io_uring consumes a lot of
file descriptors. But even without looking at AIO there are many
reports from people who need to change their soft file limit before
starting postgres, sometimes falling back to lowering
max_files_per_process when they fail to do so[3-8]. It's also not all
that strange to fail at setting the soft open file limit because there
are multiple places where one can configure such limits and usually only
one of them is effective (which one depends on how Postgres is started).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us

Attachments:

v1-0001-Bump-soft-open-file-limit-RLIMIT_NOFILE-to-hard-l.patch (text/x-patch)
From ecf7fc5f34b4e025eaa4dda809874f18dfdb24b4 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v1] Bump soft open file limit (RLIMIT_NOFILE) to hard limit on
 startup

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason this hasn't changed is
that doing so would break legacy programs that use the select(2) system
call in hard-to-debug ways. So instead, programs that want to opt in to
a higher open file limit are expected to bump their soft limit to their
hard limit on startup. Details on this are very well explained in a
blogpost by the systemd author[1]. There's also a similar change done
by the Go language[2].

So this starts bumping the soft open file limit of postmaster and
pgbench to the hard open file limit. Doing so is especially useful for
the AIO work that Andres is doing, because io_uring consumes a lot of
file descriptors. But even without looking at AIO there are many
reports from people who need to change their soft file limit before
starting postgres, sometimes falling back to lowering
max_files_per_process when they fail to do so[3-8]. It's also not all
that strange to fail at setting the soft open file limit because there
are multiple places where one can configure such limits and usually only
one of them is effective (which one depends on how Postgres is started).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 src/backend/postmaster/postmaster.c | 22 ++++++++++++++++++++++
 src/bin/pgbench/pgbench.c           |  5 +++++
 2 files changed, 27 insertions(+)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bb22b13adef..3c2fbebfa62 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -72,6 +72,7 @@
 #include <ctype.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
+#include <sys/resource.h>
 #include <fcntl.h>
 #include <sys/param.h>
 #include <netdb.h>
@@ -406,6 +407,8 @@ static DNSServiceRef bonjour_sdref = NULL;
 /*
  * postmaster.c - function prototypes
  */
+
+static void ConfigureFileLimit(void);
 static void CloseServerPorts(int status, Datum arg);
 static void unlink_external_pid_file(int status, Datum arg);
 static void getInstallationPaths(const char *argv0);
@@ -573,6 +576,8 @@ PostmasterMain(int argc, char *argv[])
 	/* Begin accepting signals. */
 	sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
 
+	ConfigureFileLimit();
+
 	/*
 	 * Options setup
 	 */
@@ -1393,6 +1398,23 @@ PostmasterMain(int argc, char *argv[])
 	abort();					/* not reached */
 }
 
+static void
+ConfigureFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	struct rlimit rlim;
+
+	if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		elog(WARNING, "could not get open file limit: %m");
+		return;
+	}
+
+	rlim.rlim_cur = rlim.rlim_max;
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+		elog(WARNING, "could not set open file limit: %m");
+#endif
+}
 
 /*
  * on_proc_exit callback to close server's listen sockets
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f303bdeec8d..161b231d591 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6794,6 +6794,11 @@ main(int argc, char **argv)
 #ifdef HAVE_GETRLIMIT
 				if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
 					pg_fatal("getrlimit failed: %m");
+
+				rlim.rlim_cur = rlim.rlim_max;
+				if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
+					pg_fatal("setrlimit failed: %m");
+
 				if (rlim.rlim_cur < nclients + 3)
 				{
 					pg_log_error("need at least %d open files, but system limit is %ld",

base-commit: c366d2bdba7c3b9b2cca1429d4535866e231ca55
-- 
2.43.0

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#1)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Jelte Fennema-Nio <postgres@jeltef.nl> writes:

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason this hasn't changed is
that doing so would break legacy programs that use the select(2) system
call in hard-to-debug ways. So instead, programs that want to opt in to
a higher open file limit are expected to bump their soft limit to their
hard limit on startup. Details on this are very well explained in a
blogpost by the systemd author[1].

On a handy Linux machine (running RHEL9):

$ ulimit -n
1024
$ ulimit -n -H
524288

I'm okay with believing that 1024 is unreasonably small, but that
doesn't mean I think half a million is a safe value. (Remember that
that's *per backend*.) Postgres has run OSes out of FDs in the past
and I don't believe we couldn't do it again.

Also, the argument you cite is completely recent-Linux-centric
and does not consider the likely effects on other platforms.
To take one example, on current macOS:

$ ulimit -n
4864
$ ulimit -n -H
unlimited

(Hm, so Apple wasn't impressed by the "let's not break select(2)"
argument. But I digress.)

I'm afraid this patch will replace "you need to tune ulimit -n
to get best performance" with "you need to tune ulimit -n to
avoid crashing your system". Does not sound like an improvement.

Maybe a sanity limit on how high we'll try to raise the ulimit
would help.

regards, tom lane

#3Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema-Nio (#1)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Hi,

On 2025-02-11 19:52:34 +0100, Jelte Fennema-Nio wrote:

So this starts bumping postmaster and pgbench its soft open file limit
to the hard open file limit.

Not sure that's quite the right thing to do for postmaster. What I'd start
with is to increase the soft limit to
"already used files" + max_files_per_process.

That way we still limit "resource" usage, but react better to already used
FDs. If io_uring, listen_addresses, whatnot use FDs,
max_files_per_process would be added on top.

Then we can separately discuss increasing max_files_per_process more
aggressively.

I don't see a downside to just increasing the soft limit for pgbench. It
avoids the stupid cycle of getting "need at least %d open files, but system
limit is %ld", increase ulimit, retry, without any non-theoretical downsides.

Doing so is especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors.

Yep.

One more reason this is a good idea is that we'll also need this for
threading, since there all client connections obviously will eat into the
"normal file descriptor" budget.

Greetings,

Andres Freund

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

I wrote:

Maybe a sanity limit on how high we'll try to raise the ulimit
would help.

Oh, I'd forgotten that we already have one: max_files_per_process.
Since that's only 1000 by default, this patch doesn't actually have
any effect (on Linux anyway) unless the DBA raises
max_files_per_process. That alleviates my concern quite a bit.

... but not completely. You didn't read all of Pid Eins' advice:

If said program you hack on forks off foreign programs, make sure
to reset the RLIMIT_NOFILE soft limit back to 1024 for them. Just
because your program might be fine with fds >= 1024 it doesn't
mean that those foreign programs might. And unfortunately
RLIMIT_NOFILE is inherited down the process tree unless explicitly
set.

I think we'd need to pay some attention to that in e.g. COPY FROM
PROGRAM. I also wonder whether plperl, plpython, etc can be
guaranteed not to run any code that depends on select(2).

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Andres Freund <andres@anarazel.de> writes:

I don't see a downside to just increasing the soft limit for
pgbench.

Agreed, that end of the patch seems relatively harmless.

regards, tom lane

#6Tomas Vondra
tomas@vondra.me
In reply to: Tom Lane (#2)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On 2/11/25 20:20, Tom Lane wrote:

Jelte Fennema-Nio <postgres@jeltef.nl> writes:

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason this hasn't changed is
that doing so would break legacy programs that use the select(2) system
call in hard-to-debug ways. So instead, programs that want to opt in to
a higher open file limit are expected to bump their soft limit to their
hard limit on startup. Details on this are very well explained in a
blogpost by the systemd author[1].

On a handy Linux machine (running RHEL9):

$ ulimit -n
1024
$ ulimit -n -H
524288

I'm okay with believing that 1024 is unreasonably small, but that
doesn't mean I think half a million is a safe value. (Remember that
that's *per backend*.) Postgres has run OSes out of FDs in the past
and I don't believe we couldn't do it again.

Also, the argument you cite is completely recent-Linux-centric
and does not consider the likely effects on other platforms.
To take one example, on current macOS:

$ ulimit -n
4864
$ ulimit -n -H
unlimited

(Hm, so Apple wasn't impressed by the "let's not break select(2)"
argument. But I digress.)

I'm afraid this patch will replace "you need to tune ulimit -n
to get best performance" with "you need to tune ulimit -n to
avoid crashing your system". Does not sound like an improvement.

Maybe a sanity limit on how high we'll try to raise the ulimit
would help.

I agree the defaults may be pretty low for current systems, but do we
want to get into the business of picking a value and overriding whatever
value is set by the sysadmin? I don't think a high hard limit should be
seen as an implicit permission to just set it as the soft limit.

Imagine you're a sysadmin / DBA who picks a low soft limit (for whatever
reason - there may be other stuff running on the system, ...). And then
postgres starts and just feels like bumping the soft limit. Sure, the
sysadmin can lower the hard limit and then we'll respect that, but I
don't recall any other tool requiring this approach, and it would
definitely be quite surprising to me.

I did run into bottlenecks due to "too few file descriptors" during
recent experiments with partitioning, which made it pretty trivial to
get into a situation where we start thrashing the VfdCache. I have a
half-written draft of a blog post about that somewhere.

But my conclusion was that it's damn difficult to even realize that's
happening, especially if you don't have access to the OS / perf, etc. So
my takeaway was we should improve that first, so that people have a
chance to realize they have this issue, and can do the tuning. The
improvements I thought about were:

- track hits/misses for the VfdCache (and add a system view for that)

- maybe have wait event for opening/closing file descriptors

- show max_safe_fds value somewhere, not just max_files_per_process
(which we may silently override and use a lower value)

Even if we decide to increase the soft limit somehow, I think it'd still
be useful to make this info available (or at least some of it).

regards

--
Tomas Vondra

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#6)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Tomas Vondra <tomas@vondra.me> writes:

I did run into bottlenecks due to "too few file descriptors" during
recent experiments with partitioning, which made it pretty trivial to
get into a situation where we start thrashing the VfdCache. I have a
half-written draft of a blog post about that somewhere.

But my conclusion was that it's damn difficult to even realize that's
happening, especially if you don't have access to the OS / perf, etc.

Yeah. fd.c does its level best to keep going even with only a few FDs
available, and it's hard to tell that you have a performance problem
arising from that. (Although I recall old war stories about Postgres
continuing to chug along just fine after it'd run the kernel out of
FDs, although every other service on the system was crashing left and
right, making it difficult e.g. even to log in. That scenario is why
I'm resistant to pushing our allowed number of FDs to the moon...)

So
my takeaway was we should improve that first, so that people have a
chance to realize they have this issue, and can do the tuning. The
improvements I thought about were:

- track hits/misses for the VfdCache (and add a system view for that)

I think what we actually would like to know is how often we have to
close an open FD in order to make room to open a different file.
Maybe that's the same thing you mean by "cache miss", but it doesn't
seem like quite the right terminology. Anyway, +1 for adding some way
to discover how often that's happening.

- maybe have wait event for opening/closing file descriptors

Not clear that that helps, at least for this specific issue.

- show max_safe_fds value somewhere, not just max_files_per_process
(which we may silently override and use a lower value)

Maybe we should just assign max_safe_fds back to max_files_per_process
after running set_max_safe_fds? The existence of two variables is a
bit confusing anyhow. I vaguely recall that we had a reason for
keeping them separate, but I can't think of the reasoning now.

regards, tom lane

#8Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#6)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Hi,

On 2025-02-11 21:04:25 +0100, Tomas Vondra wrote:

I agree the defaults may be pretty low for current systems, but do we
want to get into the business of picking a value and overriding whatever
value is set by the sysadmin? I don't think a high hard limit should be
seen as an implicit permission to just set it as the soft limit.

As documented in the links sent by Jelte, that's *explicitly* the reasoning
for the difference between default soft and hard limits. For safety,
programs should opt into using higher FD limits, rather than being
opted in.

Imagine you're a sysadmin / DBA who picks a low soft limit (for whatever
reason - there may be other stuff running on the system, ...). And then
postgres starts and just feels like bumping the soft limit. Sure, the
sysadmin can lower the hard limit and then we'll respect that, but I don't
recall any other tool requiring this approach, and it would definitely be
quite surprising to me.

https://codesearch.debian.net/search?q=setrlimit&literal=1

In a quick skim I found that at least gimp, openjdk, libreoffice, gcc, samba,
dbus increase the soft limit to something closer to the hard limit. And that
was at page 62 out of 1269.

I did run into bottlenecks due to "too few file descriptors" during
recent experiments with partitioning, which made it pretty trivial to
get into a situation where we start thrashing the VfdCache. I have a
half-written draft of a blog post about that somewhere.

But my conclusion was that it's damn difficult to even realize that's
happening, especially if you don't have access to the OS / perf, etc. So my
takeaway was we should improve that first, so that people have a chance to
realize they have this issue, and can do the tuning. The improvements I
thought about were:

Hm, that seems something orthogonal to me. I'm on board with that suggestion,
but I don't see why that should stop us from having code to adjust the rlimit.

My suggestion would be to redefine max_files_per_process as the number of
files we try to be able to open in backends. I.e. set_max_safe_fds() would
first count the number of already open fds (since those will largely be
inherited by child processes) and then check if we can open up to
max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
necessary.

That way we don't increase the number of FDs we use in the default
configuration drastically, but increasing the number only requires a config
change, it doesn't require also figuring out how to increase the settings in
whatever starts postgres. Which, e.g. in cloud environments, typically won't
be possible.

And when using something like io_uring for AIO, it'd allow using
max_files_per_process files in addition to the files required for the
io_uring instances.

Greetings,

Andres Freund

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Andres Freund <andres@anarazel.de> writes:

My suggestion would be to redefine max_files_per_process as the number of
files we try to be able to open in backends. I.e. set_max_safe_fds() would
first count the number of already open fds (since those will largely be
inherited by child processes) and then check if we can open up to
max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
necessary.

Seems plausible. IIRC we also want 10 or so FDs available as "slop"
for code that doesn't go through fd.c.

And when using something like io_uring for AIO, it'd allow using
max_files_per_process files in addition to the files required for the
io_uring instances.

Not following? Surely we'd not be configuring that so early in
postmaster start?

regards, tom lane

#10Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Hi,

On 2025-02-11 16:18:37 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

And when using something like io_uring for AIO, it'd allow using
max_files_per_process files in addition to the files required for the
io_uring instances.

Not following? Surely we'd not be configuring that so early in
postmaster start?

The issue is that, with io_uring, we need to create one FD for each possible
child process, so that one backend can wait for completions for IO issued by
another backend [1]. Those io_uring instances need to be created in
postmaster, so they're visible to each backend. Obviously that makes it
much easier to run into an unadjusted soft RLIMIT_NOFILE, particularly if
max_connections is set to a higher value.

In the current version of the AIO patchset, the creation of those io_uring
instances does happen as part of an shmem init callback, as the io uring
creation also sets up queues visible in shmem. And shmem init callbacks are
currently happening *before* postmaster's set_max_safe_fds() call:

/*
* Set up shared memory and semaphores.
*
* Note: if using SysV shmem and/or semas, each postmaster startup will
* normally choose the same IPC keys. This helps ensure that we will
* clean up dead IPC objects if the postmaster crashes and is restarted.
*/
CreateSharedMemoryAndSemaphores();

/*
* Estimate number of openable files. This must happen after setting up
* semaphores, because on some platforms semaphores count as open files.
*/
set_max_safe_fds();

So the issue would actually be that we're currently doing set_max_safe_fds()
too late, not too early :/

Greetings,

Andres Freund

[1]: Initially I tried to avoid that, by sharing a smaller number of
io_uring instances across backends. Making that work was a fair bit of
code *and* was considerably slower, due to now needing a lock around
submission of IOs. Moving to one io_uring instance per backend fairly
dramatically simplified the code while also speeding it up.

#11Tomas Vondra
tomas@vondra.me
In reply to: Tom Lane (#7)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On 2/11/25 21:18, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

I did run into bottlenecks due to "too few file descriptors" during
recent experiments with partitioning, which made it pretty trivial to
get into a situation where we start thrashing the VfdCache. I have a
half-written draft of a blog post about that somewhere.

But my conclusion was that it's damn difficult to even realize that's
happening, especially if you don't have access to the OS / perf, etc.

Yeah. fd.c does its level best to keep going even with only a few FDs
available, and it's hard to tell that you have a performance problem
arising from that. (Although I recall old war stories about Postgres
continuing to chug along just fine after it'd run the kernel out of
FDs, although every other service on the system was crashing left and
right, making it difficult e.g. even to log in. That scenario is why
I'm resistant to pushing our allowed number of FDs to the moon...)

So
my takeaway was we should improve that first, so that people have a
chance to realize they have this issue, and can do the tuning. The
improvements I thought about were:

- track hits/misses for the VfdCache (and add a system view for that)

I think what we actually would like to know is how often we have to
close an open FD in order to make room to open a different file.
Maybe that's the same thing you mean by "cache miss", but it doesn't
seem like quite the right terminology. Anyway, +1 for adding some way
to discover how often that's happening.

We can count the evictions (i.e. closing a file so that we can open a
new one) too, but AFAICS that's about the same as counting "misses"
(opening a file after not finding it in the cache). After the cache
warms up, those counts should be about the same, I think.

Or am I missing something?

- maybe have wait event for opening/closing file descriptors

Not clear that that helps, at least for this specific issue.

I don't think Jelte described any specific issue, but the symptoms I've
observed were that a query was accessing a table with ~1000 relations
(partitions + indexes), thrashing the vfd cache, getting ~0% cache hits.
And the open/close calls were taking a lot of time (~25% CPU time).
That'd be very visible as a wait event, I believe.

- show max_safe_fds value somewhere, not just max_files_per_process
(which we may silently override and use a lower value)

Maybe we should just assign max_safe_fds back to max_files_per_process
after running set_max_safe_fds? The existence of two variables is a
bit confusing anyhow. I vaguely recall that we had a reason for
keeping them separate, but I can't think of the reasoning now.

That might work. I don't know what the reasons for not doing that were,
but I suppose there were some.

regards

--
Tomas Vondra

#12Tomas Vondra
tomas@vondra.me
In reply to: Andres Freund (#8)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On 2/11/25 22:14, Andres Freund wrote:

Hi,

On 2025-02-11 21:04:25 +0100, Tomas Vondra wrote:

I agree the defaults may be pretty low for current systems, but do we
want to get into the business of picking a value and overriding whatever
value is set by the sysadmin? I don't think a high hard limit should be
seen as an implicit permission to just set it as the soft limit.

As documented in the links sent by Jelte, that's *explicitly* the reasoning
for the difference between default soft and hard limits. For safety,
programs should opt into using higher FD limits, rather than being
opted in.

OK, I guess I was mistaken in how I understood the hard/soft limits.

Imagine you're a sysadmin / DBA who picks a low soft limit (for whatever
reason - there may be other stuff running on the system, ...). And then
postgres starts and just feels like bumping the soft limit. Sure, the
sysadmin can lower the hard limit and then we'll respect that, but I don't
recall any other tool requiring this approach, and it would definitely be
quite surprising to me.

https://codesearch.debian.net/search?q=setrlimit&literal=1

In a quick skim I found that at least gimp, openjdk, libreoffice, gcc, samba,
dbus increase the soft limit to something closer to the hard limit. And that
was at page 62 out of 1269.

Ack

I did run into bottlenecks due to "too few file descriptors" during
recent experiments with partitioning, which made it pretty trivial to
get into a situation where we start thrashing the VfdCache. I have a
half-written draft of a blog post about that somewhere.

But my conclusion was that it's damn difficult to even realize that's
happening, especially if you don't have access to the OS / perf, etc. So my
takeaway was we should improve that first, so that people have a chance to
realize they have this issue, and can do the tuning. The improvements I
thought about were:

Hm, that seems something orthogonal to me. I'm on board with that suggestion,
but I don't see why that should stop us from having code to adjust the rlimit.

Right, it is somewhat orthogonal.

I was mentioning that mostly in the context of tuning the parameters we
already have (because how would you know you have this problem /
what would be a good value to set). I'm not demanding that we do nothing
until the monitoring bits get implemented, but being able to monitor
this seems pretty useful even if we adjust the soft limit based on some
heuristic.

My suggestion would be to redefine max_files_per_process as the number of
files we try to be able to open in backends. I.e. set_max_safe_fds() would
first count the number of already open fds (since those will largely be
inherited by child processes) and then check if we can open up to
max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
necessary.

That way we don't increase the number of FDs we use in the default
configuration drastically, but increasing the number only requires a config
change, it doesn't require also figuring out how to increase the settings in
whatever starts postgres. Which, e.g. in cloud environments, typically won't
be possible.

Seems reasonable, +1 to this. But that just makes the monitoring bits
more important, because how would you know you need to increase the GUC?

regards

--
Tomas Vondra

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#10)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Andres Freund <andres@anarazel.de> writes:

In the current version of the AIO patchset, the creation of those io_uring
instances does happen as part of an shmem init callback, as the io uring
creation also sets up queues visible in shmem.

Hmm.

So the issue would actually be that we're currently doing set_max_safe_fds()
too late, not too early :/

Well, we'd rather set_max_safe_fds happen after semaphore creation,
so that it doesn't have to be explicitly aware of whether semaphores
consume FDs. Could we have it be aware of how many FDs *will be*
needed for io_uring, but postpone creation of those until after we
jack up RLIMIT_NOFILE?

I guess the other way would be to have two rounds of RLIMIT_NOFILE
adjustment, before and after shmem creation. That seems ugly but
shouldn't be very time-consuming.

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#11)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Tomas Vondra <tomas@vondra.me> writes:

On 2/11/25 21:18, Tom Lane wrote:

I think what we actually would like to know is how often we have to
close an open FD in order to make room to open a different file.
Maybe that's the same thing you mean by "cache miss", but it doesn't
seem like quite the right terminology. Anyway, +1 for adding some way
to discover how often that's happening.

We can count the evictions (i.e. closing a file so that we can open a
new one) too, but AFAICS that's about the same as counting "misses"
(opening a file after not finding it in the cache). After the cache
warms up, those counts should be about the same, I think.

Umm ... only if the set of files you want access to is quite static,
which doesn't seem like a great bet in the presence of temporary
tables and such. I think if we don't explicitly count evictions
then we'll be presenting misleading results.

regards, tom lane

#15Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Hi,

On 2025-02-11 17:55:39 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

So the issue would actually be that we're currently doing set_max_safe_fds()
too late, not too early :/

Well, we'd rather set_max_safe_fds happen after semaphore creation,
so that it doesn't have to be explicitly aware of whether semaphores
consume FDs. Could we have it be aware of how many FDs *will be*
needed for io_uring, but postpone creation of those until after we
jack up RLIMIT_NOFILE?

Yes, that should be fairly trivial to do. It'd require a call from postmaster
into the aio subsystem, after set_max_safe_fds() is done, but I think that'd
be ok. I'd be inclined to not make that a general mechanism for now.

I guess the other way would be to have two rounds of RLIMIT_NOFILE
adjustment, before and after shmem creation. That seems ugly but
shouldn't be very time-consuming.

Yea, something like that could also work.

At some point I was playing with calling AcquireExternalFD() the appropriate
number of times during shmem reservation. That turned out to work badly right
now, because it relies on max_safe_fds() being set *and* insists on
numExternalFDs < max_safe_fds / 3.

One way to deal with this would be for AcquireExternalFD() to first increase
the rlimit, if necessary and possible, then start to close LRU files and only
then fail.

Arguably that'd have the advantage of e.g. postgres_fdw not stomping quite so
hard on VFDs, because we'd first increase the limit.

Of course it'd also mean we'd increase the limit more often. But I don't think
it's particularly expensive.

Greetings,

Andres Freund

#16Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#9)
3 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Tue, 11 Feb 2025 at 22:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

My suggestion would be to redefine max_files_per_process as the number of
files we try to be able to open in backends. I.e. set_max_safe_fds() would
first count the number of already open fds (since those will largely be
inherited by child processes) and then check if we can open up to
max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
necessary.

Seems plausible. IIRC we also want 10 or so FDs available as "slop"
for code that doesn't go through fd.c.

Attached is a patchset that does this. I split off the pgbench change,
which is in the first patch. The change to postmaster is in the
second. And the 3rd patch is a small follow up to make it easier to
notice that max_safe_fds is lower than intended.

Attachments:

v2-0001-Bump-pgbench-soft-open-file-limit-RLIMIT_NOFILE-t.patch (text/x-patch)
From 8e964db585989734a5f6c1449ffb4c62e1190a6a Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 22:04:44 +0100
Subject: [PATCH v2 1/3] Bump pgbench soft open file limit (RLIMIT_NOFILE) to
 hard limit

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason that this hasn't changed
is because doing so would break legacy programs that use the select(2)
system call in hard to debug ways. So instead programs that want to
opt-in to a higher open file limit are expected to bump their soft limit
to their hard limit on startup. Details on this are very well explained
in a blogpost by the systemd author[1].

This starts bumping pgbench's soft open file limit to the hard open
file limit. This makes sure users are not told to change their ulimit,
and then retry. Instead we now do that for them automatically.

[1]: https://0pointer.net/blog/file-descriptor-limits.html
---
 src/bin/pgbench/pgbench.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5e1fcf59c61..ea7bf980bea 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6815,6 +6815,15 @@ main(int argc, char **argv)
 #ifdef HAVE_GETRLIMIT
 				if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
 					pg_fatal("getrlimit failed: %m");
+
+				/*
+				 * Bump the soft limit to the hard limit to not run into low
+				 * file limits.
+				 */
+				rlim.rlim_cur = rlim.rlim_max;
+				if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
+					pg_fatal("setrlimit failed: %m");
+
 				if (rlim.rlim_cur < nclients + 3)
 				{
 					pg_log_error("need at least %d open files, but system limit is %ld",

base-commit: fd602f29c19d4f483f54d93abe240c12219d9f51
-- 
2.43.0

v2-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch (text/x-patch)
From e84092e3db7712db08dfc5f8824f692a53a1aa9e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v2 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is because doing so would break
legacy programs that use the select(2) system call in hard to debug
ways. So instead programs that want to opt-in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds; previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural for users to understand, because now the limit on the number of
files they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at AIO
there is a large number of reports from people that require changing
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not possible
for users to change the soft limit, because they don't control the way
that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shelling out to other executables. This is done as a
precaution in case those executables are using select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 doc/src/sgml/config.sgml                 |  17 ++--
 src/backend/access/transam/xlogarchive.c |   4 +
 src/backend/archive/shell_archive.c      |   2 +
 src/backend/storage/file/fd.c            | 111 ++++++++++++++++++++---
 src/include/storage/fd.h                 |   2 +
 5 files changed, 116 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5e4f201e099..4886a02d47e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2352,15 +2352,14 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Sets the maximum number of simultaneously open files allowed to each
-        server subprocess. The default is one thousand files. If the kernel is enforcing
-        a safe per-process limit, you don't need to worry about this setting.
-        But on some platforms (notably, most BSD systems), the kernel will
-        allow individual processes to open many more files than the system
-        can actually support if many processes all try to open
-        that many files. If you find yourself seeing <quote>Too many open
-        files</quote> failures, try reducing this setting.
-        This parameter can only be set at server start.
+        Sets the maximum number of files each server subprocess is allowed to
+        open. The default is one thousand files. If the kernel is enforcing a
+        lower soft limit, Postgres will automatically increase it if the hard
+        limit allows that. Setting this value too high can cause resource
+        exhaustion problems: on some platforms (notably, most BSD systems), the
+        kernel will allow individual processes to open many more files than the
+        system can actually support if many processes all try to open that many
+        files. This parameter can only be set at server start.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..bf17426039e 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,6 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
 
@@ -180,6 +181,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 
 	pgstat_report_wait_end();
 	pfree(xlogRestoreCmd);
+	RestoreCustomOpenFileLimit();
 
 	if (rc == 0)
 	{
@@ -325,10 +327,12 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 	rc = system(xlogRecoveryCmd);
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 
 	pfree(xlogRecoveryCmd);
 
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..c16f354a118 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -75,10 +75,12 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
 	rc = system(xlogarchcmd);
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 
 	if (rc != 0)
 	{
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e454db4c020..04fb93be56d 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -157,6 +157,12 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -945,6 +951,82 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+#ifdef HAVE_GETRLIMIT
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+	struct rlimit rlim = custom_max_open_files;
+
+	/* If we're already at the hard limit, we can't go any higher */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+}
+#endif
+
+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * 		was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * RestoreCustomOpenFileLimit --- Restores our custom open file limit after a
+ * 		previous call to RestoreOriginalOpenFileLimit.
+ */
+void
+RestoreCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -969,7 +1051,6 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			j;
 
 #ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
 	int			getrlimit_status;
 #endif
 
@@ -977,9 +1058,11 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	fd = (int *) palloc(size * sizeof(int));
 
 #ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
+	getrlimit_status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
 	if (getrlimit_status != 0)
 		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+	else
+		custom_max_open_files = original_max_open_files;
 #endif							/* HAVE_GETRLIMIT */
 
 	/* dup until failure or probe limit reached */
@@ -993,13 +1076,21 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
+		if (getrlimit_status == 0 && highestfd >= custom_max_open_files.rlim_cur - 1)
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 #endif
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+#ifdef HAVE_GETRLIMIT
+			if (errno == ENFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+#endif
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -1053,15 +1144,10 @@ set_max_safe_fds(void)
 	 * or the experimentally-determined EMFILE limit.
 	 *----------
 	 */
-	count_usable_fds(max_files_per_process,
+	count_usable_fds(max_files_per_process + NUM_RESERVED_FDS,
 					 &usable_fds, &already_open);
 
-	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);
-
-	/*
-	 * Take off the FDs reserved for system() etc.
-	 */
-	max_safe_fds -= NUM_RESERVED_FDS;
+	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
 	/*
 	 * Make sure we still have enough to get by.
@@ -1078,6 +1164,7 @@ set_max_safe_fds(void)
 		 max_safe_fds, usable_fds, already_open);
 }
 
+
 /*
  * Open a file with BasicOpenFilePerm() and pass default file mode for the
  * fileMode parameter.
@@ -2724,6 +2811,7 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
@@ -2731,6 +2819,7 @@ TryAgain:
 	save_errno = errno;
 	pqsignal(SIGPIPE, SIG_IGN);
 	errno = save_errno;
+	RestoreCustomOpenFileLimit();
 	if (file != NULL)
 	{
 		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..d24e7f1c8df 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -177,6 +177,8 @@ extern void RemovePgTempFiles(void);
 extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
 								   bool unlink_all);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void RestoreOriginalOpenFileLimit(void);
+extern void RestoreCustomOpenFileLimit(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.43.0

v2-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch (text/x-patch)
From 44c196025cbf57a09f6aa1bc70646ed2537d9654 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v2 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 04fb93be56d..05c0792e4e1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1149,6 +1149,9 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
+	/* Update GUC variable to allow users to see the result */
+	max_files_per_process = max_safe_fds;
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */
-- 
2.43.0

#17Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema-Nio (#16)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Hi,

Thanks for these patches!

On 2025-02-12 22:52:52 +0100, Jelte Fennema-Nio wrote:

From 8e964db585989734a5f6c1449ffb4c62e1190a6a Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 22:04:44 +0100
Subject: [PATCH v2 1/3] Bump pgbench soft open file limit (RLIMIT_NOFILE) to
hard limit

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason that this hasn't changed
is because doing so would break legacy programs that use the select(2)
system call in hard to debug ways. So instead programs that want to
opt-in to a higher open file limit are expected to bump their soft limit
to their hard limit on startup. Details on this are very well explained
in a blogpost by the systemd author[1].

This starts bumping pgbench's soft open file limit to the hard open
file limit. This makes sure users are not told to change their ulimit,
and then retry. Instead we now do that for them automatically.

[1]: https://0pointer.net/blog/file-descriptor-limits.html
---
src/bin/pgbench/pgbench.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5e1fcf59c61..ea7bf980bea 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6815,6 +6815,15 @@ main(int argc, char **argv)
#ifdef HAVE_GETRLIMIT
if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
pg_fatal("getrlimit failed: %m");
+
+				/*
+				 * Bump the soft limit to the hard limit to not run into low
+				 * file limits.
+				 */
+				rlim.rlim_cur = rlim.rlim_max;
+				if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
+					pg_fatal("setrlimit failed: %m");
+
if (rlim.rlim_cur < nclients + 3)
{
pg_log_error("need at least %d open files, but system limit is %ld",

Why not do this only in the if (rlim.rlim_cur < nclients + 3) case?

Other than this I think we should just apply this.

From e84092e3db7712db08dfc5f8824f692a53a1aa9e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v2 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is because doing so would break
legacy programs that use the select(2) system call in hard to debug
ways. So instead programs that want to opt-in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds; previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural for users to understand, because now the limit on the number of
files they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at AIO
there is a large number of reports from people that require changing
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not possible
for users to change the soft limit, because they don't control the way
that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shelling out to other executables. This is done as a
precaution in case those executables are using select(2).

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..bf17426039e 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,6 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
(errmsg_internal("executing restore command \"%s\"",
xlogRestoreCmd)));

+ RestoreOriginalOpenFileLimit();
fflush(NULL);
pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);

I think it might be better to have a precursor commit that wraps system(),
including the fflush(), pgstat_report_wait_start(), pgstat_report_wait_end().

It seems a bit silly to have a dance of 5 function calls that are repeated in
three places.

+#ifdef HAVE_GETRLIMIT
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{

Why is this a relative instead of an absolute amount?

+	struct rlimit rlim = custom_max_open_files;
+
+	/* If we're already at the max we reached our limit */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}

Is a WARNING really appropriate here? I guess it's just copied from elsewhere
in the file, but still...

+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * 		was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}

Hm. Does this actually work if we currently have more than
original_max_open_files.rlim_cur files open?

It'd be fine to do the system() with more files open, because it'll internally
do exec(), but of course setrlimit() doesn't know that ...

/*
* count_usable_fds --- count how many FDs the system will let us open,
* and estimate how many are already open.
@@ -969,7 +1051,6 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
int j;

#ifdef HAVE_GETRLIMIT
- struct rlimit rlim;
int getrlimit_status;
#endif

All these ifdefs make this function rather hard to read. It was kinda bad
before, but it does get even worse with this patch. Not really sure what to
do about that though.

@@ -977,9 +1058,11 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
fd = (int *) palloc(size * sizeof(int));

#ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
+	getrlimit_status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
if (getrlimit_status != 0)
ereport(WARNING, (errmsg("getrlimit failed: %m")));
+	else
+		custom_max_open_files = original_max_open_files;
#endif							/* HAVE_GETRLIMIT */

We were discussing calling set_max_fds() twice to deal with io_uring FDs
requiring a higher FD limit than before. With the patch as-is, we'd overwrite
our original original_max_open_files in that case...

/* dup until failure or probe limit reached */
@@ -993,13 +1076,21 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
* don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
* some platforms
*/
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
+		if (getrlimit_status == 0 && highestfd >= custom_max_open_files.rlim_cur - 1)
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
#endif
thisfd = dup(2);
if (thisfd < 0)
{
+#ifdef HAVE_GETRLIMIT
+			if (errno == ENFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+#endif
+

Hm. When is this second case here reachable and useful? Shouldn't we already
have increased the limit before getting here? If so IncreaseOpenFileLimit()
won't help, no?

@@ -1078,6 +1164,7 @@ set_max_safe_fds(void)
max_safe_fds, usable_fds, already_open);
}

+
/*
* Open a file with BasicOpenFilePerm() and pass default file mode for the
* fileMode parameter.

Unrelated change.

From 44c196025cbf57a09f6aa1bc70646ed2537d9654 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v2 3/3] Reflect the value of max_safe_fds in
max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
src/backend/storage/file/fd.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 04fb93be56d..05c0792e4e1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1149,6 +1149,9 @@ set_max_safe_fds(void)

max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);

+	/* Update GUC variable to allow users to see the result */
+	max_files_per_process = max_safe_fds;

If we want to reflect the value, shouldn't it show up as PGC_S_OVERRIDE or
such?

Greetings,

Andres Freund

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#17)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Andres Freund <andres@anarazel.de> writes:

On 2025-02-12 22:52:52 +0100, Jelte Fennema-Nio wrote:

+				/*
+				 * Bump the soft limit to the hard limit to not run into low
+				 * file limits.
+				 */
+				rlim.rlim_cur = rlim.rlim_max;
+				if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
+					pg_fatal("setrlimit failed: %m");
+
if (rlim.rlim_cur < nclients + 3)

Why not do this only in the if (rlim.rlim_cur < nclients + 3) case?

+1, otherwise you're introducing a potential failure mode for nothing.
It'd take a couple extra lines to deal with the scenario where
rlim_max is still too small, but that seems fine.

Other than this I think we should just apply this.

Agreed on the pgbench change. I don't have an opinion yet on the
server changes.

regards, tom lane

#19Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andres Freund (#17)
3 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Mon, 17 Feb 2025 at 18:24, Andres Freund <andres@anarazel.de> wrote:

Why not do this only in the if (rlim.rlim_cur < nclients + 3) case?

Done. I also changed it to only bump to nclients + 3, rather than to
rlim_max. I'll update the rest of the patches later, but responses are
below.

I think it might be better to have a precursor commit that wraps system(),
including the fflush(), pgstat_report_wait_start(), pgstat_report_wait_end().

It seems a bit silly to have a dance of 5 function calls that are repeated in
three places.

Fair enough, I'll update that in the next version.

+#ifdef HAVE_GETRLIMIT
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{

Why is this a relative instead of an absolute amount?

I don't feel strongly about this, but it seemed nicer to encapsulate
that. My callers (and I expect the io_uring one too) all had the
relative amount. Seemed a bit silly to always require the caller to
add custom_max_open_files.rlim_cur to that, especially since you might
want to call IncreaseOpenFileLimit from another file for your io_uring
stuff, which would mean also making custom_max_open_files.rlim_cur
non-static. Or of course you'd have to query getrlimit again.

+     if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+     {
+             ereport(WARNING, (errmsg("setrlimit failed: %m")));
+             return false;
+     }

Is a WARNING really appropriate here? I guess it's just copied from elsewhere
in the file, but still...

I mean this shouldn't ever fail. We checked beforehand if we are
allowed to increase it this far. So if setrlimit fails then there's a
bug somewhere. That made me think it was sensible to report a warning,
so at least people can see something strange is going on.

+     if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+     {
+             ereport(WARNING, (errmsg("setrlimit failed: %m")));
+     }
+#endif
+}

Hm. Does this actually work if we currently have more than
original_max_open_files.rlim_cur files open?

So according to at least the Linux and FreeBSD manpages, lowering the
limit below the number of currently open files is not a failure mode.
- https://linux.die.net/man/2/setrlimit
- https://man.freebsd.org/cgi/man.cgi?query=setrlimit

I also tried this out manually on my Linux machine, by starting
postgres with only a soft limit of 15 (lower than that and it wouldn't
start). I was then able to change the limits without issue when
calling setrlimit or system.

... Sadly that turned out not to be true when calling popen in
OpenPipeStream (of course), because that will actually open a file in
the postgres process too, and at that point you're already over your
file limit. I can see two ways around that:
1. Writing a custom version of popen, where we lower the limit just
before we call exec in the forked process.
2. Don't care about restoring the original limit for OpenPipeStream

My suggestion would be to go for 2: It doesn't seem worth the
complexity of having a custom popen implementation, just to be able to
handle buggy programs (i.e. ones that cannot handle more than 1024
files, while still trying to open more than that amount of files)

It'd be fine to do the system() with more files open, because it'll internally
do exec(), but of course setrlimit() doesn't know that ...

I'm not sure I understand what you mean here.

All these ifdefs make this function rather hard to read. It was kinda bad
before, but it does get even worse with this patch. Not really sure what to
do about that though.

Agreed, I'll see if I can figure out a way to restructure it a bit.

We were discussing calling set_max_fds() twice to deal with io_uring FDs
requiring a higher FD limit than before. With the patch as-is, we'd overwrite
our original original_max_open_files in that case...

That should be very easy to change if that's indeed how we want to
bump the rlimit for the io_uring file descriptors. But it didn't seem
like you were sure which of the two options you liked best.

Hm. When is this second case here reachable and useful? Shouldn't we already
have increased the limit before getting here? If so IncreaseOpenFileLimit()
won't help, no?

While it's a very unlikely edge case, it's theoretically possible
(afaict) that the maximum number of open files is already open. So
highestfd will then be 0 (because it's the first loop), and we'll only
find out that we're at the limit due to getting ENFILE.

I'll add a code comment about this, because it's indeed not obvious.

@@ -1078,6 +1164,7 @@ set_max_safe_fds(void)
max_safe_fds, usable_fds, already_open);
}

+
/*
* Open a file with BasicOpenFilePerm() and pass default file mode for the
* fileMode parameter.

Unrelated change.

Ugh, yeah I guess I added the new functions there originally, but
moved them around and didn't clean up the added whitespace.

If we want to reflect the value, shouldn't it show up as PGC_S_OVERRIDE or
such?

Makes sense

Attachments:

v3-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch
From 761c3b0a161eb9d680354c9c20a872091ce224be Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v3 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 04fb93be56d..05c0792e4e1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1149,6 +1149,9 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
+	/* Update GUC variable to allow users to see the result */
+	max_files_per_process = max_safe_fds;
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */
-- 
2.43.0

v3-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch
From f3ed3a3484e1d3dcfea3d7f94cd548533a84995d Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v3 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is because doing so would break
legacy programs that use the select(2) system call in hard to debug
ways. So instead programs that want to opt-in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds, previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural to understand for users, because now the limit of files that
they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at AIO
there is a large number of reports from people that require changing
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not possible
for users to change the soft limit, because they don't control the way
that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shell-ing out to other executables. This is done as a
precaution in case those executables are using select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 doc/src/sgml/config.sgml                 |  17 ++--
 src/backend/access/transam/xlogarchive.c |   4 +
 src/backend/archive/shell_archive.c      |   2 +
 src/backend/storage/file/fd.c            | 111 ++++++++++++++++++++---
 src/include/storage/fd.h                 |   2 +
 5 files changed, 116 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 336630ce417..6b4bf78a9fc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2352,15 +2352,14 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Sets the maximum number of simultaneously open files allowed to each
-        server subprocess. The default is one thousand files. If the kernel is enforcing
-        a safe per-process limit, you don't need to worry about this setting.
-        But on some platforms (notably, most BSD systems), the kernel will
-        allow individual processes to open many more files than the system
-        can actually support if many processes all try to open
-        that many files. If you find yourself seeing <quote>Too many open
-        files</quote> failures, try reducing this setting.
-        This parameter can only be set at server start.
+        Sets the maximum number of files each server subprocess is allowed
+        to open. The default is one thousand files. If the kernel is enforcing
+        a lower soft limit, Postgres will automatically increase it if the hard
+        limit allows that. Setting this value too high can cause resource
+        exhaustion problems: On some platforms (notably, most BSD systems), the
+        kernel will allow individual processes to open many more files than the
+        system can actually support if many processes all try to open that many
+        files. This parameter can only be set at server start.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..bf17426039e 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,6 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
 
@@ -180,6 +181,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 
 	pgstat_report_wait_end();
 	pfree(xlogRestoreCmd);
+	RestoreCustomOpenFileLimit();
 
 	if (rc == 0)
 	{
@@ -325,10 +327,12 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 	rc = system(xlogRecoveryCmd);
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 
 	pfree(xlogRecoveryCmd);
 
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..c16f354a118 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -75,10 +75,12 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
 	rc = system(xlogarchcmd);
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 
 	if (rc != 0)
 	{
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e454db4c020..04fb93be56d 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -157,6 +157,12 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -945,6 +951,82 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+#ifdef HAVE_GETRLIMIT
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+	struct rlimit rlim = custom_max_open_files;
+
+	/* If we're already at the max we reached our limit */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+}
+#endif
+
+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * 		was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * RestoreCustomOpenFileLimit --- Restores our custom open file limit after a
+ * 		previous call to RestoreOriginalOpenFileLimit.
+ */
+void
+RestoreCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -969,7 +1051,6 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			j;
 
 #ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
 	int			getrlimit_status;
 #endif
 
@@ -977,9 +1058,11 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	fd = (int *) palloc(size * sizeof(int));
 
 #ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
+	getrlimit_status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
 	if (getrlimit_status != 0)
 		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+	else
+		custom_max_open_files = original_max_open_files;
 #endif							/* HAVE_GETRLIMIT */
 
 	/* dup until failure or probe limit reached */
@@ -993,13 +1076,21 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
+		if (getrlimit_status == 0 && highestfd >= custom_max_open_files.rlim_cur - 1)
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 #endif
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+#ifdef HAVE_GETRLIMIT
+			if (errno == ENFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+#endif
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -1053,15 +1144,10 @@ set_max_safe_fds(void)
 	 * or the experimentally-determined EMFILE limit.
 	 *----------
 	 */
-	count_usable_fds(max_files_per_process,
+	count_usable_fds(max_files_per_process + NUM_RESERVED_FDS,
 					 &usable_fds, &already_open);
 
-	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);
-
-	/*
-	 * Take off the FDs reserved for system() etc.
-	 */
-	max_safe_fds -= NUM_RESERVED_FDS;
+	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
 	/*
 	 * Make sure we still have enough to get by.
@@ -1078,6 +1164,7 @@ set_max_safe_fds(void)
 		 max_safe_fds, usable_fds, already_open);
 }
 
+
 /*
  * Open a file with BasicOpenFilePerm() and pass default file mode for the
  * fileMode parameter.
@@ -2724,6 +2811,7 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
@@ -2731,6 +2819,7 @@ TryAgain:
 	save_errno = errno;
 	pqsignal(SIGPIPE, SIG_IGN);
 	errno = save_errno;
+	RestoreCustomOpenFileLimit();
 	if (file != NULL)
 	{
 		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..d24e7f1c8df 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -177,6 +177,8 @@ extern void RemovePgTempFiles(void);
 extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
 								   bool unlink_all);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void RestoreOriginalOpenFileLimit(void);
+extern void RestoreCustomOpenFileLimit(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.43.0

v3-0001-Bump-pgbench-soft-open-file-limit-RLIMIT_NOFILE-t.patch
From fe11e75d913ac320f11ac2250c94fa58b118b2e1 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 22:04:44 +0100
Subject: [PATCH v3 1/3] Bump pgbench soft open file limit (RLIMIT_NOFILE) to
 hard limit

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason that this hasn't changed
is because doing so would break legacy programs that use the select(2)
system call in hard to debug ways. So instead programs that want to
opt-in to a higher open file limit are expected to bump their soft limit
to their hard limit on startup. Details on this are very well explained
in a blogpost by the systemd author[1].

This starts bumping pgbench's soft open file limit to the hard open
file limit. This makes sure users are not told to change their ulimit,
and then retry. Instead we now do that for them automatically.

[1]: https://0pointer.net/blog/file-descriptor-limits.html
---
 src/bin/pgbench/pgbench.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5e1fcf59c61..8f53bf50ab1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6815,13 +6815,26 @@ main(int argc, char **argv)
 #ifdef HAVE_GETRLIMIT
 				if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
 					pg_fatal("getrlimit failed: %m");
-				if (rlim.rlim_cur < nclients + 3)
+
+				if (rlim.rlim_max < nclients + 3)
 				{
 					pg_log_error("need at least %d open files, but system limit is %ld",
-								 nclients + 3, (long) rlim.rlim_cur);
+								 nclients + 3, (long) rlim.rlim_max);
 					pg_log_error_hint("Reduce number of clients, or use limit/ulimit to increase the system limit.");
 					exit(1);
 				}
+
+				if (rlim.rlim_cur < nclients + 3)
+				{
+					rlim.rlim_cur = nclients + 3;
+					if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
+					{
+						pg_log_error("need at least %d open files, but couldn't raise the limit",
+									 nclients + 3);
+						pg_log_error_hint("Reduce number of clients, or use limit/ulimit to increase the system limit.");
+						exit(1);
+					}
+				}
 #endif							/* HAVE_GETRLIMIT */
 				break;
 			case 'C':

base-commit: c407d5426b877b41be87f9e679e321bb2c42e47d
-- 
2.43.0

#20Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema-Nio (#19)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Hi,

On 2025-02-17 21:25:33 +0100, Jelte Fennema-Nio wrote:

On Mon, 17 Feb 2025 at 18:24, Andres Freund <andres@anarazel.de> wrote:

Why not do this only in the if (rlim.rlim_cur < nclients + 3) case?

Done, I also changed it to not bump to rlim_max, but only to nclients
+ 3. The rest of the patches I'll update later. But response below.

I've pushed this, with one trivial modification: I added %m to the error
message in case setrlimit() fails. That's really unlikely to ever happen, but
...

Greetings,

Andres Freund

#21Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andres Freund (#20)
3 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Thu Feb 20, 2025 at 1:37 AM CET, Andres Freund wrote:

I've pushed this, with one trivial modification: I added %m to the error
message in case setrlimit() fails. That's really unlikely to ever happen, but
...

Great! Attached are the updated other patches, I think I addressed all
feedback.

Attachments:

v4-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch
From ec72e05e87c73dee3de73f9a6586e8e8db2d919e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Sun, 23 Feb 2025 16:52:29 +0100
Subject: [PATCH v4 1/3] Adds a helper for places that shell out to system()

We need to call system() in a few places, and to do so safely we need
some pre/post-fork logic. This encapsulates that logic into a single
System helper function. The main reason to do this is that in a follow
up commit we want to add even more logic pre and post-fork.
---
 src/backend/access/transam/xlogarchive.c | 28 +-----------
 src/backend/archive/shell_archive.c      |  6 +--
 src/backend/postmaster/startup.c         | 54 +++++++++++++++++-------
 src/include/postmaster/startup.h         |  3 +-
 4 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..59a0f191339 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,27 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-
-	/*
-	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
-	 * that it should proc_exit() right away.  This is done for the duration
-	 * of the system() call because there isn't a good way to break out while
-	 * it is executing.  Since we might call proc_exit() in a signal handler,
-	 * it is best to put any additional logic before or after the
-	 * PreRestoreCommand()/PostRestoreCommand() section.
-	 */
-	PreRestoreCommand();
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	rc = system(xlogRestoreCmd);
-
-	PostRestoreCommand();
-
-	pgstat_report_wait_end();
+	rc = System(xlogRestoreCmd, WAIT_EVENT_RESTORE_COMMAND);
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
@@ -325,11 +305,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(wait_event_info);
-	rc = system(xlogRecoveryCmd);
-	pgstat_report_wait_end();
-
+	rc = System(xlogRecoveryCmd, wait_event_info);
 	pfree(xlogRecoveryCmd);
 
 	if (rc != 0)
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..7977e5a5092 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -22,6 +22,7 @@
 #include "archive/shell_archive.h"
 #include "common/percentrepl.h"
 #include "pgstat.h"
+#include "postmaster/startup.h"
 
 static bool shell_archive_configured(ArchiveModuleState *state);
 static bool shell_archive_file(ArchiveModuleState *state,
@@ -75,10 +76,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
-	rc = system(xlogarchcmd);
-	pgstat_report_wait_end();
+	rc = System(xlogarchcmd, WAIT_EVENT_ARCHIVE_COMMAND);
 
 	if (rc != 0)
 	{
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 88eab3d0baf..b0cd40c081e 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -33,6 +33,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
+#include "utils/wait_event_types.h"
 
 
 #ifndef USE_POSTMASTER_DEATH_SIGNAL
@@ -264,24 +265,45 @@ StartupProcessMain(const void *startup_data, size_t startup_data_len)
 	proc_exit(0);
 }
 
-void
-PreRestoreCommand(void)
+/*
+ * A custom wrapper around the system() function that calls the necessary
+ * functions pre/post-fork.
+ */
+int
+System(const char *command, uint32 wait_event_info)
 {
-	/*
-	 * Set in_restore_command to tell the signal handler that we should exit
-	 * right away on SIGTERM. We know that we're at a safe point to do that.
-	 * Check if we had already received the signal, so that we don't miss a
-	 * shutdown request received just before this.
-	 */
-	in_restore_command = true;
-	if (shutdown_requested)
-		proc_exit(1);
-}
+	int			rc;
 
-void
-PostRestoreCommand(void)
-{
-	in_restore_command = false;
+	fflush(NULL);
+	pgstat_report_wait_start(wait_event_info);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+	{
+		/*
+		 * Set in_restore_command to tell the signal handler that we should
+		 * exit right away on SIGTERM. This is done for the duration of the
+		 * system() call because there isn't a good way to break out while it
+		 * is executing.  Since we might call proc_exit() in a signal handler,
+		 * it is best to put any additional logic outside of this section
+		 * where in_restore_command is set to true.
+		 */
+		in_restore_command = true;
+
+		/*
+		 * Also check if we had already received the signal, so that we don't
+		 * miss a shutdown request received just before this.
+		 */
+		if (shutdown_requested)
+			proc_exit(1);
+	}
+
+	rc = system(command);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+		in_restore_command = false;
+
+	pgstat_report_wait_end();
+	return rc;
 }
 
 bool
diff --git a/src/include/postmaster/startup.h b/src/include/postmaster/startup.h
index ae0f6347fc0..09f5dcd3d03 100644
--- a/src/include/postmaster/startup.h
+++ b/src/include/postmaster/startup.h
@@ -27,8 +27,7 @@ extern PGDLLIMPORT int log_startup_progress_interval;
 
 extern void HandleStartupProcInterrupts(void);
 extern void StartupProcessMain(const void *startup_data, size_t startup_data_len) pg_attribute_noreturn();
-extern void PreRestoreCommand(void);
-extern void PostRestoreCommand(void);
+extern int	System(const char *command, uint32 wait_event_info);
 extern bool IsPromoteSignaled(void);
 extern void ResetPromoteSignaled(void);
 

base-commit: 454c182f8542890d0e2eac85f70d9a254a34fce3
-- 
2.43.0

v4-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch
From ab826c9c4325860b11e02043c3d77ce015bdbc19 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v4 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is because doing so would break
legacy programs that use the select(2) system call in hard to debug
ways. So instead programs that want to opt-in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds, previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural to understand for users, because now the limit of files that
they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at AIO
there is a large number of reports from people that require changing
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not possible
for users to change the soft limit, because they don't control the way
that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shell-ing out to other executables. This is done as a
precaution in case those executables are using select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 doc/src/sgml/config.sgml         |  17 ++-
 src/backend/postmaster/startup.c |   3 +
 src/backend/storage/file/fd.c    | 208 +++++++++++++++++++++++++++----
 src/include/storage/fd.h         |   2 +
 4 files changed, 199 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 007746a4429..c69522be67a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2378,15 +2378,14 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Sets the maximum number of simultaneously open files allowed to each
-        server subprocess. The default is one thousand files. If the kernel is enforcing
-        a safe per-process limit, you don't need to worry about this setting.
-        But on some platforms (notably, most BSD systems), the kernel will
-        allow individual processes to open many more files than the system
-        can actually support if many processes all try to open
-        that many files. If you find yourself seeing <quote>Too many open
-        files</quote> failures, try reducing this setting.
-        This parameter can only be set at server start.
+        Sets the maximum number of files each server subprocess is allowed
+        to open. The default is one thousand files. If the kernel is enforcing
+        a lower soft limit, Postgres will automatically increase it if the hard
+        limit allows that. Setting this value too high can cause resource
+        exhaustion problems: On some platforms (notably, most BSD systems), the
+        kernel will allow individual processes to open many more files than the
+        system can actually support if many processes all try to open that many
+        files. This parameter can only be set at server start.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index b0cd40c081e..d3a5e2590ab 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -30,6 +30,7 @@
 #include "storage/pmsignal.h"
 #include "storage/procsignal.h"
 #include "storage/standby.h"
+#include "storage/fd.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
@@ -274,6 +275,7 @@ System(const char *command, uint32 wait_event_info)
 {
 	int			rc;
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 
@@ -303,6 +305,7 @@ System(const char *command, uint32 wait_event_info)
 		in_restore_command = false;
 
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 	return rc;
 }
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e454db4c020..9881c18c983 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -157,6 +157,13 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static bool saved_original_max_open_files;
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -945,6 +952,154 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+/*
+ * Returns true if the passed-in highestfd is the last one that we're allowed
+ * to open based on our current soft open file limit (RLIMIT_NOFILE).
+ */
+static bool
+IsOpenFileLimit(int highestfd)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	return highestfd >= custom_max_open_files.rlim_cur - 1;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Increases the soft open file limit (RLIMIT_NOFILE) by up to the requested
+ * amount. Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+#ifdef HAVE_GETRLIMIT
+	struct rlimit rlim;
+
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	rlim = custom_max_open_files;
+
+	/* If the soft limit already equals the hard limit, we can't increase it */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		/* We made sure not to exceed the hard limit, so this shouldn't fail */
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Saves the original open file limit (RLIMIT_NOFILE) the first time this
+ * is called. If called again it's a no-op.
+ *
+ * On failure a WARNING is emitted and the original limit is not saved.
+ */
+static void
+SaveOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	int			status;
+
+	if (saved_original_max_open_files)
+	{
+		/* Already saved, no need to do it again */
+		return;
+	}
+
+	status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
+	if (status != 0)
+	{
+		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+		return;
+	}
+
+	custom_max_open_files = original_max_open_files;
+	saved_original_max_open_files = true;
+	return;
+#else
+	return false;
+#endif
+}
+
+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * 		was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2),
+ * which can only handle file descriptors below FD_SETSIZE (usually 1024).
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * RestoreCustomOpenFileLimit --- Restores our custom open file limit after a
+ * 		previous call to RestoreOriginalOpenFileLimit.
+ */
+void
+RestoreCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -968,38 +1123,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			highestfd = 0;
 	int			j;
 
-#ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
-	int			getrlimit_status;
-#endif
-
 	size = 1024;
 	fd = (int *) palloc(size * sizeof(int));
 
-#ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
-	if (getrlimit_status != 0)
-		ereport(WARNING, (errmsg("getrlimit failed: %m")));
-#endif							/* HAVE_GETRLIMIT */
+	SaveOriginalOpenFileLimit();
 
 	/* dup until failure or probe limit reached */
 	for (;;)
 	{
 		int			thisfd;
 
-#ifdef HAVE_GETRLIMIT
-
 		/*
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
-#endif
+		if (IsOpenFileLimit(highestfd))
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+			/*
+			 * Even though we do the pre-check above, it's still possible that
+			 * the call to dup fails with EMFILE. This can happen if the last
+			 * file descriptor was already assigned to an "already open" file.
+			 * One example of this happening is if we're already at the soft
+			 * limit when we call count_usable_fds.
+			 */
+			if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -1053,15 +1209,10 @@ set_max_safe_fds(void)
 	 * or the experimentally-determined EMFILE limit.
 	 *----------
 	 */
-	count_usable_fds(max_files_per_process,
+	count_usable_fds(max_files_per_process + NUM_RESERVED_FDS,
 					 &usable_fds, &already_open);
 
-	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);
-
-	/*
-	 * Take off the FDs reserved for system() etc.
-	 */
-	max_safe_fds -= NUM_RESERVED_FDS;
+	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
 	/*
 	 * Make sure we still have enough to get by.
@@ -2724,6 +2875,19 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+
+	/*
+	 * It would be great if we could call RestoreOriginalOpenFileLimit here,
+	 * but since popen() also opens a file in the current process (this side
+	 * of the pipe) we cannot do so safely, because we might already have
+	 * many more files open than the original limit.
+	 *
+	 * The only way to address this would be to implement a custom popen()
+	 * that calls RestoreOriginalOpenFileLimit only around the actual fork
+	 * call, but that seems too much effort to handle the corner case where
+	 * this external command uses both select() and tries to open more files
+	 * than select() allows for.
+	 */
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..d24e7f1c8df 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -177,6 +177,8 @@ extern void RemovePgTempFiles(void);
 extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
 								   bool unlink_all);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void RestoreOriginalOpenFileLimit(void);
+extern void RestoreCustomOpenFileLimit(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.43.0

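The mechanism the patch above builds on is small enough to sketch standalone. Below is a minimal, hypothetical version using plain POSIX getrlimit()/setrlimit(); the function name is illustrative and this is not the patch's actual code, which additionally remembers the original limit so it can be restored later:

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical sketch: raise the soft RLIMIT_NOFILE up to the hard
 * limit, which an unprivileged process is always allowed to do. */
static int
bump_nofile_soft_limit(void)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
		return -1;				/* cannot even read the limit */

	if (rlim.rlim_cur == rlim.rlim_max)
		return 0;				/* already at the hard limit */

	rlim.rlim_cur = rlim.rlim_max;	/* soft := hard */
	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
		return -1;

	return 0;
}
```

Note that the actual patch only raises the limit on demand and in increments, rather than jumping straight to the hard limit as this sketch does.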
v4-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patchtext/x-patch; charset=utf-8; name=v4-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patchDownload
From f8a115dd8f32ccd33766caca0b982ea001f1eea1 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v4 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9881c18c983..a045dc71d2c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1200,6 +1200,7 @@ set_max_safe_fds(void)
 {
 	int			usable_fds;
 	int			already_open;
+	char	   *max_safe_fds_string;
 
 	/*----------
 	 * We want to set max_safe_fds to
@@ -1214,6 +1215,12 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
+	/* Update GUC variable to allow users to see the result */
+	max_safe_fds_string = psprintf("%d", max_safe_fds);
+	SetConfigOption("max_files_per_process", max_safe_fds_string,
+					PGC_POSTMASTER, PGC_S_OVERRIDE);
+	pfree(max_safe_fds_string);
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */
-- 
2.43.0

#22Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#21)
3 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Mon Feb 24, 2025 at 10:56 AM CET, Jelte Fennema-Nio wrote:

Great! Attached are the updated other patches, I think I addressed all
feedback.

Ugh, a compiler warning snuck in on Windows during some of my final
refactoring. Fixed now.

Attachments:

v5-0001-Adds-a-helper-for-places-that-shell-out-to-system.patchtext/x-patch; charset=utf-8; name=v5-0001-Adds-a-helper-for-places-that-shell-out-to-system.patchDownload
From ec72e05e87c73dee3de73f9a6586e8e8db2d919e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Sun, 23 Feb 2025 16:52:29 +0100
Subject: [PATCH v5 1/3] Adds a helper for places that shell out to system()

We need to call system() in a few places, and to do so safely we need
some pre/post-fork logic. This encapsulates that logic into a single
System helper function. The main reason to do this is that in a follow
up commit we want to add even more logic pre and post-fork.
---
 src/backend/access/transam/xlogarchive.c | 28 +-----------
 src/backend/archive/shell_archive.c      |  6 +--
 src/backend/postmaster/startup.c         | 54 +++++++++++++++++-------
 src/include/postmaster/startup.h         |  3 +-
 4 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..59a0f191339 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,27 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-
-	/*
-	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
-	 * that it should proc_exit() right away.  This is done for the duration
-	 * of the system() call because there isn't a good way to break out while
-	 * it is executing.  Since we might call proc_exit() in a signal handler,
-	 * it is best to put any additional logic before or after the
-	 * PreRestoreCommand()/PostRestoreCommand() section.
-	 */
-	PreRestoreCommand();
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	rc = system(xlogRestoreCmd);
-
-	PostRestoreCommand();
-
-	pgstat_report_wait_end();
+	rc = System(xlogRestoreCmd, WAIT_EVENT_RESTORE_COMMAND);
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
@@ -325,11 +305,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(wait_event_info);
-	rc = system(xlogRecoveryCmd);
-	pgstat_report_wait_end();
-
+	rc = System(xlogRecoveryCmd, wait_event_info);
 	pfree(xlogRecoveryCmd);
 
 	if (rc != 0)
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..7977e5a5092 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -22,6 +22,7 @@
 #include "archive/shell_archive.h"
 #include "common/percentrepl.h"
 #include "pgstat.h"
+#include "postmaster/startup.h"
 
 static bool shell_archive_configured(ArchiveModuleState *state);
 static bool shell_archive_file(ArchiveModuleState *state,
@@ -75,10 +76,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
-	rc = system(xlogarchcmd);
-	pgstat_report_wait_end();
+	rc = System(xlogarchcmd, WAIT_EVENT_ARCHIVE_COMMAND);
 
 	if (rc != 0)
 	{
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 88eab3d0baf..b0cd40c081e 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -33,6 +33,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
+#include "utils/wait_event_types.h"
 
 
 #ifndef USE_POSTMASTER_DEATH_SIGNAL
@@ -264,24 +265,45 @@ StartupProcessMain(const void *startup_data, size_t startup_data_len)
 	proc_exit(0);
 }
 
-void
-PreRestoreCommand(void)
+/*
+ * A custom wrapper around the system() function that calls the necessary
+ * functions pre/post-fork.
+ */
+int
+System(const char *command, uint32 wait_event_info)
 {
-	/*
-	 * Set in_restore_command to tell the signal handler that we should exit
-	 * right away on SIGTERM. We know that we're at a safe point to do that.
-	 * Check if we had already received the signal, so that we don't miss a
-	 * shutdown request received just before this.
-	 */
-	in_restore_command = true;
-	if (shutdown_requested)
-		proc_exit(1);
-}
+	int			rc;
 
-void
-PostRestoreCommand(void)
-{
-	in_restore_command = false;
+	fflush(NULL);
+	pgstat_report_wait_start(wait_event_info);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+	{
+		/*
+		 * Set in_restore_command to tell the signal handler that we should
+		 * exit right away on SIGTERM. This is done for the duration of the
+		 * system() call because there isn't a good way to break out while it
+		 * is executing.  Since we might call proc_exit() in a signal handler,
+		 * it is best to put any additional logic outside of this section
+		 * where in_restore_command is set to true.
+		 */
+		in_restore_command = true;
+
+		/*
+		 * Also check if we had already received the signal, so that we don't
+		 * miss a shutdown request received just before this.
+		 */
+		if (shutdown_requested)
+			proc_exit(1);
+	}
+
+	rc = system(command);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+		in_restore_command = false;
+
+	pgstat_report_wait_end();
+	return rc;
 }
 
 bool
diff --git a/src/include/postmaster/startup.h b/src/include/postmaster/startup.h
index ae0f6347fc0..09f5dcd3d03 100644
--- a/src/include/postmaster/startup.h
+++ b/src/include/postmaster/startup.h
@@ -27,8 +27,7 @@ extern PGDLLIMPORT int log_startup_progress_interval;
 
 extern void HandleStartupProcInterrupts(void);
 extern void StartupProcessMain(const void *startup_data, size_t startup_data_len) pg_attribute_noreturn();
-extern void PreRestoreCommand(void);
-extern void PostRestoreCommand(void);
+extern int	System(const char *command, uint32 wait_event_info);
 extern bool IsPromoteSignaled(void);
 extern void ResetPromoteSignaled(void);
 

base-commit: 454c182f8542890d0e2eac85f70d9a254a34fce3
-- 
2.43.0

v5-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patchtext/x-patch; charset=utf-8; name=v5-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patchDownload
From d04d0350510efc57dd1cd5947ef90f139972d3fc Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v5 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is that doing so would break legacy
programs that use the select(2) system call in hard-to-debug ways. So
instead, programs that want to opt in to a higher open file limit are
expected to bump their soft limit to their hard limit on startup.
Details on this are very well explained in a blog post by the systemd
author[1]. A similar change was done by the Go language[2].

This starts bumping the postmaster's soft open file limit when we
realize that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds; previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural for users to understand, because now the limit on the files
they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at
AIO there are many reports from people who need to change their soft
file limit before starting Postgres, sometimes falling back to lowering
max_files_per_process when they fail to do so[3-8]. It's also not all
that strange to fail at setting the soft open file limit, because there
are multiple places where one can configure such limits and usually
only one of them is effective (which one depends on how Postgres is
started). In cloud environments it's also often not possible for users
to change the soft limit, because they don't control the way that
Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shell-ing out to other executables. This is done as a
precaution in case those executables are using select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 doc/src/sgml/config.sgml         |  17 ++-
 src/backend/postmaster/startup.c |   3 +
 src/backend/storage/file/fd.c    | 208 +++++++++++++++++++++++++++----
 src/include/storage/fd.h         |   2 +
 4 files changed, 199 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 007746a4429..c69522be67a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2378,15 +2378,14 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Sets the maximum number of simultaneously open files allowed to each
-        server subprocess. The default is one thousand files. If the kernel is enforcing
-        a safe per-process limit, you don't need to worry about this setting.
-        But on some platforms (notably, most BSD systems), the kernel will
-        allow individual processes to open many more files than the system
-        can actually support if many processes all try to open
-        that many files. If you find yourself seeing <quote>Too many open
-        files</quote> failures, try reducing this setting.
-        This parameter can only be set at server start.
+        Sets the maximum number of files that each server subprocess is
+        allowed to open. The default is one thousand files. If the kernel
+        enforces a lower soft limit, Postgres will automatically increase it
+        if the hard limit allows that. Setting this value too high can cause
+        resource exhaustion problems: on some platforms (notably, most BSD
+        systems), the kernel will allow individual processes to open many more
+        files than the system can actually support if many processes all try
+        to open that many files. This parameter can only be set at server start.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index b0cd40c081e..d3a5e2590ab 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -30,6 +30,7 @@
 #include "storage/pmsignal.h"
 #include "storage/procsignal.h"
 #include "storage/standby.h"
+#include "storage/fd.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
@@ -274,6 +275,7 @@ System(const char *command, uint32 wait_event_info)
 {
 	int			rc;
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 
@@ -303,6 +305,7 @@ System(const char *command, uint32 wait_event_info)
 		in_restore_command = false;
 
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 	return rc;
 }
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e454db4c020..f665422692d 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -157,6 +157,13 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static bool saved_original_max_open_files;
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -945,6 +952,154 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+/*
+ * Returns true if the passed-in highestfd is the last one that we're allowed
+ * to open based on our current soft open file limit (RLIMIT_NOFILE).
+ */
+static bool
+IsOpenFileLimit(int highestfd)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	return highestfd >= custom_max_open_files.rlim_cur - 1;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Increases the soft open file limit (RLIMIT_NOFILE) by up to the requested
+ * amount. Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+#ifdef HAVE_GETRLIMIT
+	struct rlimit rlim;
+
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	rlim = custom_max_open_files;
+
+	/* If the soft limit already equals the hard limit, we can't increase it */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		/* We made sure not to exceed the hard limit, so this shouldn't fail */
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Saves the original open file limit (RLIMIT_NOFILE) the first time this
+ * is called. If called again it's a no-op.
+ *
+ * On failure a WARNING is emitted and the original limit is not saved.
+ */
+static void
+SaveOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	int			status;
+
+	if (saved_original_max_open_files)
+	{
+		/* Already saved, no need to do it again */
+		return;
+	}
+
+	status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
+	if (status != 0)
+	{
+		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+		return;
+	}
+
+	custom_max_open_files = original_max_open_files;
+	saved_original_max_open_files = true;
+	return;
+#else
+	return;
+#endif
+}
+
+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * 		was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2),
+ * which can only handle file descriptors below FD_SETSIZE (usually 1024).
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * RestoreCustomOpenFileLimit --- Restores our custom open file limit after a
+ * 		previous call to RestoreOriginalOpenFileLimit.
+ */
+void
+RestoreCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -968,38 +1123,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			highestfd = 0;
 	int			j;
 
-#ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
-	int			getrlimit_status;
-#endif
-
 	size = 1024;
 	fd = (int *) palloc(size * sizeof(int));
 
-#ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
-	if (getrlimit_status != 0)
-		ereport(WARNING, (errmsg("getrlimit failed: %m")));
-#endif							/* HAVE_GETRLIMIT */
+	SaveOriginalOpenFileLimit();
 
 	/* dup until failure or probe limit reached */
 	for (;;)
 	{
 		int			thisfd;
 
-#ifdef HAVE_GETRLIMIT
-
 		/*
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
-#endif
+		if (IsOpenFileLimit(highestfd))
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+			/*
+			 * Even though we do the pre-check above, it's still possible that
+			 * the call to dup fails with EMFILE. This can happen if the last
+			 * file descriptor was already assigned to an "already open" file.
+			 * One example of this happening is if we're already at the soft
+			 * limit when we call count_usable_fds.
+			 */
+			if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -1053,15 +1209,10 @@ set_max_safe_fds(void)
 	 * or the experimentally-determined EMFILE limit.
 	 *----------
 	 */
-	count_usable_fds(max_files_per_process,
+	count_usable_fds(max_files_per_process + NUM_RESERVED_FDS,
 					 &usable_fds, &already_open);
 
-	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);
-
-	/*
-	 * Take off the FDs reserved for system() etc.
-	 */
-	max_safe_fds -= NUM_RESERVED_FDS;
+	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
 	/*
 	 * Make sure we still have enough to get by.
@@ -2724,6 +2875,19 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+
+	/*
+	 * It would be great if we could call RestoreOriginalOpenFileLimit here,
+	 * but since popen() also opens a file in the current process (this side
+	 * of the pipe) we cannot do so safely, because we might already have
+	 * many more files open than the original limit.
+	 *
+	 * The only way to address this would be to implement a custom popen()
+	 * that calls RestoreOriginalOpenFileLimit only around the actual fork
+	 * call, but that seems too much effort to handle the corner case where
+	 * this external command uses both select() and tries to open more files
+	 * than select() allows for.
+	 */
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..d24e7f1c8df 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -177,6 +177,8 @@ extern void RemovePgTempFiles(void);
 extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
 								   bool unlink_all);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void RestoreOriginalOpenFileLimit(void);
+extern void RestoreCustomOpenFileLimit(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.43.0
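The dup()-probing loop that count_usable_fds() keeps using in the patch above can be illustrated with a standalone sketch. The helper name here is hypothetical; the real function also tracks already-open descriptors and emits warnings:

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: duplicate stderr until the kernel refuses (typically EMFILE
 * at the soft limit) or the probe cap is hit, then close everything
 * again and report how many descriptors were usable. */
static int
probe_usable_fds(int max_to_probe)
{
	int		   *fds = malloc(sizeof(int) * (max_to_probe > 0 ? max_to_probe : 1));
	int			used = 0;
	int			count;

	while (used < max_to_probe)
	{
		int			fd = dup(2);

		if (fd < 0)
			break;				/* out of descriptors */
		fds[used++] = fd;
	}

	count = used;
	while (used > 0)
		close(fds[--used]);
	free(fds);

	return count;
}
```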

v5-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patchtext/x-patch; charset=utf-8; name=v5-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patchDownload
From ff352f09fa72d6e359be6bf20a215c3724be2303 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v5 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f665422692d..03a16e8e5b6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1200,6 +1200,7 @@ set_max_safe_fds(void)
 {
 	int			usable_fds;
 	int			already_open;
+	char	   *max_safe_fds_string;
 
 	/*----------
 	 * We want to set max_safe_fds to
@@ -1214,6 +1215,12 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
+	/* Update GUC variable to allow users to see the result */
+	max_safe_fds_string = psprintf("%d", max_safe_fds);
+	SetConfigOption("max_files_per_process", max_safe_fds_string,
+					PGC_POSTMASTER, PGC_S_OVERRIDE);
+	pfree(max_safe_fds_string);
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */
-- 
2.43.0
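For context on why the patches in this thread restore the original limit before shelling out: select(2) can only track descriptors below FD_SETSIZE, no matter how high RLIMIT_NOFILE has been raised. A tiny illustration (hypothetical helper name):

```c
#include <assert.h>
#include <sys/select.h>

/* Only descriptors in the range [0, FD_SETSIZE) can be stored in an
 * fd_set; passing a larger descriptor to FD_SET is undefined behavior,
 * which is what breaks legacy select() users under a raised limit. */
static int
fd_fits_in_fd_set(int fd)
{
	return fd >= 0 && fd < FD_SETSIZE;
}
```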

#23Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#22)
3 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Mon Feb 24, 2025 at 11:58 AM CET, Jelte Fennema-Nio wrote:

Ugh, a compiler warning snuck in on Windows during some of my final
refactoring. Fixed now.

Right after pressing send I realized I could remove two more useless
lines...

Attachments:

v6-0001-Adds-a-helper-for-places-that-shell-out-to-system.patchtext/x-patch; charset=utf-8; name=v6-0001-Adds-a-helper-for-places-that-shell-out-to-system.patchDownload
From ec72e05e87c73dee3de73f9a6586e8e8db2d919e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Sun, 23 Feb 2025 16:52:29 +0100
Subject: [PATCH v6 1/3] Adds a helper for places that shell out to system()

We need to call system() in a few places, and to do so safely we need
some pre/post-fork logic. This encapsulates that logic into a single
System() helper function. The main reason to do this is that in a
follow-up commit we want to add even more logic pre- and post-fork.
---
 src/backend/access/transam/xlogarchive.c | 28 +-----------
 src/backend/archive/shell_archive.c      |  6 +--
 src/backend/postmaster/startup.c         | 54 +++++++++++++++++-------
 src/include/postmaster/startup.h         |  3 +-
 4 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..59a0f191339 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,27 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-
-	/*
-	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
-	 * that it should proc_exit() right away.  This is done for the duration
-	 * of the system() call because there isn't a good way to break out while
-	 * it is executing.  Since we might call proc_exit() in a signal handler,
-	 * it is best to put any additional logic before or after the
-	 * PreRestoreCommand()/PostRestoreCommand() section.
-	 */
-	PreRestoreCommand();
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	rc = system(xlogRestoreCmd);
-
-	PostRestoreCommand();
-
-	pgstat_report_wait_end();
+	rc = System(xlogRestoreCmd, WAIT_EVENT_RESTORE_COMMAND);
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
@@ -325,11 +305,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(wait_event_info);
-	rc = system(xlogRecoveryCmd);
-	pgstat_report_wait_end();
-
+	rc = System(xlogRecoveryCmd, wait_event_info);
 	pfree(xlogRecoveryCmd);
 
 	if (rc != 0)
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..7977e5a5092 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -22,6 +22,7 @@
 #include "archive/shell_archive.h"
 #include "common/percentrepl.h"
 #include "pgstat.h"
+#include "postmaster/startup.h"
 
 static bool shell_archive_configured(ArchiveModuleState *state);
 static bool shell_archive_file(ArchiveModuleState *state,
@@ -75,10 +76,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
-	rc = system(xlogarchcmd);
-	pgstat_report_wait_end();
+	rc = System(xlogarchcmd, WAIT_EVENT_ARCHIVE_COMMAND);
 
 	if (rc != 0)
 	{
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 88eab3d0baf..b0cd40c081e 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -33,6 +33,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
+#include "utils/wait_event_types.h"
 
 
 #ifndef USE_POSTMASTER_DEATH_SIGNAL
@@ -264,24 +265,45 @@ StartupProcessMain(const void *startup_data, size_t startup_data_len)
 	proc_exit(0);
 }
 
-void
-PreRestoreCommand(void)
+/*
+ * A custom wrapper around the system() function that calls the necessary
+ * functions pre/post-fork.
+ */
+int
+System(const char *command, uint32 wait_event_info)
 {
-	/*
-	 * Set in_restore_command to tell the signal handler that we should exit
-	 * right away on SIGTERM. We know that we're at a safe point to do that.
-	 * Check if we had already received the signal, so that we don't miss a
-	 * shutdown request received just before this.
-	 */
-	in_restore_command = true;
-	if (shutdown_requested)
-		proc_exit(1);
-}
+	int			rc;
 
-void
-PostRestoreCommand(void)
-{
-	in_restore_command = false;
+	fflush(NULL);
+	pgstat_report_wait_start(wait_event_info);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+	{
+		/*
+		 * Set in_restore_command to tell the signal handler that we should
+		 * exit right away on SIGTERM. This is done for the duration of the
+		 * system() call because there isn't a good way to break out while it
+		 * is executing.  Since we might call proc_exit() in a signal handler,
+		 * it is best to put any additional logic outside of this section
+		 * where in_restore_command is set to true.
+		 */
+		in_restore_command = true;
+
+		/*
+		 * Also check if we had already received the signal, so that we don't
+		 * miss a shutdown request received just before this.
+		 */
+		if (shutdown_requested)
+			proc_exit(1);
+	}
+
+	rc = system(command);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+		in_restore_command = false;
+
+	pgstat_report_wait_end();
+	return rc;
 }
 
 bool
diff --git a/src/include/postmaster/startup.h b/src/include/postmaster/startup.h
index ae0f6347fc0..09f5dcd3d03 100644
--- a/src/include/postmaster/startup.h
+++ b/src/include/postmaster/startup.h
@@ -27,8 +27,7 @@ extern PGDLLIMPORT int log_startup_progress_interval;
 
 extern void HandleStartupProcInterrupts(void);
 extern void StartupProcessMain(const void *startup_data, size_t startup_data_len) pg_attribute_noreturn();
-extern void PreRestoreCommand(void);
-extern void PostRestoreCommand(void);
+extern int	System(const char *command, uint32 wait_event_info);
 extern bool IsPromoteSignaled(void);
 extern void ResetPromoteSignaled(void);
 

base-commit: 454c182f8542890d0e2eac85f70d9a254a34fce3
-- 
2.43.0

v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch (text/x-patch)
From d03ad921f8bca8c16107d755d6edc75d1ccba3e0 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v6 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is that doing so would break
legacy programs that use the select(2) system call in hard to debug
ways. So instead programs that want to opt-in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds, previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural to understand for users, because now the limit of files that
they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at AIO
there is a large number of reports from people that require changing
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not possible
for users to change the soft limit, because they don't control the way
that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shelling out to other executables. This is done as a
precaution in case those executables are using select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 doc/src/sgml/config.sgml         |  17 ++-
 src/backend/postmaster/startup.c |   3 +
 src/backend/storage/file/fd.c    | 206 +++++++++++++++++++++++++++----
 src/include/storage/fd.h         |   2 +
 4 files changed, 197 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 007746a4429..c69522be67a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2378,15 +2378,14 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Sets the maximum number of simultaneously open files allowed to each
-        server subprocess. The default is one thousand files. If the kernel is enforcing
-        a safe per-process limit, you don't need to worry about this setting.
-        But on some platforms (notably, most BSD systems), the kernel will
-        allow individual processes to open many more files than the system
-        can actually support if many processes all try to open
-        that many files. If you find yourself seeing <quote>Too many open
-        files</quote> failures, try reducing this setting.
-        This parameter can only be set at server start.
+        Sets the maximum number of files each server subprocess is allowed
+        to open. The default is one thousand files. If the kernel is enforcing
+        a lower soft limit, Postgres will automatically increase it if the hard
+        limit allows that. Setting this value too high can cause resource
+        exhaustion problems: on some platforms (notably, most BSD systems), the
+        kernel will allow individual processes to open many more files than the
+        system can actually support if many processes all try to open that many
+        files. This parameter can only be set at server start.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index b0cd40c081e..d3a5e2590ab 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -30,6 +30,7 @@
 #include "storage/pmsignal.h"
 #include "storage/procsignal.h"
 #include "storage/standby.h"
+#include "storage/fd.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
@@ -274,6 +275,7 @@ System(const char *command, uint32 wait_event_info)
 {
 	int			rc;
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 
@@ -303,6 +305,7 @@ System(const char *command, uint32 wait_event_info)
 		in_restore_command = false;
 
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 	return rc;
 }
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e454db4c020..fc60e7ebeb9 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -157,6 +157,13 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static bool saved_original_max_open_files;
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -945,6 +952,152 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+/*
+ * Returns true if the passed in highestfd is the highest file descriptor
+ * that our current soft open file limit (RLIMIT_NOFILE) allows us to open.
+ */
+static bool
+IsOpenFileLimit(int highestfd)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	return highestfd >= custom_max_open_files.rlim_cur - 1;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+#ifdef HAVE_GETRLIMIT
+	struct rlimit rlim;
+
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	rlim = custom_max_open_files;
+
+	/* If we're already at the hard limit, we can't increase any further */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		/* We made sure not to exceed the hard limit, so this shouldn't fail */
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Saves the original open file limit (RLIMIT_NOFILE) the first time this
+ * is called. If called again it's a no-op.
+ *
+ * On failure to get the limit a WARNING is emitted and nothing is saved.
+ */
+static void
+SaveOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	int			status;
+
+	if (saved_original_max_open_files)
+	{
+		/* Already saved, no need to do it again */
+		return;
+	}
+
+	status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
+	if (status != 0)
+	{
+		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+		return;
+	}
+
+	custom_max_open_files = original_max_open_files;
+	saved_original_max_open_files = true;
+	return;
+#endif
+}
+
+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * 		was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * RestoreCustomOpenFileLimit --- Restores our custom open file limit after a
+ * 		previous call to RestoreOriginalOpenFileLimit.
+ */
+void
+RestoreCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -968,38 +1121,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			highestfd = 0;
 	int			j;
 
-#ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
-	int			getrlimit_status;
-#endif
-
 	size = 1024;
 	fd = (int *) palloc(size * sizeof(int));
 
-#ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
-	if (getrlimit_status != 0)
-		ereport(WARNING, (errmsg("getrlimit failed: %m")));
-#endif							/* HAVE_GETRLIMIT */
+	SaveOriginalOpenFileLimit();
 
 	/* dup until failure or probe limit reached */
 	for (;;)
 	{
 		int			thisfd;
 
-#ifdef HAVE_GETRLIMIT
-
 		/*
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
-#endif
+		if (IsOpenFileLimit(highestfd))
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+			/*
+			 * Even though we do the pre-check above, it's still possible that
+			 * the call to dup fails with EMFILE. This can happen if the last
+			 * file descriptor was already assigned to an "already open" file.
+			 * One example of this happening is if we're already at the soft
+			 * limit when we call count_usable_fds.
+			 */
+			if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -1053,15 +1207,10 @@ set_max_safe_fds(void)
 	 * or the experimentally-determined EMFILE limit.
 	 *----------
 	 */
-	count_usable_fds(max_files_per_process,
+	count_usable_fds(max_files_per_process + NUM_RESERVED_FDS,
 					 &usable_fds, &already_open);
 
-	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);
-
-	/*
-	 * Take off the FDs reserved for system() etc.
-	 */
-	max_safe_fds -= NUM_RESERVED_FDS;
+	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
 	/*
 	 * Make sure we still have enough to get by.
@@ -2724,6 +2873,19 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+
+	/*
+	 * It would be great if we could call RestoreOriginalOpenFileLimit here,
+	 * but since popen() also opens a file in the current process (this side
+	 * of the pipe) we cannot do so safely. Because we might already have many
+	 * more files open than the original limit.
+	 *
+	 * The only way to address this would be implementing a custom popen()
+	 * that calls RestoreOriginalOpenFileLimit only around the actual fork
+	 * call, but that seems too much effort to handle the corner case where
+	 * this external command uses both select() and tries to open more files
+	 * than select() allows for.
+	 */
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..d24e7f1c8df 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -177,6 +177,8 @@ extern void RemovePgTempFiles(void);
 extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
 								   bool unlink_all);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void RestoreOriginalOpenFileLimit(void);
+extern void RestoreCustomOpenFileLimit(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.43.0

v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch (text/x-patch)
From faf1c3b550215a6dc831d5fc9ee4e5c2ec5604a4 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v6 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index fc60e7ebeb9..f0643838857 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1198,6 +1198,7 @@ set_max_safe_fds(void)
 {
 	int			usable_fds;
 	int			already_open;
+	char	   *max_safe_fds_string;
 
 	/*----------
 	 * We want to set max_safe_fds to
@@ -1212,6 +1213,12 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
+	/* Update GUC variable to allow users to see the result */
+	max_safe_fds_string = psprintf("%d", max_safe_fds);
+	SetConfigOption("max_files_per_process", max_safe_fds_string,
+					PGC_POSTMASTER, PGC_S_OVERRIDE);
+	pfree(max_safe_fds_string);
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */
-- 
2.43.0

#24 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#23)
3 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Mon Feb 24, 2025 at 12:01 PM CET, Jelte Fennema-Nio wrote:

Right after pressing send I realized I could remove two more useless
lines...

Rebased patchset attached (trivial conflict against pg_noreturn
changes).

Attachments:

v7-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch (text/x-patch)
From 249ebbac1b6c01b99795cfe9b0201ab7ee6bacfa Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Sun, 23 Feb 2025 16:52:29 +0100
Subject: [PATCH v7 1/3] Adds a helper for places that shell out to system()

We need to call system() in a few places, and to do so safely we need
some pre/post-fork logic. This encapsulates that logic into a single
System helper function. The main reason to do this is that in a
follow-up commit we want to add even more logic pre- and post-fork.
---
 src/backend/access/transam/xlogarchive.c | 28 +-----------
 src/backend/archive/shell_archive.c      |  6 +--
 src/backend/postmaster/startup.c         | 54 +++++++++++++++++-------
 src/include/postmaster/startup.h         |  3 +-
 4 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..59a0f191339 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,27 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-
-	/*
-	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
-	 * that it should proc_exit() right away.  This is done for the duration
-	 * of the system() call because there isn't a good way to break out while
-	 * it is executing.  Since we might call proc_exit() in a signal handler,
-	 * it is best to put any additional logic before or after the
-	 * PreRestoreCommand()/PostRestoreCommand() section.
-	 */
-	PreRestoreCommand();
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	rc = system(xlogRestoreCmd);
-
-	PostRestoreCommand();
-
-	pgstat_report_wait_end();
+	rc = System(xlogRestoreCmd, WAIT_EVENT_RESTORE_COMMAND);
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
@@ -325,11 +305,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(wait_event_info);
-	rc = system(xlogRecoveryCmd);
-	pgstat_report_wait_end();
-
+	rc = System(xlogRecoveryCmd, wait_event_info);
 	pfree(xlogRecoveryCmd);
 
 	if (rc != 0)
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..7977e5a5092 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -22,6 +22,7 @@
 #include "archive/shell_archive.h"
 #include "common/percentrepl.h"
 #include "pgstat.h"
+#include "postmaster/startup.h"
 
 static bool shell_archive_configured(ArchiveModuleState *state);
 static bool shell_archive_file(ArchiveModuleState *state,
@@ -75,10 +76,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
-	rc = system(xlogarchcmd);
-	pgstat_report_wait_end();
+	rc = System(xlogarchcmd, WAIT_EVENT_ARCHIVE_COMMAND);
 
 	if (rc != 0)
 	{
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 27e86cf393f..2e3d0ea8cf0 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -33,6 +33,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
+#include "utils/wait_event_types.h"
 
 
 #ifndef USE_POSTMASTER_DEATH_SIGNAL
@@ -264,24 +265,45 @@ StartupProcessMain(const void *startup_data, size_t startup_data_len)
 	proc_exit(0);
 }
 
-void
-PreRestoreCommand(void)
+/*
+ * A custom wrapper around the system() function that calls the necessary
+ * functions pre/post-fork.
+ */
+int
+System(const char *command, uint32 wait_event_info)
 {
-	/*
-	 * Set in_restore_command to tell the signal handler that we should exit
-	 * right away on SIGTERM. We know that we're at a safe point to do that.
-	 * Check if we had already received the signal, so that we don't miss a
-	 * shutdown request received just before this.
-	 */
-	in_restore_command = true;
-	if (shutdown_requested)
-		proc_exit(1);
-}
+	int			rc;
 
-void
-PostRestoreCommand(void)
-{
-	in_restore_command = false;
+	fflush(NULL);
+	pgstat_report_wait_start(wait_event_info);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+	{
+		/*
+		 * Set in_restore_command to tell the signal handler that we should
+		 * exit right away on SIGTERM. This is done for the duration of the
+		 * system() call because there isn't a good way to break out while it
+		 * is executing.  Since we might call proc_exit() in a signal handler,
+		 * it is best to put any additional logic outside of this section
+		 * where in_restore_command is set to true.
+		 */
+		in_restore_command = true;
+
+		/*
+		 * Also check if we had already received the signal, so that we don't
+		 * miss a shutdown request received just before this.
+		 */
+		if (shutdown_requested)
+			proc_exit(1);
+	}
+
+	rc = system(command);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+		in_restore_command = false;
+
+	pgstat_report_wait_end();
+	return rc;
 }
 
 bool
diff --git a/src/include/postmaster/startup.h b/src/include/postmaster/startup.h
index ec2c8d3bff5..7fdf9100044 100644
--- a/src/include/postmaster/startup.h
+++ b/src/include/postmaster/startup.h
@@ -27,8 +27,7 @@ extern PGDLLIMPORT int log_startup_progress_interval;
 
 extern void ProcessStartupProcInterrupts(void);
 pg_noreturn extern void StartupProcessMain(const void *startup_data, size_t startup_data_len);
-extern void PreRestoreCommand(void);
-extern void PostRestoreCommand(void);
+extern int	System(const char *command, uint32 wait_event_info);
 extern bool IsPromoteSignaled(void);
 extern void ResetPromoteSignaled(void);
 

base-commit: 65db3963ae7154b8f01e4d73dc6b1ffd81c70e1e
-- 
2.43.0

v7-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch (text/x-patch)
From 5e861b43cc0dc903d530e9140ddf5acf5c81b270 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v7 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is that doing so would break
legacy programs that use the select(2) system call in hard to debug
ways. So instead programs that want to opt-in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds, previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural to understand for users, because now the limit of files that
they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at AIO
there is a large number of reports from people that require changing
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not possible
for users to change the soft limit, because they don't control the way
that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shelling out to other executables. This is done as a
precaution in case those executables are using select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 doc/src/sgml/config.sgml         |  17 ++-
 src/backend/postmaster/startup.c |   3 +
 src/backend/storage/file/fd.c    | 206 +++++++++++++++++++++++++++----
 src/include/storage/fd.h         |   2 +
 4 files changed, 197 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3d62c8bd274..271ae897172 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2378,15 +2378,14 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Sets the maximum number of simultaneously open files allowed to each
-        server subprocess. The default is one thousand files. If the kernel is enforcing
-        a safe per-process limit, you don't need to worry about this setting.
-        But on some platforms (notably, most BSD systems), the kernel will
-        allow individual processes to open many more files than the system
-        can actually support if many processes all try to open
-        that many files. If you find yourself seeing <quote>Too many open
-        files</quote> failures, try reducing this setting.
-        This parameter can only be set at server start.
+        Sets the maximum number of files each server subprocess is allowed
+        to open. The default is one thousand files. If the kernel is enforcing
+        a lower soft limit, Postgres will automatically increase it if the hard
+        limit allows that. Setting this value too high can cause resource
+        exhaustion problems: on some platforms (notably, most BSD systems), the
+        kernel will allow individual processes to open many more files than the
+        system can actually support if many processes all try to open that many
+        files. This parameter can only be set at server start.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 2e3d0ea8cf0..9a5c3eb29a7 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -30,6 +30,7 @@
 #include "storage/pmsignal.h"
 #include "storage/procsignal.h"
 #include "storage/standby.h"
+#include "storage/fd.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
@@ -274,6 +275,7 @@ System(const char *command, uint32 wait_event_info)
 {
 	int			rc;
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 
@@ -303,6 +305,7 @@ System(const char *command, uint32 wait_event_info)
 		in_restore_command = false;
 
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 	return rc;
 }
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 62f1185859f..de6bdf9c141 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -157,6 +157,13 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static bool saved_original_max_open_files;
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -945,6 +952,152 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+/*
+ * Returns true if the passed in highestfd is the last one we're allowed to
+ * open based on our open file limit, as saved by SaveOriginalOpenFileLimit().
+ */
+static bool
+IsOpenFileLimit(int highestfd)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	return highestfd >= custom_max_open_files.rlim_cur - 1;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+#ifdef HAVE_GETRLIMIT
+	struct rlimit rlim;
+
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	rlim = custom_max_open_files;
+
+	/* If we're already at the max we reached our limit */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		/* We made sure not to exceed the hard limit, so this shouldn't fail */
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Saves the original open file limit (RLIMIT_NOFILE) the first time this
+ * is called. If called again it's a no-op.
+ *
+ * If getrlimit() fails, a WARNING is emitted and the limit is left unsaved.
+ */
+static void
+SaveOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	int			status;
+
+	if (saved_original_max_open_files)
+	{
+		/* Already saved, no need to do it again */
+		return;
+	}
+
+	status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
+	if (status != 0)
+	{
+		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+		return;
+	}
+
+	custom_max_open_files = original_max_open_files;
+	saved_original_max_open_files = true;
+	return;
+#endif
+}
+
+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * 		was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * RestoreCustomOpenFileLimit --- Restores our custom open file limit after a
+ * 		previous call to RestoreOriginalOpenFileLimit.
+ */
+void
+RestoreCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -968,38 +1121,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			highestfd = 0;
 	int			j;
 
-#ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
-	int			getrlimit_status;
-#endif
-
 	size = 1024;
 	fd = (int *) palloc(size * sizeof(int));
 
-#ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
-	if (getrlimit_status != 0)
-		ereport(WARNING, (errmsg("getrlimit failed: %m")));
-#endif							/* HAVE_GETRLIMIT */
+	SaveOriginalOpenFileLimit();
 
 	/* dup until failure or probe limit reached */
 	for (;;)
 	{
 		int			thisfd;
 
-#ifdef HAVE_GETRLIMIT
-
 		/*
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
-#endif
+		if (IsOpenFileLimit(highestfd))
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+			/*
+			 * Even though we do the pre-check above, it's still possible that
+			 * the call to dup fails with EMFILE. This can happen if the last
+			 * file descriptor was already assigned to an "already open" file.
+			 * One example of this happening is if we're already at the soft
+			 * limit when we call count_usable_fds.
+			 */
+			if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -1053,15 +1207,10 @@ set_max_safe_fds(void)
 	 * or the experimentally-determined EMFILE limit.
 	 *----------
 	 */
-	count_usable_fds(max_files_per_process,
+	count_usable_fds(max_files_per_process + NUM_RESERVED_FDS,
 					 &usable_fds, &already_open);
 
-	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);
-
-	/*
-	 * Take off the FDs reserved for system() etc.
-	 */
-	max_safe_fds -= NUM_RESERVED_FDS;
+	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
 	/*
 	 * Make sure we still have enough to get by.
@@ -2724,6 +2873,19 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+
+	/*
+	 * It would be great if we could call RestoreOriginalOpenFileLimit here,
+	 * but since popen() also opens a file in the current process (this side
+	 * of the pipe) we cannot do so safely. Because we might already have many
+	 * more files open than the original limit.
+	 *
+	 * The only way to address this would be implementing a custom popen()
+	 * that calls RestoreOriginalOpenFileLimit only around the actual fork
+	 * call, but that seems too much effort to handle the corner case where
+	 * this external command uses both select() and tries to open more files
+	 * than select() allows for.
+	 */
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..d24e7f1c8df 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -177,6 +177,8 @@ extern void RemovePgTempFiles(void);
 extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
 								   bool unlink_all);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void RestoreOriginalOpenFileLimit(void);
+extern void RestoreCustomOpenFileLimit(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.43.0

v7-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch (text/x-patch)
From af312866227074205e96fb80ccf86d74ff050d71 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v7 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index de6bdf9c141..1454568810d 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1198,6 +1198,7 @@ set_max_safe_fds(void)
 {
 	int			usable_fds;
 	int			already_open;
+	char	   *max_safe_fds_string;
 
 	/*----------
 	 * We want to set max_safe_fds to
@@ -1212,6 +1213,12 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
+	/* Update GUC variable to allow users to see the result */
+	max_safe_fds_string = psprintf("%d", max_safe_fds);
+	SetConfigOption("max_files_per_process", max_safe_fds_string,
+					PGC_POSTMASTER, PGC_S_OVERRIDE);
+	pfree(max_safe_fds_string);
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */
-- 
2.43.0

#25Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jelte Fennema-Nio (#24)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On 18/03/2025 01:08, Jelte Fennema-Nio wrote:

On Mon Feb 24, 2025 at 12:01 PM CET, Jelte Fennema-Nio wrote:

Right after pressing send I realized I could remove two more useless
lines...

Rebased patchset attached (trivial conflict against pg_noreturn
changes).

v7-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch:

/*
* A custom wrapper around the system() function that calls the necessary
* functions pre/post-fork.
*/
int
System(const char *command, uint32 wait_event_info)
{
int rc;

fflush(NULL);
pgstat_report_wait_start(wait_event_info);

if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
{
/*
* Set in_restore_command to tell the signal handler that we should
* exit right away on SIGTERM. This is done for the duration of the
* system() call because there isn't a good way to break out while it
* is executing. Since we might call proc_exit() in a signal handler,
* it is best to put any additional logic outside of this section
* where in_restore_command is set to true.
*/
in_restore_command = true;

/*
* Also check if we had already received the signal, so that we don't
* miss a shutdown request received just before this.
*/
if (shutdown_requested)
proc_exit(1);
}

rc = system(command);

if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
in_restore_command = false;

pgstat_report_wait_end();
return rc;
}

Let's move that 'in_restore_command' business to the caller. It's weird
modularity for the function to implicitly behave differently for some
callers. And 'wait_event_info' should only affect pgstat reporting, not
actual behavior.

I don't feel good about the function name. How about pg_system() or
something? postmaster/startup.c also seems like a weird place for it;
not sure where it belongs but probably not there. Maybe next to
OpenPipeStream() in fd.c, or move both to a new file.

v7-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch:

@@ -274,6 +275,7 @@ System(const char *command, uint32 wait_event_info)
{
int rc;

+ RestoreOriginalOpenFileLimit();
fflush(NULL);
pgstat_report_wait_start(wait_event_info);

@@ -303,6 +305,7 @@ System(const char *command, uint32 wait_event_info)
in_restore_command = false;

pgstat_report_wait_end();
+ RestoreCustomOpenFileLimit();
return rc;
}

Looks a bit funny that both functions are called Restore<something>().
Not sure what to suggest instead though. Maybe RaiseOpenFileLimit() and
LowerOpenFileLimit().

@@ -2724,6 +2873,19 @@ OpenPipeStream(const char *command, const char *mode)
ReleaseLruFiles();

TryAgain:
+
+	/*
+	 * It would be great if we could call RestoreOriginalOpenFileLimit here,
+	 * but since popen() also opens a file in the current process (this side
+	 * of the pipe) we cannot do so safely. Because we might already have many
+	 * more files open than the original limit.
+	 *
+	 * The only way to address this would be implementing a custom popen()
+	 * that calls RestoreOriginalOpenFileLimit only around the actual fork
+	 * call, but that seems too much effort to handle the corner case where
+	 * this external command uses both select() and tries to open more files
+	 * than select() allows for.
+	 */
fflush(NULL);
pqsignal(SIGPIPE, SIG_DFL);
errno = 0;

What would it take to re-implement popen() with fork+exec? Genuine
question; I have a feeling it might be complicated to do correctly on
all platforms, but I don't know.

--
Heikki Linnakangas
Neon (https://neon.tech)

#26Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Heikki Linnakangas (#25)
3 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Fri Apr 4, 2025 at 7:34 PM CEST, Heikki Linnakangas wrote:

Let's move that 'in_restore_command' business to the caller. It's weird
modularity for the function to implicitly behave differently for some
callers.

I definitely agree with the sentiment, and it was what I originally
planned to do. But then I went for this approach anyway because commit
8fb13dd6ab5b explicitely moved all code except for the actual call to
system() out of the PreRestoreCommand()/PostRestoreCommand() section
(which is also described in the code comment).

So I didn't move the the in_restore_command stuff to the caller, and
improved the function comment to call out this unfortunate coupling.

And 'wait_event_info' should only affect pgstat reporting, not
actual behavior.

Given that we need to keep the restore command stuff in this function, I
think the only other option is to add a dedicated argument for the
restore command stuff, like "bool is_restore_command". But that would
require every caller, except for the restore command, to pass an
additional "false" as an argument. To me the additionaly noise that that
adds seems like a worse issue than the non-purity we get by
piggy-backing on the wait_event_info argument.

I don't feel good about the function name. How about pg_system() or
something?

Done

postmaster/startup.c also seems like a weird place for it;
not sure where it belongs but probably not there. Maybe next to
OpenPipeStream() in fd.c, or move both to a new file.

Moved it to fd.c

Looks a bit funny that both functions are called Restore<something>().
Not sure what to suggest instead though. Maybe RaiseOpenFileLimit() and
LowerOpenFileLimit().

Changed them to UseOriginalOpenFileLimit() and
UseCustomOpenFileLimit()

What would it take to re-implement popen() with fork+exec? Genuine
question; I have a feeling it might be complicated to do correctly on
all platforms, but I don't know.

I initially attempted to re-implement it, but after looking at the
fairly complex FreeBSD implementation of popen[1] that suddenly seemed
more trouble than it's worth.

[1]: https://github.com/freebsd/freebsd-src/blob/c98367641991019bac0e8cd55b70682171820534/lib/libc/gen/popen.c#L63-L181

Attachments:

v8-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch (text/x-patch)
From 5dbab2ccf0d74313dbc2cbaeb418346de8cc2f48 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Sun, 23 Feb 2025 16:52:29 +0100
Subject: [PATCH v8 1/3] Adds a helper for places that shell out to system()

We need to call system() in a few places, and to do so safely we need
some pre/post-fork logic. This encapsulates that logic into a single
System helper function. The main reason to do this is that in a follow
up commit we want to add even more logic pre and post-fork.
---
 src/backend/access/transam/xlogarchive.c | 29 ++--------------
 src/backend/archive/shell_archive.c      |  5 +--
 src/backend/postmaster/startup.c         |  1 +
 src/backend/storage/file/fd.c            | 42 ++++++++++++++++++++++++
 src/include/storage/fd.h                 |  1 +
 5 files changed, 48 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..c7640ec5025 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,27 +158,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-
-	/*
-	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
-	 * that it should proc_exit() right away.  This is done for the duration
-	 * of the system() call because there isn't a good way to break out while
-	 * it is executing.  Since we might call proc_exit() in a signal handler,
-	 * it is best to put any additional logic before or after the
-	 * PreRestoreCommand()/PostRestoreCommand() section.
-	 */
-	PreRestoreCommand();
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	rc = system(xlogRestoreCmd);
-
-	PostRestoreCommand();
-
-	pgstat_report_wait_end();
+	/* Copy xlog from archival storage to XLOGDIR */
+	rc = pg_system(xlogRestoreCmd, WAIT_EVENT_RESTORE_COMMAND);
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
@@ -325,11 +306,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(wait_event_info);
-	rc = system(xlogRecoveryCmd);
-	pgstat_report_wait_end();
-
+	rc = pg_system(xlogRecoveryCmd, wait_event_info);
 	pfree(xlogRecoveryCmd);
 
 	if (rc != 0)
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..64c2c6aa760 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -75,10 +75,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
-	rc = system(xlogarchcmd);
-	pgstat_report_wait_end();
+	rc = pg_system(xlogarchcmd, WAIT_EVENT_ARCHIVE_COMMAND);
 
 	if (rc != 0)
 	{
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 7149a67fcbc..eb79fda1fd9 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -33,6 +33,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
+#include "utils/wait_event_types.h"
 
 
 #ifndef USE_POSTMASTER_DEATH_SIGNAL
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 0e8299dd556..718d8079ad7 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2734,6 +2734,48 @@ OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	return -1;					/* failure */
 }
 
+/*
+ * A custom wrapper around the system() function that calls the necessary
+ * functions pre/post-fork.
+ *
+ * If WAIT_EVENT_RESTORE_COMMAND is passed as the wait_event_info, it will also
+ * call the necessary PreRestoreCommand/PostRestoreCommand functions. It's
+ * unfortunate that we have to couple the behaviour of this function so
+ * tightly to the restore command logic, but it's the only way to make sure
+ * that we don't have additional logic in between the PreRestoreCommand and
+ * PostRestoreCommand calls.
+ */
+int
+pg_system(const char *command, uint32 wait_event_info)
+{
+	int			rc;
+
+	fflush(NULL);
+	pgstat_report_wait_start(wait_event_info);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+	{
+		/*
+		 * PreRestoreCommand() informs the SIGTERM handler for the startup
+		 * process that it should proc_exit() right away.  This is done for
+		 * the duration of the system() call because there isn't a good way to
+		 * break out while it is executing.  Since we might call proc_exit()
+		 * in a signal handler, it is best to put any additional logic before
+		 * or after the PreRestoreCommand()/PostRestoreCommand() section.
+		 */
+		PreRestoreCommand();
+	}
+
+	rc = system(command);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+		PostRestoreCommand();
+
+	pgstat_report_wait_end();
+	return rc;
+}
+
+
 /*
  * Routines that want to initiate a pipe stream should use OpenPipeStream
  * rather than plain popen().  This lets fd.c deal with freeing FDs if
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index b77d8e5e30e..2d445674a1a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -187,6 +187,7 @@ extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern bool pg_file_exists(const char *name);
 extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
+extern int	pg_system(const char *command, uint32 wait_event_info);
 extern int	pg_truncate(const char *path, off_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);

base-commit: f09088a01d3372fdfe5da12dd0b2e24989f0caa6
-- 
2.43.0

v8-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch (text/x-patch)
From c43950afa51b22892211770a41d4be4609b7b9ac Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v8 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is that doing so would break
legacy programs that use the select(2) system call in hard to debug
ways. So instead programs that want to opt-in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds, previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural to understand for users, because now the limit of files that
they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at AIO
there is a large number of reports from people that require changing
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not possible
for users to change the soft limit, because they don't control the way
that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shell-ing out to other executables. This is done as a
precaution in case those executables are using select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 src/backend/storage/file/fd.c | 199 +++++++++++++++++++++++++++++++---
 1 file changed, 184 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 718d8079ad7..b3a58deef43 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -158,6 +158,13 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static bool saved_original_max_open_files;
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -946,6 +953,152 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+/*
+ * Returns true if the passed in highestfd is the last one we're allowed to
+ * open based on our open file limit, as saved by SaveOriginalOpenFileLimit().
+ */
+static bool
+IsOpenFileLimit(int highestfd)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	return highestfd >= custom_max_open_files.rlim_cur - 1;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+#ifdef HAVE_GETRLIMIT
+	struct rlimit rlim;
+
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	rlim = custom_max_open_files;
+
+	/* If we're already at the max we reached our limit */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		/* We made sure not to exceed the hard limit, so this shouldn't fail */
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Saves the original open file limit (RLIMIT_NOFILE) the first time this
+ * is called. If called again it's a no-op.
+ *
+ * If getrlimit() fails, a WARNING is emitted and the limit is left unsaved.
+ */
+static void
+SaveOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	int			status;
+
+	if (saved_original_max_open_files)
+	{
+		/* Already saved, no need to do it again */
+		return;
+	}
+
+	status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
+	if (status != 0)
+	{
+		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+		return;
+	}
+
+	custom_max_open_files = original_max_open_files;
+	saved_original_max_open_files = true;
+	return;
+#endif
+}
+
+/*
+ * UseOriginalOpenFileLimit --- Makes the process use the original open file
+ * 		limit that was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+static void
+UseOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * UseCustomOpenFileLimit --- Makes the process use our custom open file limit
+ * 		after that we configured based on the max_files_per_process GUC.
+ */
+static void
+UseCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -969,38 +1122,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			highestfd = 0;
 	int			j;
 
-#ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
-	int			getrlimit_status;
-#endif
-
 	size = 1024;
 	fd = (int *) palloc(size * sizeof(int));
 
-#ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
-	if (getrlimit_status != 0)
-		ereport(WARNING, (errmsg("getrlimit failed: %m")));
-#endif							/* HAVE_GETRLIMIT */
+	SaveOriginalOpenFileLimit();
 
 	/* dup until failure or probe limit reached */
 	for (;;)
 	{
 		int			thisfd;
 
-#ifdef HAVE_GETRLIMIT
-
 		/*
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
-#endif
+		if (IsOpenFileLimit(highestfd))
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+			/*
+			 * Even though we do the pre-check above, it's still possible that
+			 * the call to dup fails with EMFILE. This can happen if the last
+			 * file descriptor was already assigned to an "already open" file.
+			 * One example of this happening is if we're already at the soft
+			 * limit when we call count_usable_fds.
+			 */
+			if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -2750,6 +2904,7 @@ pg_system(const char *command, uint32 wait_event_info)
 {
 	int			rc;
 
+	UseOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 
@@ -2772,6 +2927,7 @@ pg_system(const char *command, uint32 wait_event_info)
 		PostRestoreCommand();
 
 	pgstat_report_wait_end();
+	UseCustomOpenFileLimit();
 	return rc;
 }
 
@@ -2805,6 +2961,19 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+
+	/*
+	 * It would be great if we could call UseOriginalOpenFileLimit here, but
+	 * since popen() also opens a file in the current process (this side of the
+	 * pipe) we cannot do so safely, because we might already have many more
+	 * files open than the original limit.
+	 *
+	 * The only way to address this would be implementing a custom popen() that
+	 * calls UseOriginalOpenFileLimit only around the actual fork call, but
+	 * that seems too much effort to handle the corner case where this external
+	 * command uses both select() and tries to open more files than select()
+	 * allows for.
+	 */
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
-- 
2.43.0

v8-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch (text/x-patch)
From 298c3c436eb4535df95d7efb0b17105cc6e6c770 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v8 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b3a58deef43..8cee38e6c17 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1199,6 +1199,7 @@ set_max_safe_fds(void)
 {
 	int			usable_fds;
 	int			already_open;
+	char	   *max_safe_fds_string;
 
 	/*----------
 	 * We want to set max_safe_fds to
@@ -1214,6 +1215,16 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds, max_files_per_process);
 
+	/*
+	 * Update the GUC variable so that users can see if the value that is
+	 * actually used turns out to be different than what they had
+	 * configured.
+	 */
+	max_safe_fds_string = psprintf("%d", max_safe_fds);
+	SetConfigOption("max_files_per_process", max_safe_fds_string,
+					PGC_POSTMASTER, PGC_S_OVERRIDE);
+	pfree(max_safe_fds_string);
+
 	/*
 	 * Take off the FDs reserved for system() etc.
 	 */
-- 
2.43.0

#27 Peter Eisentraut
peter@eisentraut.org
In reply to: Jelte Fennema-Nio (#26)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On 13.04.25 21:30, Jelte Fennema-Nio wrote:

On Fri Apr 4, 2025 at 7:34 PM CEST, Heikki Linnakangas wrote:

Let's move that 'in_restore_command' business to the caller. It's
weird modularity for the function to implicitly behave differently for
some callers.

I definitely agree with the sentiment, and it was what I originally
planned to do. But then I went for this approach anyway because commit
8fb13dd6ab5b explicitly moved all code except for the actual call to
system() out of the PreRestoreCommand()/PostRestoreCommand() section
(which is also described in the code comment).
So I didn't move the in_restore_command stuff to the caller, and
improved the function comment to call out this unfortunate coupling.

And 'wait_event_info' should only affect pgstat reporting, not actual
behavior.

Given that we need to keep the restore command stuff in this function, I
think the only other option is to add a dedicated argument for the
restore command stuff, like "bool is_restore_command". But that would
require every caller, except for the restore command, to pass an
additional "false" as an argument. To me the additional noise that that
adds seems like a worse issue than the impurity we get by
piggy-backing on the wait_event_info argument.

I don't feel good about the function name. How about pg_system() or
something?

This patch set is showing compiler warnings because pg_system() wasn't
properly declared where needed. Please provide an update that builds
cleanly.

Also, it appears the patch for pgbench disappeared from the series. Was
that intentional?

#28 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Peter Eisentraut (#27)
3 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Wed, 29 Oct 2025 at 11:11, Peter Eisentraut <peter@eisentraut.org> wrote:

This patch set is showing compiler warnings because pg_system() wasn't
properly declared where needed. Please provide an update that builds
cleanly.

It still compiled fine on my local branch, but indeed after a rebase it no longer does. I guess some include got removed from some header in the meantime. Attached is an updated version (which adds a storage/fd.h include).

Also, it appears the patch for pgbench disappeared from the series. Was
that intentional?

Yes, the pgbench change got committed for PG18 already (see Andres' last message in the thread).

Attachments:

v8-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch (text/x-patch)
From 9e478de009b5c1a77ca47fc268cb51adbb657dd8 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Sun, 23 Feb 2025 16:52:29 +0100
Subject: [PATCH v8 1/3] Adds a helper for places that shell out to system()

We need to call system() in a few places, and to do so safely we need
some pre/post-fork logic. This encapsulates that logic into a single
System helper function. The main reason to do this is that in a follow
up commit we want to add even more logic pre and post-fork.
---
 src/backend/access/transam/xlogarchive.c | 29 ++--------------
 src/backend/archive/shell_archive.c      |  6 ++--
 src/backend/postmaster/startup.c         |  1 +
 src/backend/storage/file/fd.c            | 42 ++++++++++++++++++++++++
 src/include/storage/fd.h                 |  1 +
 5 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..c7640ec5025 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,27 +158,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-
-	/*
-	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
-	 * that it should proc_exit() right away.  This is done for the duration
-	 * of the system() call because there isn't a good way to break out while
-	 * it is executing.  Since we might call proc_exit() in a signal handler,
-	 * it is best to put any additional logic before or after the
-	 * PreRestoreCommand()/PostRestoreCommand() section.
-	 */
-	PreRestoreCommand();
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	rc = system(xlogRestoreCmd);
-
-	PostRestoreCommand();
-
-	pgstat_report_wait_end();
+	/* Copy xlog from archival storage to XLOGDIR */
+	rc = pg_system(xlogRestoreCmd, WAIT_EVENT_RESTORE_COMMAND);
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
@@ -325,11 +306,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(wait_event_info);
-	rc = system(xlogRecoveryCmd);
-	pgstat_report_wait_end();
-
+	rc = pg_system(xlogRecoveryCmd, wait_event_info);
 	pfree(xlogRecoveryCmd);
 
 	if (rc != 0)
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..5dd8e2c7247 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -22,6 +22,7 @@
 #include "archive/shell_archive.h"
 #include "common/percentrepl.h"
 #include "pgstat.h"
+#include "storage/fd.h"
 
 static bool shell_archive_configured(ArchiveModuleState *state);
 static bool shell_archive_file(ArchiveModuleState *state,
@@ -75,10 +76,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
-	rc = system(xlogarchcmd);
-	pgstat_report_wait_end();
+	rc = pg_system(xlogarchcmd, WAIT_EVENT_ARCHIVE_COMMAND);
 
 	if (rc != 0)
 	{
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 27e86cf393f..783eb88e59d 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -33,6 +33,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/timeout.h"
+#include "utils/wait_event_types.h"
 
 
 #ifndef USE_POSTMASTER_DEATH_SIGNAL
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a4ec7959f31..9f3b58c3767 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2731,6 +2731,48 @@ OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	return -1;					/* failure */
 }
 
+/*
+ * A custom wrapper around the system() function that calls the necessary
+ * functions pre/post-fork.
+ *
+ * If WAIT_EVENT_RESTORE_COMMAND is passed as the wait_event_info, it will also
+ * call the necessary PreRestoreCommand/PostRestoreCommand functions. It's
+ * unfortunate that we have to couple the behaviour of this function so
+ * tightly to the restore command logic, but it's the only way to make sure
+ * that we don't have additional logic in between the PreRestoreCommand and
+ * PostRestoreCommand calls.
+ */
+int
+pg_system(const char *command, uint32 wait_event_info)
+{
+	int			rc;
+
+	fflush(NULL);
+	pgstat_report_wait_start(wait_event_info);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+	{
+		/*
+		 * PreRestoreCommand() informs the SIGTERM handler for the startup
+		 * process that it should proc_exit() right away.  This is done for
+		 * the duration of the system() call because there isn't a good way to
+		 * break out while it is executing.  Since we might call proc_exit()
+		 * in a signal handler, it is best to put any additional logic before
+		 * or after the PreRestoreCommand()/PostRestoreCommand() section.
+		 */
+		PreRestoreCommand();
+	}
+
+	rc = system(command);
+
+	if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND)
+		PostRestoreCommand();
+
+	pgstat_report_wait_end();
+	return rc;
+}
+
+
 /*
  * Routines that want to initiate a pipe stream should use OpenPipeStream
  * rather than plain popen().  This lets fd.c deal with freeing FDs if
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index b77d8e5e30e..2d445674a1a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -187,6 +187,7 @@ extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern bool pg_file_exists(const char *name);
 extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
+extern int	pg_system(const char *command, uint32 wait_event_info);
 extern int	pg_truncate(const char *path, off_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);

base-commit: ad25744f436ed7809fc754e1a44630b087812fbc
-- 
2.51.1

v8-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch (text/x-patch)
From 557a63fb212e89f0448b08025fb5c88c46ea198e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v8 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is that doing so would break
legacy programs that use the select(2) system call in hard-to-debug
ways. So instead, programs that want to opt in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds; previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed easier
for users to understand, because now the limit of files that
they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at AIO
there is a large number of reports from people that require changing
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not possible
for users to change the soft limit, because they don't control the way
that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shelling out to other executables. This is done as a
precaution in case those executables are using select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 src/backend/storage/file/fd.c | 199 +++++++++++++++++++++++++++++++---
 1 file changed, 184 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9f3b58c3767..37de12fbb7e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -158,6 +158,13 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static bool saved_original_max_open_files;
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -943,6 +950,152 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+/*
+ * Returns true if the passed-in highestfd is the last file descriptor that
+ * we're allowed to open based on our current soft open file limit.
+ */
+static bool
+IsOpenFileLimit(int highestfd)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	return highestfd >= custom_max_open_files.rlim_cur - 1;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+#ifdef HAVE_GETRLIMIT
+	struct rlimit rlim;
+
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	rlim = custom_max_open_files;
+
+	/* If we're already at the max we reached our limit */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		/* We made sure not to exceed the hard limit, so this shouldn't fail */
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Saves the original open file limit (RLIMIT_NOFILE) the first time this
+ * is called. If called again it's a no-op.
+ *
+ * On getrlimit() failure a WARNING is emitted and nothing is saved.
+ */
+static void
+SaveOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	int			status;
+
+	if (saved_original_max_open_files)
+	{
+		/* Already saved, no need to do it again */
+		return;
+	}
+
+	status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
+	if (status != 0)
+	{
+		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+		return;
+	}
+
+	custom_max_open_files = original_max_open_files;
+	saved_original_max_open_files = true;
+	return;
+#endif
+}
+
+/*
+ * UseOriginalOpenFileLimit --- Makes the process use the original open file
+ * 		limit that was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+static void
+UseOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * UseCustomOpenFileLimit --- Makes the process use our custom open file limit
+ * 		that we configured based on the max_files_per_process GUC.
+ */
+static void
+UseCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -966,38 +1119,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			highestfd = 0;
 	int			j;
 
-#ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
-	int			getrlimit_status;
-#endif
-
 	size = 1024;
 	fd = (int *) palloc(size * sizeof(int));
 
-#ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
-	if (getrlimit_status != 0)
-		ereport(WARNING, (errmsg("getrlimit failed: %m")));
-#endif							/* HAVE_GETRLIMIT */
+	SaveOriginalOpenFileLimit();
 
 	/* dup until failure or probe limit reached */
 	for (;;)
 	{
 		int			thisfd;
 
-#ifdef HAVE_GETRLIMIT
-
 		/*
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
-#endif
+		if (IsOpenFileLimit(highestfd))
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+			/*
+			 * Even though we do the pre-check above, it's still possible that
+			 * the call to dup fails with EMFILE. This can happen if the last
+			 * file descriptor was already assigned to an "already open" file.
+			 * One example of this happening is if we're already at the soft
+			 * limit when we call count_usable_fds.
+			 */
+			if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -2747,6 +2901,7 @@ pg_system(const char *command, uint32 wait_event_info)
 {
 	int			rc;
 
+	UseOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 
@@ -2769,6 +2924,7 @@ pg_system(const char *command, uint32 wait_event_info)
 		PostRestoreCommand();
 
 	pgstat_report_wait_end();
+	UseCustomOpenFileLimit();
 	return rc;
 }
 
@@ -2802,6 +2958,19 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+
+	/*
+	 * It would be great if we could call UseOriginalOpenFileLimit here, but
+	 * since popen() also opens a file in the current process (this side of the
+	 * pipe) we cannot do so safely. Because we might already have many more
+	 * files open than the original limit.
+	 *
+	 * The only way to address this would be implementing a custom popen() that
+	 * calls UseOriginalOpenFileLimit only around the actual fork call, but
+	 * that seems too much effort to handle the corner case where this external
+	 * command uses both select() and tries to open more files than select()
+	 * allows for.
+	 */
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
-- 
2.51.1

v8-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch (text/x-patch)
From 6bf857b0ddb43b2a5abb62bcdc5669debc96edab Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-tech@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v8 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 37de12fbb7e..3cd99054c9f 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1196,6 +1196,7 @@ set_max_safe_fds(void)
 {
 	int			usable_fds;
 	int			already_open;
+	char	   *max_safe_fds_string;
 
 	/*----------
 	 * We want to set max_safe_fds to
@@ -1211,6 +1212,16 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds, max_files_per_process);
 
+	/*
+	 * Update the GUC variable so that users can see if the value that is
+	 * actually used turns out to be different than what they had
+	 * configured.
+	 */
+	max_safe_fds_string = psprintf("%d", max_safe_fds);
+	SetConfigOption("max_files_per_process", max_safe_fds_string,
+					PGC_POSTMASTER, PGC_S_OVERRIDE);
+	pfree(max_safe_fds_string);
+
 	/*
 	 * Take off the FDs reserved for system() etc.
 	 */
-- 
2.51.1

#29 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andres Freund (#3)
2 attachment(s)
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

On Tue Feb 11, 2025 at 8:42 PM CET, Andres Freund wrote:

Not sure that's quite the right thing to do for postmaster. What I'd start
with is to increase the soft limit to
"already used files" + max_files_per_process.

I needed to rebase this patch, and that made me finally take the time to
do the restoration of the file limit for subprocesses properly: In
previous versions of this patch it restored the limit before the call to
system(), and it didn't restore it at all for popen(). This latest version
of the patch adds custom pg_system() and pg_popen() functions that restore
the limits in the child process right after the fork, but before the
exec.

There are two reasons to do this:
1. Any executables that still use select(2) will get clear "out of file
descriptors" errors instead of failing in mysterious ways.
2. Looking ahead to when we'll have multi-threading (which this change is
   needed for), it would be problematic to restore the original limit
temporarily in the postgres process tree. Some other thread might
want to open a file while the limit is too low. By only calling
setrlimit with the lower value in the child process there's not a
   single moment where the original Postgres process has too low a limit.

Attachments:

v9-0001-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch (text/x-patch)
From 699c49d1f75c143277007b6171c9499bfb6757fb Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <postgres@jeltef.nl>
Date: Sun, 7 Dec 2025 15:07:08 +0100
Subject: [PATCH v9 1/2] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason that this hasn't changed is that doing so would break
legacy programs that use the select(2) system call in hard-to-debug
ways. So instead, programs that want to opt in to a higher open file
limit are expected to bump their soft limit to their hard limit on
startup. Details on this are very well explained in a blogpost by the
systemd author[1]. There's also a similar change done by the Go
language[2].

This starts bumping the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds; previously this would be set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed easier
for users to understand, because now the limit of files that
they can open is actually what they configured in max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work, because io_uring consumes a lot of
file descriptors. Even without looking at AIO there is a large number of
reports from people that require changing their soft file limit before
starting Postgres, sometimes falling back to lowering
max_files_per_process when they fail to do so[3-8].
It's also not all that strange to fail at setting the soft open file
limit because there are multiple places where one can configure such
limits and usually only one of them is effective (which one depends on
how Postgres is started). In cloud environments it's also often not
possible for users to change the soft limit, because they don't control
the way that Postgres is started. Finally, for the multi-threading work
this is also very important because then the open file limit is not
per-backend anymore, but for the whole system.

The most complex change in this patch is how we shell out to other
system executables. Instead of using system() and popen() directly we now
implement our own versions of these commands that restore the original
file limits before starting the other executables. This is done as a
precaution in case those executables are using select(2).

P.S. The initdb test needed to be updated because on our OpenBSD CI the
stderr would now contain:

LOG:  increased open file limit to 1004

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 meson.build                              |   1 +
 src/backend/access/transam/xlogarchive.c |  18 +-
 src/backend/archive/shell_archive.c      |   3 +-
 src/backend/storage/file/fd.c            | 442 +++++++++++++++++++++--
 src/bin/initdb/t/001_initdb.pl           |   6 +-
 src/include/pg_config.h.in               |   3 +
 src/include/storage/fd.h                 |   1 +
 7 files changed, 437 insertions(+), 37 deletions(-)

diff --git a/meson.build b/meson.build
index 6e7ddd74683..ac778b87818 100644
--- a/meson.build
+++ b/meson.build
@@ -2891,6 +2891,7 @@ func_checks = [
   ['localeconv_l'],
   ['mbstowcs_l'],
   ['mkdtemp'],
+  ['pipe2'],
   ['posix_fadvise'],
   ['posix_fallocate'],
   ['ppoll'],
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..5a99ddc291e 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,10 +158,9 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-
 	/*
+	 * Copy xlog from archival storage to XLOGDIR
+	 *
 	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
 	 * that it should proc_exit() right away.  This is done for the duration
 	 * of the system() call because there isn't a good way to break out while
@@ -169,16 +168,13 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	 * it is best to put any additional logic before or after the
 	 * PreRestoreCommand()/PostRestoreCommand() section.
 	 */
+	fflush(NULL);
+	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
 	PreRestoreCommand();
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	rc = system(xlogRestoreCmd);
-
+	rc = pg_system(xlogRestoreCmd);
 	PostRestoreCommand();
-
 	pgstat_report_wait_end();
+
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
@@ -327,7 +323,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	 */
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
-	rc = system(xlogRecoveryCmd);
+	rc = pg_system(xlogRecoveryCmd);
 	pgstat_report_wait_end();
 
 	pfree(xlogRecoveryCmd);
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..beb81ee7d9d 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -22,6 +22,7 @@
 #include "archive/shell_archive.h"
 #include "common/percentrepl.h"
 #include "pgstat.h"
+#include "storage/fd.h"
 
 static bool shell_archive_configured(ArchiveModuleState *state);
 static bool shell_archive_file(ArchiveModuleState *state,
@@ -77,7 +78,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
-	rc = system(xlogarchcmd);
+	rc = pg_system(xlogarchcmd);
 	pgstat_report_wait_end();
 
 	if (rc != 0)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e9eaaf9c829..40212d733e9 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -80,6 +80,7 @@
 #include <sys/types.h>
 #ifndef WIN32
 #include <sys/mman.h>
+#include <sys/wait.h>
 #endif
 #include <limits.h>
 #include <unistd.h>
@@ -91,6 +92,7 @@
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/pg_prng.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/startup.h"
@@ -158,6 +160,13 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static bool saved_original_max_open_files;
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -261,6 +270,11 @@ typedef struct
 		FILE	   *file;
 		DIR		   *dir;
 		int			fd;
+		struct
+		{
+			FILE	   *file;
+			pid_t		pid;
+		}			pipe;
 	}			desc;
 } AllocateDesc;
 
@@ -943,6 +957,128 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+/*
+ * Returns true if the passed-in highestfd is the last file descriptor that
+ * we're allowed to open based on our current soft open file limit.
+ */
+static bool
+IsOpenFileLimit(int highestfd)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	return highestfd >= custom_max_open_files.rlim_cur - 1;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+#ifdef HAVE_GETRLIMIT
+	struct rlimit rlim;
+
+	if (!saved_original_max_open_files)
+	{
+		return false;
+	}
+
+	rlim = custom_max_open_files;
+
+	/* If we're already at the hard limit, there's nothing more to gain */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		/* We made sure not to exceed the hard limit, so this shouldn't fail */
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Saves the original open file limit (RLIMIT_NOFILE) the first time this is
+ * called. If called again it's a no-op.
+ *
+ * On failure a WARNING is emitted and the original limit is not saved.
+ */
+static void
+SaveOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	int			status;
+
+	if (saved_original_max_open_files)
+	{
+		/* Already saved, no need to do it again */
+		return;
+	}
+
+	status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
+	if (status != 0)
+	{
+		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+		return;
+	}
+
+	custom_max_open_files = original_max_open_files;
+	saved_original_max_open_files = true;
+	return;
+#endif
+}
+
+#ifndef WIN32
+/*
+ * UseOriginalOpenFileLimit --- Makes the process use the original open file
+ * 		limit that was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+static void
+UseOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (!saved_original_max_open_files)
+	{
+		return;
+	}
+
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+#endif							/* WIN32 */
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -966,38 +1102,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			highestfd = 0;
 	int			j;
 
-#ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
-	int			getrlimit_status;
-#endif
-
 	size = 1024;
 	fd = (int *) palloc(size * sizeof(int));
 
-#ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
-	if (getrlimit_status != 0)
-		ereport(WARNING, (errmsg("getrlimit failed: %m")));
-#endif							/* HAVE_GETRLIMIT */
+	SaveOriginalOpenFileLimit();
 
 	/* dup until failure or probe limit reached */
 	for (;;)
 	{
 		int			thisfd;
 
-#ifdef HAVE_GETRLIMIT
-
 		/*
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
-#endif
+		if (IsOpenFileLimit(highestfd))
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+			/*
+			 * Even though we do the pre-check above, the call to dup() can
+			 * still fail with EMFILE. This happens when the last allowed
+			 * file descriptor is already assigned to an "already open" file,
+			 * for example when we're already at the soft limit at the time
+			 * count_usable_fds() is called.
+			 */
+			if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -2731,6 +2868,265 @@ OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	return -1;					/* failure */
 }
 
+#ifndef WIN32
+/*
+ * Helper function to fork a child process for executing a shell command.
+ *
+ * This handles all the standard child process setup:
+ * - Blocks signals before fork to avoid race conditions
+ * - Lowers the file limit in the child (UseOriginalOpenFileLimit)
+ * - Resets signal handlers that the backend has set to SIG_IGN
+ * - Unblocks signals in both parent and child after setup
+ *
+ * Returns -1 on fork failure, 0 in the child, or the child PID in the parent.
+ * After this returns in the child, the caller should do any additional setup
+ * (like pipe redirection) and then call exec_shell_command().
+ */
+static pid_t
+fork_for_shell_command(void)
+{
+	pid_t		pid;
+	sigset_t	save_mask;
+
+	/* Block signals during fork to avoid race conditions in child setup */
+	sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
+
+	pid = fork();
+	if (pid == 0)
+	{
+		/* Child process */
+		UseOriginalOpenFileLimit();
+
+		/*
+		 * Reset signal handlers to default. The backend ignores these, and
+		 * SIG_IGN is preserved across exec, so we must reset them to allow
+		 * the child process to respond to signals normally.
+		 */
+		pqsignal(SIGINT, SIG_DFL);
+		pqsignal(SIGQUIT, SIG_DFL);
+		pqsignal(SIGPIPE, SIG_DFL);
+	}
+
+	/* Restore/unblock signals in both parent and child (also if fork failed) */
+	sigprocmask(SIG_SETMASK, &save_mask, NULL);
+
+	return pid;
+}
+
+/*
+ * Execute a shell command via /bin/sh. This function does not return.
+ * Should be called in the child process after fork_for_shell_command().
+ */
+static pg_noreturn void
+exec_shell_command(const char *command)
+{
+	execl("/bin/sh", "sh", "-c", command, (char *) NULL);
+	_exit(127);
+}
+#endif
+
+/*
+ * pg_system - Custom system() that lowers file limit in child process.
+ *
+ * This is needed because we may have bumped the soft RLIMIT_NOFILE limit
+ * above its original value. The child process should use the original limit
+ * to avoid issues with programs that use select(2), which can only handle
+ * FDs up to FD_SETSIZE (typically 1024).
+ *
+ * We can't just call UseOriginalOpenFileLimit() before the standard system()
+ * because that would temporarily lower the limit in the parent process,
+ * which could cause concurrent file operations to fail in a multithreaded
+ * environment.
+ */
+int
+pg_system(const char *command)
+{
+#ifndef WIN32
+	pid_t		pid;
+	int			status;
+	pid_t		waitresult;
+
+	pid = fork_for_shell_command();
+	if (pid == (pid_t) -1)
+		return -1;
+	if (pid == 0)
+		exec_shell_command(command);	/* does not return */
+
+	/* Wait for child */
+	do
+	{
+		waitresult = waitpid(pid, &status, 0);
+	} while (waitresult == (pid_t) -1 && errno == EINTR);
+
+	if (waitresult != pid)
+		return -1;
+
+	return status;
+#else
+	/* On Windows, just use standard system() - no RLIMIT_NOFILE bumping there */
+	return system(command);
+#endif
+}
+
+
+/*
+ * pg_popen - Custom popen() that lowers file limit in child process.
+ *
+ * This is needed because we may have bumped the soft RLIMIT_NOFILE limit
+ * above its original value. The child process should use the original limit
+ * to avoid issues with programs that use select(2), which can only handle
+ * FDs up to FD_SETSIZE (typically 1024).
+ *
+ * We can't just call UseOriginalOpenFileLimit() before popen() of the stdlib
+ * because popen() opens a file descriptor in the parent process (the parent's
+ * side of the pipe), which would fail if we're already above the original
+ * limit.
+ *
+ * Returns the FILE* for the pipe, or NULL on error.
+ * On success, *child_pid is set to the child's PID (or 0 on Windows where
+ * we use standard popen).
+ */
+static FILE *
+pg_popen(const char *command, const char *mode, pid_t *child_pid)
+{
+#ifndef WIN32
+	int			pipe_fds[2];
+	int			parent_fd;
+	int			child_fd;
+	int			child_target_fd;
+	pid_t		pid;
+	FILE	   *pipe_file;
+	int			save_errno;
+
+	/* Only "r" and "w" modes are supported */
+	if (mode[0] == 'r')
+	{
+		/* Parent reads from child's stdout */
+		parent_fd = 0;
+		child_fd = 1;
+		child_target_fd = STDOUT_FILENO;
+	}
+	else if (mode[0] == 'w')
+	{
+		/* Parent writes to child's stdin */
+		parent_fd = 1;
+		child_fd = 0;
+		child_target_fd = STDIN_FILENO;
+	}
+	else
+	{
+		errno = EINVAL;
+		return NULL;
+	}
+
+	/*
+	 * Set close-on-exec to prevent other concurrently spawned child
+	 * processes from inheriting the pipe file descriptors. Such a race
+	 * cannot occur while we're single-threaded, but since this change is
+	 * part of adding multithreading support let's act as if we already had
+	 * it. On systems that don't have pipe2() a race window remains, but we
+	 * minimize it by setting the close-on-exec flag with fcntl() right
+	 * after creating the pipe.
+	 */
+#ifdef HAVE_PIPE2
+	if (pipe2(pipe_fds, O_CLOEXEC) < 0)
+		return NULL;
+#else
+	if (pipe(pipe_fds) < 0)
+		return NULL;
+	if (fcntl(pipe_fds[0], F_SETFD, FD_CLOEXEC) == -1 ||
+		fcntl(pipe_fds[1], F_SETFD, FD_CLOEXEC) == -1)
+	{
+		save_errno = errno;
+		close(pipe_fds[0]);
+		close(pipe_fds[1]);
+		errno = save_errno;
+		return NULL;
+	}
+#endif
+
+	fflush(NULL);
+
+	pid = fork_for_shell_command();
+	if (pid < 0)
+	{
+		/* fork failed */
+		save_errno = errno;
+		close(pipe_fds[0]);
+		close(pipe_fds[1]);
+		errno = save_errno;
+		return NULL;
+	}
+
+	if (pid == 0)
+	{
+		/* Child process - set up pipe redirection and exec */
+		close(pipe_fds[parent_fd]);
+		if (pipe_fds[child_fd] != child_target_fd)
+		{
+			dup2(pipe_fds[child_fd], child_target_fd);
+			close(pipe_fds[child_fd]);
+		}
+
+		exec_shell_command(command);	/* does not return */
+	}
+
+	/* Parent process */
+	close(pipe_fds[child_fd]);
+	pipe_file = fdopen(pipe_fds[parent_fd], mode);
+	if (pipe_file == NULL)
+	{
+		save_errno = errno;
+		close(pipe_fds[parent_fd]);
+		errno = save_errno;
+		return NULL;
+	}
+
+	*child_pid = pid;
+	return pipe_file;
+#else
+	/* On Windows, just use standard popen - no RLIMIT_NOFILE bumping there */
+	fflush(NULL);
+	*child_pid = 0;
+	return popen(command, mode);
+#endif
+}
+
+/*
+ * pg_pclose - Close a pipe opened by pg_popen().
+ *
+ * On Unix, this closes the FILE* and waits for the child process.
+ * On Windows, this just calls pclose().
+ *
+ * Returns the child's exit status (like pclose), or -1 on error.
+ */
+static int
+pg_pclose(FILE *fp, pid_t child_pid)
+{
+#ifndef WIN32
+	int			status;
+	pid_t		waitresult;
+
+	/* Close the FILE* first */
+	if (fclose(fp) != 0)
+		return -1;
+
+	/* Then wait for the child process, like pclose() does */
+	do
+	{
+		waitresult = waitpid(child_pid, &status, 0);
+	} while (waitresult == (pid_t) -1 && errno == EINTR);
+
+	if (waitresult == child_pid)
+		return status;
+	else
+		return -1;
+#else
+	(void) child_pid;			/* unused on Windows */
+	return pclose(fp);
+#endif
+}
+
 /*
  * Routines that want to initiate a pipe stream should use OpenPipeStream
  * rather than plain popen().  This lets fd.c deal with freeing FDs if
@@ -2745,6 +3141,7 @@ OpenPipeStream(const char *command, const char *mode)
 {
 	FILE	   *file;
 	int			save_errno;
+	pid_t		child_pid;
 
 	DO_DB(elog(LOG, "OpenPipeStream: Allocated %d (%s)",
 			   numAllocatedDescs, command));
@@ -2760,22 +3157,21 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
-	fflush(NULL);
-	pqsignal(SIGPIPE, SIG_DFL);
+
 	errno = 0;
-	file = popen(command, mode);
+	file = pg_popen(command, mode, &child_pid);
 	save_errno = errno;
 	pqsignal(SIGPIPE, SIG_IGN);
-	errno = save_errno;
 	if (file != NULL)
 	{
 		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
 
 		desc->kind = AllocateDescPipe;
-		desc->desc.file = file;
+		desc->desc.pipe.file = file;
+		desc->desc.pipe.pid = child_pid;
 		desc->create_subid = GetCurrentSubTransactionId();
 		numAllocatedDescs++;
-		return desc->desc.file;
+		return desc->desc.pipe.file;
 	}
 
 	if (errno == EMFILE || errno == ENFILE)
@@ -2808,7 +3204,7 @@ FreeDesc(AllocateDesc *desc)
 			result = fclose(desc->desc.file);
 			break;
 		case AllocateDescPipe:
-			result = pclose(desc->desc.file);
+			result = pg_pclose(desc->desc.pipe.file, desc->desc.pipe.pid);
 			break;
 		case AllocateDescDir:
 			result = closedir(desc->desc.dir);
@@ -3060,7 +3456,7 @@ ClosePipeStream(FILE *file)
 	{
 		AllocateDesc *desc = &allocatedDescs[i];
 
-		if (desc->kind == AllocateDescPipe && desc->desc.file == file)
+		if (desc->kind == AllocateDescPipe && desc->desc.pipe.file == file)
 			return FreeDesc(desc);
 	}
 
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 1e9543c2585..9a758e26fb7 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -131,7 +131,7 @@ if ($ENV{with_icu} eq 'yes')
 		],
 		'option --icu-locale');
 
-	command_like(
+	command_checks_all(
 		[
 			'initdb', '--no-sync',
 			'--auth' => 'trust',
@@ -145,7 +145,9 @@ if ($ENV{with_icu} eq 'yes')
 			'--lc-time' => 'C',
 			"$tempdir/data4"
 		],
-		qr/^\s+default collation:\s+und\n/ms,
+		0,
+		[qr/^\s+default collation:\s+und\n/ms],
+		[],    # stderr may have LOG messages
 		'options --locale-provider=icu --locale=und --lc-*=C');
 
 	command_fails_like(
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index b0b0cfdaf79..beeee61017d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -307,6 +307,9 @@
 /* Define to 1 if you have the <pam/pam_appl.h> header file. */
 #undef HAVE_PAM_PAM_APPL_H
 
+/* Define to 1 if you have the `pipe2' function. */
+#undef HAVE_PIPE2
+
 /* Define to 1 if you have the `posix_fadvise' function. */
 #undef HAVE_POSIX_FADVISE
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index a1bdefec4a5..4484d557c37 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -186,6 +186,7 @@ extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern bool pg_file_exists(const char *name);
 extern void pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes);
+extern int	pg_system(const char *command);
 extern int	pg_truncate(const char *path, pgoff_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);

base-commit: 6498287696dafc1ebd380ea4eb249124989294d3
-- 
2.52.0

Attachment: v9-0002-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch (text/x-patch)
From 94c19d3857102dbf6470d64295fc263b6e68556e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <postgres@jeltef.nl>
Date: Sun, 7 Dec 2025 15:06:54 +0100
Subject: [PATCH v9 2/2] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This starts reflecting the value of
max_safe_fds in max_files_per_process after our limit detection. We
still want to have two separate variables because for the bootstrap or
standalone-backend cases their values differ on purpose.
---
 src/backend/storage/file/fd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 40212d733e9..2d27207ab89 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1179,6 +1179,7 @@ set_max_safe_fds(void)
 {
 	int			usable_fds;
 	int			already_open;
+	char	   *max_safe_fds_string;
 
 	/*----------
 	 * We want to set max_safe_fds to
@@ -1194,6 +1195,16 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds, max_files_per_process);
 
+	/*
+	 * Update the GUC variable so that users can easily see when the value
+	 * actually in use turns out to be different from the one that they had
+	 * configured.
+	 */
+	max_safe_fds_string = psprintf("%d", max_safe_fds);
+	SetConfigOption("max_files_per_process", max_safe_fds_string,
+					PGC_POSTMASTER, PGC_S_OVERRIDE);
+	pfree(max_safe_fds_string);
+
 	/*
 	 * Take off the FDs reserved for system() etc.
 	 */
-- 
2.52.0