File descriptors in exec'd subprocesses

Started by Thomas Munroalmost 3 years ago13 messages
#1Thomas Munro
thomas.munro@gmail.com
1 attachment(s)

While investigating code paths that use system() and popen() and
trying to write latch-multiplexing replacements, which I'll write
about separately, leaks of $SUBJECT became obvious. On a FreeBSD box,
you can see the sockets, pipes and data and WAL files that the
subprocess inherits:

create table t (x text);
copy t from program 'procstat -f $$';
select * from t;

(On Linux, you might try something like 'ls -slap /proc/self/fd'. Not
sure how to do it on a Mac but note that 'lsof' is no good for this
purpose because the first thing it does is close all descriptors > 2;
maybe 'lsof -p $PPID' inside a shell wrapper so you're analysing the
shell process that sits between postgres and lsof, rather than lsof
itself?)

Since we've started assuming a few other bits of SUSv3 (POSIX 2008)
functionality, the standard that specified O_CLOEXEC, and since in
practice Linux, *BSD, macOS, AIX, Solaris, illumos all have it, I
think we can unconditionally just use it on all files we open. That
is, if we were to make fallback code, it would be untested, and if we
were to do it with fcntl() always it would be a frequent extra system
call that we don't need to support a computer that doesn't exist. For
sockets and pipes, much more rarely created, some systems have
non-standard extensions along the same lines, but I guess we should
stick with standards and call fcntl(FD_CLOEXEC) for now.

There is a place in fd.c that already referenced O_CLOEXEC (it wasn't
really using it, just making an assertion that flags don't collide),
with #ifdef around it, but that was only conditional because at the
time of commit 04cad8f7 we had a macOS 10.4 system (released 2005) in
the 'farm which obviously didn't know about POSIX 2008 interfaces. We
can just remove that #ifdef. (It's probably OK to remove the test of
O_DSYNC too but I'll think about that another time.)

On Windows, handles, at least as we create them, are not inherited so
the problem doesn't come up AFAICS. I *think* if we were to use
Windows' own open(), that would be an issue, but we have our own
CreateFile() thing and it doesn't turn on inheritance IIUC. So I just
gave O_CLOEXEC a zero definition there. It would be interesting to
know what handles a subprocess sees. If someone who knows how to
drive Windows could run a subprogram that just does the equivalent of
'sleep 60' they might be able to see that in one of those handle spy
tools, to visually check the above. (If I'm wrong about that, it
might be possible for a subprocess to interfere with a
ProcSignalBarrier command to close all files, so I'd love to know
about it.)

We were already doing FD_CLOEXEC on the latch self-pipe with comment
"we surely do not want any child processes messing with them", so it's
not like this wasn't a well-known issue before, but I guess it just
never bothered anyone enough to do anything about the more general
problem.

With the attached, the test at the top of this email shows only in,
out, error, and one thing that procstat opened itself.

Are there any more descriptors we need to think about?

Attachments:

0001-Don-t-leak-descriptors-into-subprograms.patchtext/x-patch; charset=US-ASCII; name=0001-Don-t-leak-descriptors-into-subprograms.patchDownload
From 3a267bccd3a4a80f87ae620fc615a310a3f5f1ce Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 5 Feb 2023 12:18:14 +1300
Subject: [PATCH] Don't leak descriptors into subprograms.

Open data and WAL files with O_CLOEXEC from SUSv3, available on all our
target systems.  Do the same for sockets with FD_CLOEXEC.  This means
that programs executed by COPY, or archiving scripts etc, will not be
able to mess with those descriptors (of course nothing stops them from
opening files, so this is more about tidyness and preventing accidental
problems rather than security).
---
 src/backend/access/transam/xlog.c | 9 ++++++---
 src/backend/libpq/pqcomm.c        | 9 +++++++++
 src/backend/storage/file/fd.c     | 2 --
 src/backend/storage/smgr/md.c     | 9 +++++----
 src/include/port/win32_port.h     | 7 +++++++
 5 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fb4c860bde..3b44a0f237 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2937,7 +2937,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
 	*added = false;
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 	{
 		if (errno != ENOENT)
@@ -3098,7 +3099,8 @@ XLogFileInit(XLogSegNo logsegno, TimeLineID logtli)
 		return fd;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3329,7 +3331,8 @@ XLogFileOpen(XLogSegNo segno, TimeLineID tli)
 
 	XLogFilePath(path, tli, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 864c9debe8..d4622533fd 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -200,6 +200,15 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+#ifndef WIN32
+	/*
+	 * Also make sure that any subprocesses that execute a new program don't
+	 * inherit the socket.
+	 */
+	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
+		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
+
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
 	socket_pos = AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
 								   MyProcPort->sock, NULL, NULL);
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 926d000f2e..933f17b398 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1033,10 +1033,8 @@ tryAgain:
 					   O_TRUNC |
 					   O_WRONLY)) == 0,
 					 "PG_O_DIRECT value collides with standard flag");
-#if defined(O_CLOEXEC)
 	StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
 					 "PG_O_DIRECT value collides with O_CLOEXEC");
-#endif
 #if defined(O_DSYNC)
 	StaticAssertStmt((PG_O_DIRECT & O_DSYNC) == 0,
 					 "PG_O_DIRECT value collides with O_DSYNC");
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..c464d6338c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -205,14 +205,15 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = PathNameOpenFile(path,
+						  O_RDWR | O_CREAT | O_EXCL | PG_BINARY | O_CLOEXEC);
 
 	if (fd < 0)
 	{
 		int			save_errno = errno;
 
 		if (isRedo)
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC);
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -523,7 +524,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC);
 
 	if (fd < 0)
 	{
@@ -1188,7 +1189,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | O_CLOEXEC | oflags);
 
 	pfree(fullpath);
 
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 9488195799..a8685f1520 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -353,6 +353,13 @@ extern int	_pglstat64(const char *name, struct stat *buf);
  */
 #define O_DSYNC 0x0080
 
+/*
+ * Our open wrapper does not create inheritable handles, so it is safe to
+ * ignore O_CLOEXEC.  (If we were using Windows' own open(), it might be
+ * necessary to convert this to _O_NOINHERIT.)
+ */
+#define O_CLOEXEC 0
+
 /*
  * Supplement to <errno.h>.
  *
-- 
2.39.1

#2Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#1)
Re: File descriptors in exec'd subprocesses

On Sun, Feb 5, 2023 at 1:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:

SUSv3 (POSIX 2008)

Oh, oops, 2008 actually corresponds to SUSv4. Hmm.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#2)
Re: File descriptors in exec'd subprocesses

Thomas Munro <thomas.munro@gmail.com> writes:

On Sun, Feb 5, 2023 at 1:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:

SUSv3 (POSIX 2008)

Oh, oops, 2008 actually corresponds to SUSv4. Hmm.

Worst case, if we come across some allegedly-supported platform without
O_CLOEXEC, we #define that to zero. Said platform is no worse off
than it was before.

regards, tom lane

#4Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#1)
Re: File descriptors in exec'd subprocesses

Hi,

Unsurprisingly I'm in favor of this.

On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:

Are there any more descriptors we need to think about?

Postmaster's listen sockets? Saves a bunch of syscalls, at least. Logging collector pipe write end, in backends?

Greetings,

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#4)
Re: File descriptors in exec'd subprocesses

Andres Freund <andres@anarazel.de> writes:

On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:

Are there any more descriptors we need to think about?

Postmaster's listen sockets?

I wonder whether O_CLOEXEC on that would be inherited by the
client-communication sockets, though. That's fine ... unless you
are doing EXEC_BACKEND.

regards, tom lane

#6Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: File descriptors in exec'd subprocesses

Hi,

On 2023-02-05 11:06:13 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:

Are there any more descriptors we need to think about?

Postmaster's listen sockets?

I wonder whether O_CLOEXEC on that would be inherited by the
client-communication sockets, though.

I'd be very suprised if it were.

<hack>

Nope, at least not on linux. Verified by looking at /proc/*/fdinfo/n
after adding SOCK_CLOEXEC to just the socket() call. 'flags' changes
from 02 -> 02000002 for the listen socket, but stays at 04002 for the
client socket. If I add SOCK_CLOEXEC to accept() (well, accept4()), it
does change from 04002 to 02004002.

Greetings,

Andres Freund

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#4)
3 attachment(s)
Re: File descriptors in exec'd subprocesses

On Mon, Feb 6, 2023 at 3:29 AM Andres Freund <andres@anarazel.de> wrote:

On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:

Are there any more descriptors we need to think about?

Postmaster's listen sockets? Saves a bunch of syscalls, at least.

Assuming you mean accepted sockets, yeah, I see how two save two
syscalls there, and since you nerd-sniped me into looking into the
SOCK_CLOEXEC landscape, I like it even more now that I've understood
that accept4() is rubber-stamped for the next revision of POSIX[1]https://www.austingroupbugs.net/view.php?id=411 and
is already accepted almost everywhere. It's not just window dressing,
you need it to write multi-threaded programs that fork/exec without
worrying about the window between fd creation and fcntl(FD_CLOEXEC) in
another thread; hopefully one day we will care about that sort of
thing in some places too! Here's a separate patch for that.

I *guess* we need HAVE_DECL_ACCEPT4 for the guarded availability
system (cf pwritev) when Apple gets the memo, but see below. Hard to
say if AIX is still receiving memos (cf recent speculation in the
Register). All other target OSes seem to have had this stuff for a
while.

Since client connections already do fcntl(FD_CLOEXEC), postgres_fdw
connections didn't have this problem. It seems reasonable to want to
skip a couple of system calls there too; also, client programs might
also be interested in future-POSIX's atomic race-free close-on-exec
socket fabrication. So here also is a patch to use SOCK_CLOEXEC on
that end too, if available.

But ... hmph, all we can do here is test for the existence of
SOCK_NONBLOCK and SOCK_CLOEXEC, since there is no new function to test
for. Maybe we should just assume accept4() also exists if these exist
(it's hard to imagine that Apple or IBM would address atomicity on one
end but not the other of a socket), but predictions are so difficult,
especially about the future! Anyone want to guess if it's better to
leave the meson/configure probe in for the accept4 end or just roll
with the macros?

Logging collector pipe write end, in backends?

The write end of the logging pipe is already closed, after dup2'ing it
to STDOUT_FILENO to STDERR_FILENO, so archive commands and suchlike do
receive the handle, but they want them. It's the intended and
documented behaviour that anything written to that will finish up in
the log.

As for pipe2(O_CLOEXEC), I see the point of it in a multi-threaded
application. It's not terribly useful for us though, because we
usually want to close only one end, except in the case of the
self-pipe. But the self-pipe is no longer used on the systems that
have pipe2()-from-the-future.

I haven't tested this under EXEC_BACKEND yet.

[1]: https://www.austingroupbugs.net/view.php?id=411

Attachments:

v2-0001-Don-t-leak-descriptors-into-subprograms.patchapplication/x-patch; name=v2-0001-Don-t-leak-descriptors-into-subprograms.patchDownload
From c9a61ba8eaddb197c31095a163f7efdf2e3ea797 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 5 Feb 2023 12:18:14 +1300
Subject: [PATCH v2 1/3] Don't leak descriptors into subprograms.

Open data and WAL files with O_CLOEXEC (POSIX 2018, SUSv4).  All of our
target systems have it, except Windows.  Windows already has that
behavior, because we wrote our own open() implementation.

Do the same for sockets with FD_CLOEXEC.  (A separate commit will
improve this with SOCK_CLOEXEC on some systems, but we'll need the
FD_CLOEXEC path as a fallback.)

This means that programs executed by COPY and archiving commands etc,
will not be able to mess with those descriptors (of course nothing stops
them from opening files, so this is more about tidyness and preventing
accidental problems than security).

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 9 ++++++---
 src/backend/libpq/pqcomm.c        | 9 +++++++++
 src/backend/storage/file/fd.c     | 2 --
 src/backend/storage/smgr/md.c     | 9 +++++----
 src/include/port/win32_port.h     | 7 +++++++
 5 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fb4c860bde..3b44a0f237 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2937,7 +2937,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
 	*added = false;
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 	{
 		if (errno != ENOENT)
@@ -3098,7 +3099,8 @@ XLogFileInit(XLogSegNo logsegno, TimeLineID logtli)
 		return fd;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3329,7 +3331,8 @@ XLogFileOpen(XLogSegNo segno, TimeLineID tli)
 
 	XLogFilePath(path, tli, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 864c9debe8..d4622533fd 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -200,6 +200,15 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+#ifndef WIN32
+	/*
+	 * Also make sure that any subprocesses that execute a new program don't
+	 * inherit the socket.
+	 */
+	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
+		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
+
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
 	socket_pos = AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
 								   MyProcPort->sock, NULL, NULL);
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 926d000f2e..933f17b398 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1033,10 +1033,8 @@ tryAgain:
 					   O_TRUNC |
 					   O_WRONLY)) == 0,
 					 "PG_O_DIRECT value collides with standard flag");
-#if defined(O_CLOEXEC)
 	StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
 					 "PG_O_DIRECT value collides with O_CLOEXEC");
-#endif
 #if defined(O_DSYNC)
 	StaticAssertStmt((PG_O_DIRECT & O_DSYNC) == 0,
 					 "PG_O_DIRECT value collides with O_DSYNC");
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..c464d6338c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -205,14 +205,15 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = PathNameOpenFile(path,
+						  O_RDWR | O_CREAT | O_EXCL | PG_BINARY | O_CLOEXEC);
 
 	if (fd < 0)
 	{
 		int			save_errno = errno;
 
 		if (isRedo)
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC);
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -523,7 +524,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC);
 
 	if (fd < 0)
 	{
@@ -1188,7 +1189,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | O_CLOEXEC | oflags);
 
 	pfree(fullpath);
 
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 9488195799..a8685f1520 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -353,6 +353,13 @@ extern int	_pglstat64(const char *name, struct stat *buf);
  */
 #define O_DSYNC 0x0080
 
+/*
+ * Our open wrapper does not create inheritable handles, so it is safe to
+ * ignore O_CLOEXEC.  (If we were using Windows' own open(), it might be
+ * necessary to convert this to _O_NOINHERIT.)
+ */
+#define O_CLOEXEC 0
+
 /*
  * Supplement to <errno.h>.
  *
-- 
2.39.1

v2-0002-Use-accept4-to-accept-connections-where-available.patchapplication/x-patch; name=v2-0002-Use-accept4-to-accept-connections-where-available.patchDownload
From fd2bb0e83193fc337a8fc7e9e369387630252bd0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 13:51:19 +1300
Subject: [PATCH v2 2/3] Use accept4() to accept connections, where available.

The postmaster previously accepted connections and then made extra
system calls in child processes to make them non-blocking and (as of a
recent commit) close-on-exit.

On nearly all systems, that can be done atomically with flags to the
newer accept4() variant (recently accepted by POSIX for a future
revision, but around in the wild for a while now), saving a couple of
system calls.

The exceptions are Windows, where we didn't have to worry about that
problem anyway, EXEC_BACKEND builds on Unix (used by developers only for
slightly-more-like-Windows testing), and the straggler Unixes that
haven't implemented it yet (at the time of writing: macOS and AIX).

Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 configure                  | 12 ++++++++++++
 configure.ac               |  1 +
 meson.build                |  1 +
 src/backend/libpq/pqcomm.c | 33 +++++++++++++++++++++++++++------
 src/include/pg_config.h.in |  4 ++++
 5 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/configure b/configure
index 5d07fd0bb9..514da7c30f 100755
--- a/configure
+++ b/configure
@@ -16207,6 +16207,18 @@ _ACEOF
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+ac_fn_c_check_decl "$LINENO" "accept4" "ac_cv_have_decl_accept4" "#include <sys/socket.h>
+"
+if test "x$ac_cv_have_decl_accept4" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_ACCEPT4 $ac_have_decl
+_ACEOF
+
 ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
diff --git a/configure.ac b/configure.ac
index e9b74ced6c..6c10592fea 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1841,6 +1841,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
 
diff --git a/meson.build b/meson.build
index e379a252a5..c20a339340 100644
--- a/meson.build
+++ b/meson.build
@@ -2094,6 +2094,7 @@ decl_checks = [
 # checking for library symbols wouldn't handle deployment target
 # restrictions on macOS
 decl_checks += [
+  ['accept4', 'sys/socket.h'],
   ['preadv', 'sys/uio.h'],
   ['pwritev', 'sys/uio.h'],
 ]
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index d4622533fd..70734f1a4b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -184,6 +184,14 @@ pq_init(void)
 	/* set up process-exit hook to close the socket */
 	on_proc_exit(socket_close, 0);
 
+#ifndef WIN32
+	/*
+	 * On most systems, the socket has already been made non-blocking and
+	 * close-on-exec by StreamConnection().  On systems that don't have
+	 * accept4() yet, and EXEC_BACKEND builds that need the socket to survive
+	 * exec*(), we do that separately here in the child process.
+	 */
+#if !HAVE_DECL_ACCEPT4 || defined(EXEC_BACKEND)
 	/*
 	 * In backends (as soon as forked) we operate the underlying socket in
 	 * nonblocking mode and use latches to implement blocking semantics if
@@ -194,19 +202,17 @@ pq_init(void)
 	 * the client, which might require changing the mode again, leading to
 	 * infinite recursion.
 	 */
-#ifndef WIN32
 	if (!pg_set_noblock(MyProcPort->sock))
 		ereport(COMMERROR,
 				(errmsg("could not set socket to nonblocking mode: %m")));
-#endif
 
-#ifndef WIN32
 	/*
 	 * Also make sure that any subprocesses that execute a new program don't
 	 * inherit the socket.
 	 */
 	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
 		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
 #endif
 
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
@@ -701,9 +707,24 @@ StreamConnection(pgsocket server_fd, Port *port)
 {
 	/* accept connection and fill in the client (remote) address */
 	port->raddr.salen = sizeof(port->raddr.addr);
-	if ((port->sock = accept(server_fd,
-							 (struct sockaddr *) &port->raddr.addr,
-							 &port->raddr.salen)) == PGINVALID_SOCKET)
+
+#if HAVE_DECL_ACCEPT4 && !defined(EXEC_BACKEND)
+	/*
+	 * If we have accept4(), we can avoid a couple of fcntl() calls in
+	 * pq_init().  We can't use this optimization in EXEC_BACKEND builds,
+	 * though, because they need the socket to survive exec().
+	 */
+	port->sock = accept4(server_fd,
+						 (struct sockaddr *) &port->raddr.addr,
+						 &port->raddr.salen,
+						 SOCK_CLOEXEC | SOCK_NONBLOCK);
+#else
+	port->sock = accept(server_fd,
+						(struct sockaddr *) &port->raddr.addr,
+						&port->raddr.salen);
+#endif
+
+	if (port->sock == PGINVALID_SOCKET)
 	{
 		ereport(LOG,
 				(errcode_for_socket_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c5a80b829e..b4777d7c77 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -91,6 +91,10 @@
 /* Define to 1 if you have the `CRYPTO_lock' function. */
 #undef HAVE_CRYPTO_LOCK
 
+/* Define to 1 if you have the declaration of `accept4', and to 0 if you
+   don't. */
+#undef HAVE_DECL_ACCEPT4
+
 /* Define to 1 if you have the declaration of `fdatasync', and to 0 if you
    don't. */
 #undef HAVE_DECL_FDATASYNC
-- 
2.39.1

v2-0003-Use-newer-client-socket-options-where-available.patchapplication/x-patch; name=v2-0003-Use-newer-client-socket-options-where-available.patchDownload
From 8cd306114867b24640e70491eb35f61bf6771457 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 14:54:03 +1300
Subject: [PATCH v2 3/3] Use newer client socket options where available.

As a recent commit did for the server side, we can now also try to set
client sockets to nonblocking and close-on-exit atomically when creating
the socket.  This uses flags expected in the next POSIX revision, but
already widely available on common platforms.

This saves a couple of fcntl() calls, and plugs a tiny window for
resource leaks in multi-threaded programs that might fork-and-exec while
another thread would otherwise have to run socket() and then fcntl().

Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 50b5df3490..e25ba94863 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2464,6 +2464,7 @@ keep_going:						/* We will come back to here until there is
 				{
 					struct addrinfo *addr_cur = conn->addr_cur;
 					char		host_addr[NI_MAXHOST];
+					int			sock_type;
 
 					/*
 					 * Advance to next possible host, if we've tried all of
@@ -2494,7 +2495,23 @@ keep_going:						/* We will come back to here until there is
 						conn->connip = strdup(host_addr);
 
 					/* Try to create the socket */
-					conn->sock = socket(addr_cur->ai_family, SOCK_STREAM, 0);
+					sock_type = SOCK_STREAM;
+#ifdef SOCK_CLOEXEC
+					/*
+					 * Atomically mark close-on-exec, if possible on this
+					 * platform, for the benefit of multi-threaded programs.
+					 * Otherwise see below.
+					 */
+					sock_type |= SOCK_CLOEXEC;
+#endif
+#ifdef SOCK_NONBLOCK
+					/*
+					 * We might as well skip a system call for nonblocking mode
+					 * too, if we can.
+					 */
+					sock_type |= SOCK_NONBLOCK;
+#endif
+					conn->sock = socket(addr_cur->ai_family, sock_type, 0);
 					if (conn->sock == PGINVALID_SOCKET)
 					{
 						int			errorno = SOCK_ERRNO;
@@ -2540,6 +2557,7 @@ keep_going:						/* We will come back to here until there is
 							goto keep_going;
 						}
 					}
+#ifndef SOCK_NONBLOCK
 					if (!pg_set_noblock(conn->sock))
 					{
 						libpq_append_conn_error(conn, "could not set socket to nonblocking mode: %s",
@@ -2547,7 +2565,9 @@ keep_going:						/* We will come back to here until there is
 						conn->try_next_addr = true;
 						goto keep_going;
 					}
+#endif
 
+#ifndef SOCK_CLOEXEC
 #ifdef F_SETFD
 					if (fcntl(conn->sock, F_SETFD, FD_CLOEXEC) == -1)
 					{
@@ -2557,6 +2577,7 @@ keep_going:						/* We will come back to here until there is
 						goto keep_going;
 					}
 #endif							/* F_SETFD */
+#endif
 
 					if (addr_cur->ai_family != AF_UNIX)
 					{
-- 
2.39.1

#8Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#7)
3 attachment(s)
Re: File descriptors in exec'd subprocesses

I had missed one: the "watch" end of the postmaster pipe also needs FD_CLOEXEC.

Attachments:

v3-0001-Don-t-leak-descriptors-into-subprograms.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Don-t-leak-descriptors-into-subprograms.patchDownload
From 36f8ed2406307f8b1578ae85510c5a07718e1ea8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 20 Feb 2023 23:26:36 +1300
Subject: [PATCH v3 1/3] Don't leak descriptors into subprograms.

Open data and WAL files with O_CLOEXEC, from SUSv4 (POSIX.1-2008).  All
of our target systems have it, except Windows.  Our open()
implementation on Windows already has that behavior, so provide a dummy
O_CLOEXEC flag on that platform.

Do the same for sockets with FD_CLOEXEC.  (Later commits might improve
this with SOCK_CLOEXEC on systems that have it, but we'll need the
FD_CLOEXEC path as a fallback anyway, so do it this way in this initial
commit.)

Likewise for the postmaster death pipe.

This means that programs executed by COPY and archiving commands etc,
will not inherit access to those descriptors.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c |  9 ++++++---
 src/backend/libpq/pqcomm.c        | 10 ++++++++++
 src/backend/storage/file/fd.c     |  2 --
 src/backend/storage/smgr/md.c     |  9 +++++----
 src/backend/utils/init/miscinit.c |  8 ++++++++
 src/include/port/win32_port.h     |  7 +++++++
 6 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..87af608d15 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2936,7 +2936,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
 	*added = false;
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+					   get_sync_bit(sync_method));
 	if (fd < 0)
 	{
 		if (errno != ENOENT)
@@ -3097,7 +3098,8 @@ XLogFileInit(XLogSegNo logsegno, TimeLineID logtli)
 		return fd;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+					   get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3328,7 +3330,8 @@ XLogFileOpen(XLogSegNo segno, TimeLineID tli)
 
 	XLogFilePath(path, tli, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+					   get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 864c9debe8..0ccec33af4 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -200,6 +200,16 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+#ifndef WIN32
+
+	/*
+	 * Also make sure that any subprocesses that execute a new program don't
+	 * inherit the socket.
+	 */
+	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
+		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
+
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
 	socket_pos = AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
 								   MyProcPort->sock, NULL, NULL);
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 926d000f2e..933f17b398 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1033,10 +1033,8 @@ tryAgain:
 					   O_TRUNC |
 					   O_WRONLY)) == 0,
 					 "PG_O_DIRECT value collides with standard flag");
-#if defined(O_CLOEXEC)
 	StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
 					 "PG_O_DIRECT value collides with O_CLOEXEC");
-#endif
 #if defined(O_DSYNC)
 	StaticAssertStmt((PG_O_DIRECT & O_DSYNC) == 0,
 					 "PG_O_DIRECT value collides with O_DSYNC");
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8da813600c..2070eba4a2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -205,14 +205,15 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = PathNameOpenFile(path,
+						  O_RDWR | O_CREAT | O_EXCL | PG_BINARY | O_CLOEXEC);
 
 	if (fd < 0)
 	{
 		int			save_errno = errno;
 
 		if (isRedo)
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC);
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -523,7 +524,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC);
 
 	if (fd < 0)
 	{
@@ -1210,7 +1211,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | O_CLOEXEC | oflags);
 
 	pfree(fullpath);
 
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 59532bbd80..2f13a13898 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -163,6 +163,14 @@ InitPostmasterChild(void)
 
 	/* Request a signal if the postmaster dies, if possible. */
 	PostmasterDeathSignalInit();
+
+	/* Child processes should not inherit the postmaster pipe. */
+#ifndef WIN32
+	if (fcntl(postmaster_alive_fds[POSTMASTER_FD_WATCH], F_SETFD, FD_CLOEXEC) < 0)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
+#endif
 }
 
 /*
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 9488195799..a8685f1520 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -353,6 +353,13 @@ extern int	_pglstat64(const char *name, struct stat *buf);
  */
 #define O_DSYNC 0x0080
 
+/*
+ * Our open wrapper does not create inheritable handles, so it is safe to
+ * ignore O_CLOEXEC.  (If we were using Windows' own open(), it might be
+ * necessary to convert this to _O_NOINHERIT.)
+ */
+#define O_CLOEXEC 0
+
 /*
  * Supplement to <errno.h>.
  *
-- 
2.39.1

v3-0002-Use-accept4-to-accept-connections-where-available.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Use-accept4-to-accept-connections-where-available.patchDownload
From 002a689ece50518095b9d0fca71d24128fe9b147 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 13:51:19 +1300
Subject: [PATCH v3 2/3] Use accept4() to accept connections, where available.

The postmaster previously accepted connections and then made extra
system calls in child processes to make them non-blocking and (as of a
recent commit) close-on-exit.

On nearly all systems, that can be done atomically with flags to the
newer accept4() variant (recently accepted by POSIX for a future
revision, but around in the wild for a while now), saving a couple of
system calls.

The exceptions are Windows, where we didn't have to worry about that
problem anyway, EXEC_BACKEND builds on Unix (used by developers only for
slightly-more-like-Windows testing), and the straggler Unixes that
haven't implemented it yet (at the time of writing: macOS and AIX).

Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 configure                  | 12 ++++++++++++
 configure.ac               |  1 +
 meson.build                |  1 +
 src/backend/libpq/pqcomm.c | 34 +++++++++++++++++++++++++++-------
 src/include/pg_config.h.in |  4 ++++
 5 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/configure b/configure
index e35769ea73..d06f64cdbb 100755
--- a/configure
+++ b/configure
@@ -16219,6 +16219,18 @@ _ACEOF
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+ac_fn_c_check_decl "$LINENO" "accept4" "ac_cv_have_decl_accept4" "#include <sys/socket.h>
+"
+if test "x$ac_cv_have_decl_accept4" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_ACCEPT4 $ac_have_decl
+_ACEOF
+
 ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
diff --git a/configure.ac b/configure.ac
index af23c15cb2..070a0b33db 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1843,6 +1843,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
 
diff --git a/meson.build b/meson.build
index f534704452..3e15d6c825 100644
--- a/meson.build
+++ b/meson.build
@@ -2097,6 +2097,7 @@ decl_checks = [
 # checking for library symbols wouldn't handle deployment target
 # restrictions on macOS
 decl_checks += [
+  ['accept4', 'sys/socket.h'],
   ['preadv', 'sys/uio.h'],
   ['pwritev', 'sys/uio.h'],
 ]
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 0ccec33af4..70734f1a4b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -184,6 +184,14 @@ pq_init(void)
 	/* set up process-exit hook to close the socket */
 	on_proc_exit(socket_close, 0);
 
+#ifndef WIN32
+	/*
+	 * On most systems, the socket has already been made non-blocking and
+	 * close-on-exec by StreamConnection().  On systems that don't have
+	 * accept4() yet, and EXEC_BACKEND builds that need the socket to survive
+	 * exec*(), we do that separately here in the child process.
+	 */
+#if !HAVE_DECL_ACCEPT4 || defined(EXEC_BACKEND)
 	/*
 	 * In backends (as soon as forked) we operate the underlying socket in
 	 * nonblocking mode and use latches to implement blocking semantics if
@@ -194,13 +202,9 @@ pq_init(void)
 	 * the client, which might require changing the mode again, leading to
 	 * infinite recursion.
 	 */
-#ifndef WIN32
 	if (!pg_set_noblock(MyProcPort->sock))
 		ereport(COMMERROR,
 				(errmsg("could not set socket to nonblocking mode: %m")));
-#endif
-
-#ifndef WIN32
 
 	/*
 	 * Also make sure that any subprocesses that execute a new program don't
@@ -208,6 +212,7 @@ pq_init(void)
 	 */
 	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
 		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
 #endif
 
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
@@ -702,9 +707,24 @@ StreamConnection(pgsocket server_fd, Port *port)
 {
 	/* accept connection and fill in the client (remote) address */
 	port->raddr.salen = sizeof(port->raddr.addr);
-	if ((port->sock = accept(server_fd,
-							 (struct sockaddr *) &port->raddr.addr,
-							 &port->raddr.salen)) == PGINVALID_SOCKET)
+
+#if HAVE_DECL_ACCEPT4 && !defined(EXEC_BACKEND)
+	/*
+	 * If we have accept4(), we can avoid a couple of fcntl() calls in
+	 * pq_init().  We can't use this optimization in EXEC_BACKEND builds,
+	 * though, because they need the socket to survive exec().
+	 */
+	port->sock = accept4(server_fd,
+						 (struct sockaddr *) &port->raddr.addr,
+						 &port->raddr.salen,
+						 SOCK_CLOEXEC | SOCK_NONBLOCK);
+#else
+	port->sock = accept(server_fd,
+						(struct sockaddr *) &port->raddr.addr,
+						&port->raddr.salen);
+#endif
+
+	if (port->sock == PGINVALID_SOCKET)
 	{
 		ereport(LOG,
 				(errcode_for_socket_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 20c82f5979..e21d0f05f5 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -91,6 +91,10 @@
 /* Define to 1 if you have the `CRYPTO_lock' function. */
 #undef HAVE_CRYPTO_LOCK
 
+/* Define to 1 if you have the declaration of `accept4', and to 0 if you
+   don't. */
+#undef HAVE_DECL_ACCEPT4
+
 /* Define to 1 if you have the declaration of `fdatasync', and to 0 if you
    don't. */
 #undef HAVE_DECL_FDATASYNC
-- 
2.39.1

v3-0003-Use-newer-client-socket-options-where-available.patchtext/x-patch; charset=US-ASCII; name=v3-0003-Use-newer-client-socket-options-where-available.patchDownload
From abe187d253e3547ab5dabce9128b973679442baf Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 14:54:03 +1300
Subject: [PATCH v3 3/3] Use newer client socket options where available.

As a recent commit did for the server side, we can now also try to set
client sockets to nonblocking and close-on-exit atomically when creating
the socket.  This uses flags expected in the next POSIX revision, but
already widely available on common platforms.

This saves a couple of fcntl() calls, and plugs a tiny window for
resource leaks in multi-threaded programs that might fork-and-exec while
another thread would otherwise have to run socket() and then fcntl().

Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 50b5df3490..953abc640f 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2464,6 +2464,7 @@ keep_going:						/* We will come back to here until there is
 				{
 					struct addrinfo *addr_cur = conn->addr_cur;
 					char		host_addr[NI_MAXHOST];
+					int			sock_type;
 
 					/*
 					 * Advance to next possible host, if we've tried all of
@@ -2494,7 +2495,23 @@ keep_going:						/* We will come back to here until there is
 						conn->connip = strdup(host_addr);
 
 					/* Try to create the socket */
-					conn->sock = socket(addr_cur->ai_family, SOCK_STREAM, 0);
+					sock_type = SOCK_STREAM;
+#ifdef SOCK_CLOEXEC
+					/*
+					 * Atomically mark close-on-exec, if possible on this
+					 * platform, for the benefit of multi-threaded programs.
+					 * Otherwise see below.
+					 */
+					sock_type |= SOCK_CLOEXEC;
+#endif
+#ifdef SOCK_NONBLOCK
+					/*
+					 * We might as well skip a system call for nonblocking
+					 * mode too, if we can.
+					 */
+					sock_type |= SOCK_NONBLOCK;
+#endif
+					conn->sock = socket(addr_cur->ai_family, sock_type, 0);
 					if (conn->sock == PGINVALID_SOCKET)
 					{
 						int			errorno = SOCK_ERRNO;
@@ -2540,6 +2557,7 @@ keep_going:						/* We will come back to here until there is
 							goto keep_going;
 						}
 					}
+#ifndef SOCK_NONBLOCK
 					if (!pg_set_noblock(conn->sock))
 					{
 						libpq_append_conn_error(conn, "could not set socket to nonblocking mode: %s",
@@ -2547,7 +2565,9 @@ keep_going:						/* We will come back to here until there is
 						conn->try_next_addr = true;
 						goto keep_going;
 					}
+#endif
 
+#ifndef SOCK_CLOEXEC
 #ifdef F_SETFD
 					if (fcntl(conn->sock, F_SETFD, FD_CLOEXEC) == -1)
 					{
@@ -2557,6 +2577,7 @@ keep_going:						/* We will come back to here until there is
 						goto keep_going;
 					}
 #endif							/* F_SETFD */
+#endif
 
 					if (addr_cur->ai_family != AF_UNIX)
 					{
-- 
2.39.1

#9Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#8)
5 attachment(s)
Re: File descriptors in exec'd subprocesses

Something bothered me about the previous versions: Which layer should
add O_CLOEXEC, given that we have md.c -> PathNameOpenXXX() ->
BasicOpenFile() -> open()? Previously I had md.c adding it, but on
reflection, it makes no sense to open a "File" (virtual file
descriptor) that is *not* O_CLOEXEC. The client doesn't really know
when the raw descriptor is currently open and shouldn't really access
it; it would be strange to want it to survive a call to exec*(). I
think the 'highest' level API that we could consider requiring
O_CLOEXEC to be passed in explicitly, or not, is BasicOpenFile().
Done like that in this version. This is the version I'm thinking of
committing, unless someone wants to argue for another level.

Another good choice would be to do it inside BasicOpenFile(), and then
the patch would be smaller again (xlog.c wouldn't need to mention it,
and there would perhaps be less risk that some long-lived descriptor
somewhere else has failed to request it), but perhaps that would be
presumptuous. That function returns raw descriptors, and the caller,
perhaps an extension, might legitimately want to make an inheritable
descriptor for some reason, I guess? Does anyone think I should move
it in there instead?

I realised that if we're going to use accept4() to cut down on
syscalls, we could also do the same for the postmaster pipe with
pipe2().

Here also is a tiny archeological cleanup to avoid creating
contradictory claims about whether all computers have O_CLOEXEC.

I toyed with the idea of a tiny Linux-only regression test using "COPY
fds FROM PROGRAM 'ls /proc/self/fd'" expecting 0, 1, 2, 3 (3 being
ls's opendir()), but that's probably a little too cute; and also
showed me that pg_regress.c leaks its log file, the fix for which is
to add "e" to its fdopen(), but that's another POSIX-next feature[1]https://wiki.freebsd.org/AtomicCloseOnExec
that seems a little harder to detect, and I gave up on that.

[1]: https://wiki.freebsd.org/AtomicCloseOnExec

Attachments:

v4-0001-Remove-obsolete-coding-for-macOS-10.7.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Remove-obsolete-coding-for-macOS-10.7.patchDownload
From 86bf540c917b49a60ecd7f170ee5e8f1c7df6802 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Tue, 21 Feb 2023 14:47:19 +1300
Subject: [PATCH v4 1/5] Remove obsolete coding for macOS < 10.7.

Commits 04cad8f7 and 0c088568 supported old macOS systems that didn't
define O_CLOEXEC or O_DSYNC yet, but those arrived in macOS releases
10.7 and 10.6 (respectively), which reached EOL around a decade ago.
We've already made use of other POSIX features that early macOS vintages
can't compile (for example commits 623cc673, d2e15083).

A proposed future commit will use O_CLOEXEC on POSIX systems so it would
seem strange to pretend here that it's optional, and we might as well
give O_DSYNC the same treatment since the reference is also guarded by a
test for a macOS-specific macro, and we know that current Macs have it.

Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/backend/storage/file/fd.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 926d000f2e..ea690f05c6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1025,7 +1025,9 @@ tryAgain:
 	 */
 	StaticAssertStmt((PG_O_DIRECT &
 					  (O_APPEND |
+					   O_CLOEXEC |
 					   O_CREAT |
+					   O_DSYNC |
 					   O_EXCL |
 					   O_RDWR |
 					   O_RDONLY |
@@ -1033,15 +1035,6 @@ tryAgain:
 					   O_TRUNC |
 					   O_WRONLY)) == 0,
 					 "PG_O_DIRECT value collides with standard flag");
-#if defined(O_CLOEXEC)
-	StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
-					 "PG_O_DIRECT value collides with O_CLOEXEC");
-#endif
-#if defined(O_DSYNC)
-	StaticAssertStmt((PG_O_DIRECT & O_DSYNC) == 0,
-					 "PG_O_DIRECT value collides with O_DSYNC");
-#endif
-
 	fd = open(fileName, fileFlags & ~PG_O_DIRECT, fileMode);
 #else
 	fd = open(fileName, fileFlags, fileMode);
-- 
2.39.1

v4-0002-Don-t-leak-descriptors-into-subprograms.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Don-t-leak-descriptors-into-subprograms.patchDownload
From 1f8265b2971ae7bfc686aafb9d3fdeae2a0dcc1d Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 20 Feb 2023 23:26:36 +1300
Subject: [PATCH v4 2/5] Don't leak descriptors into subprograms.

Open long-lived data and WAL file descriptors with O_CLOEXEC.  This flag
was introduced by SUSv4 (POSIX.1-2008), and by now all of our target
Unix systems have it.  Our open() implementation for Windows already had
that behavior, so provide a dummy O_CLOEXEC flag on that platform.

Do the same for sockets and the postmaster pipe with FD_CLOEXEC.  Later
commits might use modern interfaces to remove these extra fcntl() calls
where possible, but we'll need them as a fallback for a couple of
systems, so do it that way in this initial commit.

With this change, subprograms executed for archiving, copying etc will
no longer have access to the server's descriptors, other than the ones
that we decide to pass down.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 9 ++++++---
 src/backend/libpq/pqcomm.c        | 7 +++++++
 src/backend/storage/file/fd.c     | 8 ++++++++
 src/backend/utils/init/miscinit.c | 8 ++++++++
 src/include/port/win32_port.h     | 7 +++++++
 5 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..87af608d15 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2936,7 +2936,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
 	*added = false;
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+					   get_sync_bit(sync_method));
 	if (fd < 0)
 	{
 		if (errno != ENOENT)
@@ -3097,7 +3098,8 @@ XLogFileInit(XLogSegNo logsegno, TimeLineID logtli)
 		return fd;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+					   get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3328,7 +3330,8 @@ XLogFileOpen(XLogSegNo segno, TimeLineID tli)
 
 	XLogFilePath(path, tli, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+					   get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 864c9debe8..da5bb5fc5d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -200,6 +200,13 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+#ifndef WIN32
+
+	/* Don't give the socket to any subprograms we execute. */
+	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
+		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
+
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
 	socket_pos = AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
 								   MyProcPort->sock, NULL, NULL);
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index ea690f05c6..132a85c04b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1515,6 +1515,14 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	/* Close excess kernel FDs. */
 	ReleaseLruFiles();
 
+	/*
+	 * Descriptors managed by virtual descriptors are implicitly O_CLOEXEC.
+	 * The client shouldn't be expected to know which kernel descriptors are
+	 * currently open, so it wouldn't make sense for them to be inherited by
+	 * executed subprograms.
+	 */
+	fileFlags |= O_CLOEXEC;
+
 	vfdP->fd = BasicOpenFilePerm(fileName, fileFlags, fileMode);
 
 	if (vfdP->fd < 0)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 59532bbd80..7eb7fe87f6 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -163,6 +163,14 @@ InitPostmasterChild(void)
 
 	/* Request a signal if the postmaster dies, if possible. */
 	PostmasterDeathSignalInit();
+
+	/* Don't give the pipe to subprograms that we execute. */
+#ifndef WIN32
+	if (fcntl(postmaster_alive_fds[POSTMASTER_FD_WATCH], F_SETFD, FD_CLOEXEC) < 0)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
+#endif
 }
 
 /*
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 9488195799..5116c2fc06 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -353,6 +353,13 @@ extern int	_pglstat64(const char *name, struct stat *buf);
  */
 #define O_DSYNC 0x0080
 
+/*
+ * Our open() replacement does not create inheritable handles, so it is safe to
+ * ignore O_CLOEXEC.  (If we were using Windows' own open(), it might be
+ * necessary to convert this to _O_NOINHERIT.)
+ */
+#define O_CLOEXEC 0
+
 /*
  * Supplement to <errno.h>.
  *
-- 
2.39.1

v4-0003-Use-accept4-to-accept-connections-where-available.patchtext/x-patch; charset=US-ASCII; name=v4-0003-Use-accept4-to-accept-connections-where-available.patchDownload
From a8b5a565867716cf0706f6bd1a2946cd96c6d12f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 13:51:19 +1300
Subject: [PATCH v4 3/5] Use accept4() to accept connections, where available.

The postmaster previously accepted connections and then made extra
system calls in child processes to make them non-blocking and (as of a
recent commit) close-on-exit.

On all target Unixes except macOS and AIX, that can be done atomically
with flags to the newer accept4() variant (expected in the next POSIX
revision, but around in the wild for many years now), saving a couple of
system calls.

The exceptions are Windows, where we didn't have to worry about that
problem anyway, EXEC_BACKEND builds on Unix (used by developers only for
slightly-more-like-Windows testing), and the straggler Unixes that
haven't implemented it yet (at the time of writing: macOS and AIX).

Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 configure                  | 12 ++++++++++++
 configure.ac               |  1 +
 meson.build                |  1 +
 src/backend/libpq/pqcomm.c | 34 +++++++++++++++++++++++++++-------
 src/include/pg_config.h.in |  4 ++++
 5 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/configure b/configure
index e35769ea73..d06f64cdbb 100755
--- a/configure
+++ b/configure
@@ -16219,6 +16219,18 @@ _ACEOF
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+ac_fn_c_check_decl "$LINENO" "accept4" "ac_cv_have_decl_accept4" "#include <sys/socket.h>
+"
+if test "x$ac_cv_have_decl_accept4" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_ACCEPT4 $ac_have_decl
+_ACEOF
+
 ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
diff --git a/configure.ac b/configure.ac
index af23c15cb2..070a0b33db 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1843,6 +1843,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
 
diff --git a/meson.build b/meson.build
index f534704452..3e15d6c825 100644
--- a/meson.build
+++ b/meson.build
@@ -2097,6 +2097,7 @@ decl_checks = [
 # checking for library symbols wouldn't handle deployment target
 # restrictions on macOS
 decl_checks += [
+  ['accept4', 'sys/socket.h'],
   ['preadv', 'sys/uio.h'],
   ['pwritev', 'sys/uio.h'],
 ]
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index da5bb5fc5d..8b97d1a7e3 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -184,6 +184,14 @@ pq_init(void)
 	/* set up process-exit hook to close the socket */
 	on_proc_exit(socket_close, 0);
 
+#ifndef WIN32
+	/*
+	 * On most systems, the socket has already been made non-blocking and
+	 * close-on-exec by StreamConnection().  On systems that don't have
+	 * accept4() yet, and EXEC_BACKEND builds that need the socket to survive
+	 * exec*(), we do that separately here in the child process.
+	 */
+#if !HAVE_DECL_ACCEPT4 || defined(EXEC_BACKEND)
 	/*
 	 * In backends (as soon as forked) we operate the underlying socket in
 	 * nonblocking mode and use latches to implement blocking semantics if
@@ -194,17 +202,14 @@ pq_init(void)
 	 * the client, which might require changing the mode again, leading to
 	 * infinite recursion.
 	 */
-#ifndef WIN32
 	if (!pg_set_noblock(MyProcPort->sock))
 		ereport(COMMERROR,
 				(errmsg("could not set socket to nonblocking mode: %m")));
-#endif
-
-#ifndef WIN32
 
 	/* Don't give the socket to any subprograms we execute. */
 	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
 		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
 #endif
 
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
@@ -699,9 +704,24 @@ StreamConnection(pgsocket server_fd, Port *port)
 {
 	/* accept connection and fill in the client (remote) address */
 	port->raddr.salen = sizeof(port->raddr.addr);
-	if ((port->sock = accept(server_fd,
-							 (struct sockaddr *) &port->raddr.addr,
-							 &port->raddr.salen)) == PGINVALID_SOCKET)
+
+#if HAVE_DECL_ACCEPT4 && !defined(EXEC_BACKEND)
+	/*
+	 * If we have accept4(), we can avoid a couple of fcntl() calls in
+	 * pq_init().  We can't use this optimization in EXEC_BACKEND builds,
+	 * though, because they need the socket to survive exec().
+	 */
+	port->sock = accept4(server_fd,
+						 (struct sockaddr *) &port->raddr.addr,
+						 &port->raddr.salen,
+						 SOCK_CLOEXEC | SOCK_NONBLOCK);
+#else
+	port->sock = accept(server_fd,
+						(struct sockaddr *) &port->raddr.addr,
+						&port->raddr.salen);
+#endif
+
+	if (port->sock == PGINVALID_SOCKET)
 	{
 		ereport(LOG,
 				(errcode_for_socket_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 20c82f5979..e21d0f05f5 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -91,6 +91,10 @@
 /* Define to 1 if you have the `CRYPTO_lock' function. */
 #undef HAVE_CRYPTO_LOCK
 
+/* Define to 1 if you have the declaration of `accept4', and to 0 if you
+   don't. */
+#undef HAVE_DECL_ACCEPT4
+
 /* Define to 1 if you have the declaration of `fdatasync', and to 0 if you
    don't. */
 #undef HAVE_DECL_FDATASYNC
-- 
2.39.1

v4-0004-Use-newer-client-socket-options-where-available.patchtext/x-patch; charset=US-ASCII; name=v4-0004-Use-newer-client-socket-options-where-available.patchDownload
From fdae48117c9bc0d7d25eed09ec57710fda6c28ed Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 14:54:03 +1300
Subject: [PATCH v4 4/5] Use newer client socket options where available.

As a recent commit did for the server side, we can now also try to set
client sockets to nonblocking and close-on-exit atomically when creating
the socket.  This uses flags expected in the next POSIX revision, but
already widely available on common platforms.

This saves a couple of fcntl() calls, and plugs a tiny window for
resource leaks in multi-threaded programs that might fork-and-exec while
another thread would otherwise have to run socket() and then fcntl().

Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 50b5df3490..953abc640f 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2464,6 +2464,7 @@ keep_going:						/* We will come back to here until there is
 				{
 					struct addrinfo *addr_cur = conn->addr_cur;
 					char		host_addr[NI_MAXHOST];
+					int			sock_type;
 
 					/*
 					 * Advance to next possible host, if we've tried all of
@@ -2494,7 +2495,23 @@ keep_going:						/* We will come back to here until there is
 						conn->connip = strdup(host_addr);
 
 					/* Try to create the socket */
-					conn->sock = socket(addr_cur->ai_family, SOCK_STREAM, 0);
+					sock_type = SOCK_STREAM;
+#ifdef SOCK_CLOEXEC
+					/*
+					 * Atomically mark close-on-exec, if possible on this
+					 * platform, for the benefit of multi-threaded programs.
+					 * Otherwise see below.
+					 */
+					sock_type |= SOCK_CLOEXEC;
+#endif
+#ifdef SOCK_NONBLOCK
+					/*
+					 * We might as well skip a system call for nonblocking
+					 * mode too, if we can.
+					 */
+					sock_type |= SOCK_NONBLOCK;
+#endif
+					conn->sock = socket(addr_cur->ai_family, sock_type, 0);
 					if (conn->sock == PGINVALID_SOCKET)
 					{
 						int			errorno = SOCK_ERRNO;
@@ -2540,6 +2557,7 @@ keep_going:						/* We will come back to here until there is
 							goto keep_going;
 						}
 					}
+#ifndef SOCK_NONBLOCK
 					if (!pg_set_noblock(conn->sock))
 					{
 						libpq_append_conn_error(conn, "could not set socket to nonblocking mode: %s",
@@ -2547,7 +2565,9 @@ keep_going:						/* We will come back to here until there is
 						conn->try_next_addr = true;
 						goto keep_going;
 					}
+#endif
 
+#ifndef SOCK_CLOEXEC
 #ifdef F_SETFD
 					if (fcntl(conn->sock, F_SETFD, FD_CLOEXEC) == -1)
 					{
@@ -2557,6 +2577,7 @@ keep_going:						/* We will come back to here until there is
 						goto keep_going;
 					}
 #endif							/* F_SETFD */
+#endif
 
 					if (addr_cur->ai_family != AF_UNIX)
 					{
-- 
2.39.1

v4-0005-Use-pipe2-for-postmaster-pipe-where-available.patchtext/x-patch; charset=US-ASCII; name=v4-0005-Use-pipe2-for-postmaster-pipe-where-available.patchDownload
From d686f55890b3cee7f471e038760e4aba4766aee4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 21 Feb 2023 11:50:08 +1300
Subject: [PATCH v4 5/5] Use pipe2() for postmaster pipe, where available.

The 2-argument variant of pipe() (expected in the next POSIX revision)
saves the need to make a separate fcntl() call later.  It's already
available on all targeted Unixes except macOS and AIX.

Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 configure                           | 12 ++++++++++++
 configure.ac                        |  1 +
 meson.build                         |  1 +
 src/backend/postmaster/postmaster.c | 10 ++++++++++
 src/backend/utils/init/miscinit.c   |  9 ++++++++-
 src/include/pg_config.h.in          |  4 ++++
 6 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/configure b/configure
index d06f64cdbb..0aff9f5732 100755
--- a/configure
+++ b/configure
@@ -16231,6 +16231,18 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE_DECL_ACCEPT4 $ac_have_decl
 _ACEOF
 
+ac_fn_c_check_decl "$LINENO" "pipe2" "ac_cv_have_decl_pipe2" "#include <unistd.h>
+"
+if test "x$ac_cv_have_decl_pipe2" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_PIPE2 $ac_have_decl
+_ACEOF
+
 ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
diff --git a/configure.ac b/configure.ac
index 070a0b33db..d75eb76897 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1844,6 +1844,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
 AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>])
+AC_CHECK_DECLS([pipe2], [], [], [#include <unistd.h>])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
 
diff --git a/meson.build b/meson.build
index 3e15d6c825..73b1a49573 100644
--- a/meson.build
+++ b/meson.build
@@ -2098,6 +2098,7 @@ decl_checks = [
 # restrictions on macOS
 decl_checks += [
   ['accept4', 'sys/socket.h'],
+  ['pipe2', 'unistd.h'],
   ['preadv', 'sys/uio.h'],
   ['pwritev', 'sys/uio.h'],
 ]
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2552327d90..fdcfdfd558 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -6506,12 +6506,22 @@ InitPostmasterDeathWatchHandle(void)
 	 * close the write end as soon as possible after forking, because EOF
 	 * won't be signaled in the read end until all processes have closed the
 	 * write fd. That is taken care of in ClosePostmasterPorts().
+	 *
+	 * If this platform has pipe2(), and we're not in an EXEC_BACKEND build,
+	 * then we can avoid a later fcntl() call by asking for O_CLOEXEC now.
 	 */
 	Assert(MyProcPid == PostmasterPid);
+#if HAVE_DECL_PIPE2 && !defined(EXEC_BACKEND)
+	if (pipe2(postmaster_alive_fds, O_CLOEXEC) < 0)
+		ereport(FATAL,
+				(errcode_for_file_access(),
+				 errmsg_internal("could not create pipe to monitor postmaster death: %m")));
+#else
 	if (pipe(postmaster_alive_fds) < 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
 				 errmsg_internal("could not create pipe to monitor postmaster death: %m")));
+#endif
 
 	/* Notify fd.c that we've eaten two FDs for the pipe. */
 	ReserveExternalFD();
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 7eb7fe87f6..f706b73e91 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -164,13 +164,20 @@ InitPostmasterChild(void)
 	/* Request a signal if the postmaster dies, if possible. */
 	PostmasterDeathSignalInit();
 
-	/* Don't give the pipe to subprograms that we execute. */
+	/*
+	 * Don't give the pipe to subprograms that we execute.  We only need to
+	 * do this explicitly here if the platform lacks pipe2(), or we're in an
+	 * EXEC_BACKEND build so the pipe had to survive the first level of
+	 * exec*() to get here.
+	 */
 #ifndef WIN32
+#if !HAVE_DECL_PIPE2 || defined(EXEC_BACKEND)
 	if (fcntl(postmaster_alive_fds[POSTMASTER_FD_WATCH], F_SETFD, FD_CLOEXEC) < 0)
 		ereport(FATAL,
 				(errcode_for_socket_access(),
 				 errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
 #endif
+#endif
 }
 
 /*
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index e21d0f05f5..5cd5acf98e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -123,6 +123,10 @@
    to 0 if you don't. */
 #undef HAVE_DECL_LLVMORCGETSYMBOLADDRESSIN
 
+/* Define to 1 if you have the declaration of `pipe2', and to 0 if you don't.
+   */
+#undef HAVE_DECL_PIPE2
+
 /* Define to 1 if you have the declaration of `posix_fadvise', and to 0 if you
    don't. */
 #undef HAVE_DECL_POSIX_FADVISE
-- 
2.39.1

#10Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: Thomas Munro (#9)
Re: File descriptors in exec'd subprocesses

On Mon, 20 Feb 2023 at 23:04, Thomas Munro <thomas.munro@gmail.com> wrote:

Done like that in this version. This is the version I'm thinking of
committing, unless someone wants to argue for another level.

FWIW the cfbot doesn't understand this patch series. I'm not sure why
but it's only trying to apply the first (the MacOS one) and it's
failing to apply even that.

--
Gregory Stark
As Commitfest Manager

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Gregory Stark (as CFM) (#10)
Re: File descriptors in exec'd subprocesses

On Thu, Mar 2, 2023 at 9:49 AM Gregory Stark (as CFM)
<stark.cfm@gmail.com> wrote:

On Mon, 20 Feb 2023 at 23:04, Thomas Munro <thomas.munro@gmail.com> wrote:

Done like that in this version. This is the version I'm thinking of
committing, unless someone wants to argue for another level.

FWIW the cfbot doesn't understand this patch series. I'm not sure why
but it's only trying to apply the first (the MacOS one) and it's
failing to apply even that.

Ah, it's because I committed one patch in the series. I'll commit one
more, and then repost the rest, shortly.

#12Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#11)
3 attachment(s)
Re: File descriptors in exec'd subprocesses

On Thu, Mar 2, 2023 at 9:57 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Mar 2, 2023 at 9:49 AM Gregory Stark (as CFM)
<stark.cfm@gmail.com> wrote:

On Mon, 20 Feb 2023 at 23:04, Thomas Munro <thomas.munro@gmail.com> wrote:

Done like that in this version. This is the version I'm thinking of
committing, unless someone wants to argue for another level.

FWIW the cfbot doesn't understand this patch series. I'm not sure why
but it's only trying to apply the first (the MacOS one) and it's
failing to apply even that.

Ah, it's because I committed one patch in the series. I'll commit one
more, and then repost the rest, shortly.

I pushed the main patch, "Don't leak descriptors into subprograms.".
Here's a rebase of the POSIX-next stuff, but I'll sit on these for a
bit longer to see if the build farm agrees with my claim about the
ubiquity of O_CLOEXEC, and if anyone has comments on this stuff.

Attachments:

v5-0001-Use-accept4-to-accept-connections-where-available.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Use-accept4-to-accept-connections-where-available.patchDownload
From 1d03c0b2811b53a20eeff781918ef99d4c9b9df9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 13:51:19 +1300
Subject: [PATCH v5 1/3] Use accept4() to accept connections, where available.

The postmaster previously accepted connections and then made extra
system calls in child processes to make them non-blocking and (as of a
recent commit) close-on-exit.

On all target Unixes except macOS and AIX, that can be done atomically
with flags to the newer accept4() variant (expected in the next POSIX
revision, but around in the wild for many years now), saving a couple of
system calls.

The exceptions are Windows, where we didn't have to worry about that
problem anyway, EXEC_BACKEND builds on Unix (used by developers only for
slightly-more-like-Windows testing), and the straggler Unixes that
haven't implemented it yet (at the time of writing: macOS and AIX).

Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 configure                  | 12 ++++++++++++
 configure.ac               |  1 +
 meson.build                |  1 +
 src/backend/libpq/pqcomm.c | 34 +++++++++++++++++++++++++++-------
 src/include/pg_config.h.in |  4 ++++
 5 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/configure b/configure
index e35769ea73..d06f64cdbb 100755
--- a/configure
+++ b/configure
@@ -16219,6 +16219,18 @@ _ACEOF
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+ac_fn_c_check_decl "$LINENO" "accept4" "ac_cv_have_decl_accept4" "#include <sys/socket.h>
+"
+if test "x$ac_cv_have_decl_accept4" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_ACCEPT4 $ac_have_decl
+_ACEOF
+
 ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
diff --git a/configure.ac b/configure.ac
index af23c15cb2..070a0b33db 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1843,6 +1843,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
 
diff --git a/meson.build b/meson.build
index 26be83afb6..c91fb05133 100644
--- a/meson.build
+++ b/meson.build
@@ -2097,6 +2097,7 @@ decl_checks = [
 # checking for library symbols wouldn't handle deployment target
 # restrictions on macOS
 decl_checks += [
+  ['accept4', 'sys/socket.h'],
   ['preadv', 'sys/uio.h'],
   ['pwritev', 'sys/uio.h'],
 ]
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index da5bb5fc5d..8b97d1a7e3 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -184,6 +184,14 @@ pq_init(void)
 	/* set up process-exit hook to close the socket */
 	on_proc_exit(socket_close, 0);
 
+#ifndef WIN32
+	/*
+	 * On most systems, the socket has already been made non-blocking and
+	 * close-on-exec by StreamConnection().  On systems that don't have
+	 * accept4() yet, and EXEC_BACKEND builds that need the socket to survive
+	 * exec*(), we do that separately here in the child process.
+	 */
+#if !HAVE_DECL_ACCEPT4 || defined(EXEC_BACKEND)
 	/*
 	 * In backends (as soon as forked) we operate the underlying socket in
 	 * nonblocking mode and use latches to implement blocking semantics if
@@ -194,17 +202,14 @@ pq_init(void)
 	 * the client, which might require changing the mode again, leading to
 	 * infinite recursion.
 	 */
-#ifndef WIN32
 	if (!pg_set_noblock(MyProcPort->sock))
 		ereport(COMMERROR,
 				(errmsg("could not set socket to nonblocking mode: %m")));
-#endif
-
-#ifndef WIN32
 
 	/* Don't give the socket to any subprograms we execute. */
 	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
 		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
 #endif
 
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
@@ -699,9 +704,24 @@ StreamConnection(pgsocket server_fd, Port *port)
 {
 	/* accept connection and fill in the client (remote) address */
 	port->raddr.salen = sizeof(port->raddr.addr);
-	if ((port->sock = accept(server_fd,
-							 (struct sockaddr *) &port->raddr.addr,
-							 &port->raddr.salen)) == PGINVALID_SOCKET)
+
+#if HAVE_DECL_ACCEPT4 && !defined(EXEC_BACKEND)
+	/*
+	 * If we have accept4(), we can avoid a couple of fcntl() calls in
+	 * pq_init().  We can't use this optimization in EXEC_BACKEND builds,
+	 * though, because they need the socket to survive exec().
+	 */
+	port->sock = accept4(server_fd,
+						 (struct sockaddr *) &port->raddr.addr,
+						 &port->raddr.salen,
+						 SOCK_CLOEXEC | SOCK_NONBLOCK);
+#else
+	port->sock = accept(server_fd,
+						(struct sockaddr *) &port->raddr.addr,
+						&port->raddr.salen);
+#endif
+
+	if (port->sock == PGINVALID_SOCKET)
 	{
 		ereport(LOG,
 				(errcode_for_socket_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 20c82f5979..e21d0f05f5 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -91,6 +91,10 @@
 /* Define to 1 if you have the `CRYPTO_lock' function. */
 #undef HAVE_CRYPTO_LOCK
 
+/* Define to 1 if you have the declaration of `accept4', and to 0 if you
+   don't. */
+#undef HAVE_DECL_ACCEPT4
+
 /* Define to 1 if you have the declaration of `fdatasync', and to 0 if you
    don't. */
 #undef HAVE_DECL_FDATASYNC
-- 
2.39.1

v5-0002-Use-newer-client-socket-options-where-available.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Use-newer-client-socket-options-where-available.patchDownload
From 51772e0b4b47981056d874c5c170ca5e3945c789 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 14:54:03 +1300
Subject: [PATCH v5 2/3] Use newer client socket options where available.

As a recent commit did for the server side, we can now also try to set
client sockets to nonblocking and close-on-exit atomically when creating
the socket.  This uses flags expected in the next POSIX revision, but
already widely available on common platforms.

This saves a couple of fcntl() calls, and plugs a tiny window for
resource leaks in multi-threaded programs that might fork-and-exec while
another thread would otherwise have to run socket() and then fcntl().

Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 8f80c35c89..c5907eb786 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2464,6 +2464,7 @@ keep_going:						/* We will come back to here until there is
 				{
 					struct addrinfo *addr_cur = conn->addr_cur;
 					char		host_addr[NI_MAXHOST];
+					int			sock_type;
 
 					/*
 					 * Advance to next possible host, if we've tried all of
@@ -2494,7 +2495,23 @@ keep_going:						/* We will come back to here until there is
 						conn->connip = strdup(host_addr);
 
 					/* Try to create the socket */
-					conn->sock = socket(addr_cur->ai_family, SOCK_STREAM, 0);
+					sock_type = SOCK_STREAM;
+#ifdef SOCK_CLOEXEC
+					/*
+					 * Atomically mark close-on-exec, if possible on this
+					 * platform, for the benefit of multi-threaded programs.
+					 * Otherwise see below.
+					 */
+					sock_type |= SOCK_CLOEXEC;
+#endif
+#ifdef SOCK_NONBLOCK
+					/*
+					 * We might as well skip a system call for nonblocking
+					 * mode too, if we can.
+					 */
+					sock_type |= SOCK_NONBLOCK;
+#endif
+					conn->sock = socket(addr_cur->ai_family, sock_type, 0);
 					if (conn->sock == PGINVALID_SOCKET)
 					{
 						int			errorno = SOCK_ERRNO;
@@ -2540,6 +2557,7 @@ keep_going:						/* We will come back to here until there is
 							goto keep_going;
 						}
 					}
+#ifndef SOCK_NONBLOCK
 					if (!pg_set_noblock(conn->sock))
 					{
 						libpq_append_conn_error(conn, "could not set socket to nonblocking mode: %s",
@@ -2547,7 +2565,9 @@ keep_going:						/* We will come back to here until there is
 						conn->try_next_addr = true;
 						goto keep_going;
 					}
+#endif
 
+#ifndef SOCK_CLOEXEC
 #ifdef F_SETFD
 					if (fcntl(conn->sock, F_SETFD, FD_CLOEXEC) == -1)
 					{
@@ -2557,6 +2577,7 @@ keep_going:						/* We will come back to here until there is
 						goto keep_going;
 					}
 #endif							/* F_SETFD */
+#endif
 
 					if (addr_cur->ai_family != AF_UNIX)
 					{
-- 
2.39.1

v5-0003-Use-pipe2-for-postmaster-pipe-where-available.patchtext/x-patch; charset=US-ASCII; name=v5-0003-Use-pipe2-for-postmaster-pipe-where-available.patchDownload
From 92b0b27733f0f22e230a9139657d9b432ebf6ee0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 21 Feb 2023 11:50:08 +1300
Subject: [PATCH v5 3/3] Use pipe2() for postmaster pipe, where available.

The 2-argument variant of pipe() (expected in the next POSIX revision)
saves the need to make a separate fcntl() call later.  It's already
available on all targeted Unixes except macOS and AIX.

Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 configure                           | 12 ++++++++++++
 configure.ac                        |  1 +
 meson.build                         |  1 +
 src/backend/postmaster/postmaster.c | 10 ++++++++++
 src/backend/utils/init/miscinit.c   |  9 ++++++++-
 src/include/pg_config.h.in          |  4 ++++
 6 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/configure b/configure
index d06f64cdbb..0aff9f5732 100755
--- a/configure
+++ b/configure
@@ -16231,6 +16231,18 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE_DECL_ACCEPT4 $ac_have_decl
 _ACEOF
 
+ac_fn_c_check_decl "$LINENO" "pipe2" "ac_cv_have_decl_pipe2" "#include <unistd.h>
+"
+if test "x$ac_cv_have_decl_pipe2" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_PIPE2 $ac_have_decl
+_ACEOF
+
 ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
diff --git a/configure.ac b/configure.ac
index 070a0b33db..d75eb76897 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1844,6 +1844,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
 AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>])
+AC_CHECK_DECLS([pipe2], [], [], [#include <unistd.h>])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
 
diff --git a/meson.build b/meson.build
index c91fb05133..ad2faefff4 100644
--- a/meson.build
+++ b/meson.build
@@ -2098,6 +2098,7 @@ decl_checks = [
 # restrictions on macOS
 decl_checks += [
   ['accept4', 'sys/socket.h'],
+  ['pipe2', 'unistd.h'],
   ['preadv', 'sys/uio.h'],
   ['pwritev', 'sys/uio.h'],
 ]
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2552327d90..fdcfdfd558 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -6506,12 +6506,22 @@ InitPostmasterDeathWatchHandle(void)
 	 * close the write end as soon as possible after forking, because EOF
 	 * won't be signaled in the read end until all processes have closed the
 	 * write fd. That is taken care of in ClosePostmasterPorts().
+	 *
+	 * If this platform has pipe2(), and we're not in an EXEC_BACKEND build,
+	 * then we can avoid a later fcntl() call by asking for O_CLOEXEC now.
 	 */
 	Assert(MyProcPid == PostmasterPid);
+#if HAVE_DECL_PIPE2 && !defined(EXEC_BACKEND)
+	if (pipe2(postmaster_alive_fds, O_CLOEXEC) < 0)
+		ereport(FATAL,
+				(errcode_for_file_access(),
+				 errmsg_internal("could not create pipe to monitor postmaster death: %m")));
+#else
 	if (pipe(postmaster_alive_fds) < 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
 				 errmsg_internal("could not create pipe to monitor postmaster death: %m")));
+#endif
 
 	/* Notify fd.c that we've eaten two FDs for the pipe. */
 	ReserveExternalFD();
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 7eb7fe87f6..f706b73e91 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -164,13 +164,20 @@ InitPostmasterChild(void)
 	/* Request a signal if the postmaster dies, if possible. */
 	PostmasterDeathSignalInit();
 
-	/* Don't give the pipe to subprograms that we execute. */
+	/*
+	 * Don't give the pipe to subprograms that we execute.  We only need to
+	 * do this explicitly here if the platform lacks pipe2(), or we're in an
+	 * EXEC_BACKEND build so the pipe had to survive the first level of
+	 * exec*() to get here.
+	 */
 #ifndef WIN32
+#if !HAVE_DECL_PIPE2 || defined(EXEC_BACKEND)
 	if (fcntl(postmaster_alive_fds[POSTMASTER_FD_WATCH], F_SETFD, FD_CLOEXEC) < 0)
 		ereport(FATAL,
 				(errcode_for_socket_access(),
 				 errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
 #endif
+#endif
 }
 
 /*
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index e21d0f05f5..5cd5acf98e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -123,6 +123,10 @@
    to 0 if you don't. */
 #undef HAVE_DECL_LLVMORCGETSYMBOLADDRESSIN
 
+/* Define to 1 if you have the declaration of `pipe2', and to 0 if you don't.
+   */
+#undef HAVE_DECL_PIPE2
+
 /* Define to 1 if you have the declaration of `posix_fadvise', and to 0 if you
    don't. */
 #undef HAVE_DECL_POSIX_FADVISE
-- 
2.39.1

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#12)
Re: File descriptors in exec'd subprocesses

I pushed the libpq changes. I'll leave the pipe2 and accept4 changes
on ice for now, maybe for a later cycle (unlike the committed patches,
they don't currently fix a known problem, they just avoid some
syscalls that are already fairly rare). For the libpq change, the
build farm seems happy so far. I was a little worried that there
could be ways that #ifdef SOCK_CLOEXEC could be true for a build that
might encounter a too-old kernel and break, but it looks like you'd
have to go so far back into EOL'd releases that even our zombie build
farm animals have it. Only macOS and AIX don't have it yet, and this
should be fine with Apple's availability guards, which leaves just
AIX. (AFAIK headers and kernels are always in sync at the major
version level on AIX, but if not presumably there'd have to be some
similar guard system?)