postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Started by Andres Freundover 3 years ago18 messages
#1Andres Freund
andres@anarazel.de

Hi,

I amd working on adding an "installcheck" equivalent mode to the meson
build. One invocation of something "installcheck-world" like lead to things
getting stuck.

Lots of log messages like:

2022-09-25 16:16:30.999 PDT [2705454][client backend][28/1112:41269][pg_regress] LOG: still waiting for backend with PID 2705178 to accept ProcSignalBarrier
2022-09-25 16:16:30.999 PDT [2705454][client backend][28/1112:41269][pg_regress] STATEMENT: DROP DATABASE IF EXISTS "regression_test_parser_regress"
2022-09-25 16:16:31.006 PDT [2705472][client backend][22/3699:41294][pg_regress] LOG: still waiting for backend with PID 2705178 to accept ProcSignalBarrier
2022-09-25 16:16:31.006 PDT [2705472][client backend][22/3699:41294][pg_regress] STATEMENT: DROP DATABASE IF EXISTS "regression_test_predtest_regress"

a stacktrace of 2705178 shows:

(gdb) bt
#0 0x00007f67d26fe1b3 in __GI___poll (fds=fds@entry=0x7ffebe187c88, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f67cfd03c1c in pqSocketPoll (sock=<optimized out>, forRead=forRead@entry=1, forWrite=forWrite@entry=0, end_time=end_time@entry=-1)
at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1125
#2 0x00007f67cfd04310 in pqSocketCheck (conn=conn@entry=0x562f875a9b70, forRead=forRead@entry=1, forWrite=forWrite@entry=0, end_time=end_time@entry=-1)
at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1066
#3 0x00007f67cfd043fd in pqWaitTimed (forRead=forRead@entry=1, forWrite=forWrite@entry=0, conn=conn@entry=0x562f875a9b70, finish_time=finish_time@entry=-1)
at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:998
#4 0x00007f67cfcfc47b in connectDBComplete (conn=conn@entry=0x562f875a9b70) at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-connect.c:2166
#5 0x00007f67cfcfe248 in PQconnectdbParams (keywords=keywords@entry=0x562f87613d20, values=values@entry=0x562f87613d70, expand_dbname=expand_dbname@entry=0)
at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-connect.c:659
#6 0x00007f67cfd29536 in connect_pg_server (server=server@entry=0x562f876139b0, user=user@entry=0x562f87613980)
at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:474
#7 0x00007f67cfd29910 in make_new_connection (entry=entry@entry=0x562f8758b2c8, user=user@entry=0x562f87613980)
at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:344
#8 0x00007f67cfd29da0 in GetConnection (user=0x562f87613980, will_prep_stmt=will_prep_stmt@entry=false, state=state@entry=0x562f876136a0)
at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:204
#9 0x00007f67cfd35294 in postgresBeginForeignScan (node=0x562f87612e70, eflags=<optimized out>)

and it turns out that backend can't be graciously be terminated.

Maybe I am missing something, but I don't think it's OK for
connect_pg_server() to connect in a blocking manner, without accepting
interrupts?

Greetings,

Andres Freund

#2Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
1 attachment(s)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

I just re-discovered this issue, via
/messages/by-id/20221209003607.bz2zdznvfnkq4zz3@awork3.anarazel.de

On 2022-09-25 16:22:37 -0700, Andres Freund wrote:

Maybe I am missing something, but I don't think it's OK for
connect_pg_server() to connect in a blocking manner, without accepting
interrupts?

It's definitely not. This basically means network issues or such can lead to
connections being unkillable...

We know how to do better, c.f. libpqrcv_connect(). I hacked that up for
postgres_fdw, and got through quite a few runs without related issues ([1]I eventually encountered a deadlock in REINDEX, but it didn't involve postgres_fw / ProcSignalBarrier).

The same problem is present in two places in dblink.c. Obviously we can copy
and paste the code to dblink.c as well. But that means we have the same not
quite trivial code in three different c files. There's already a fair bit of
duplicated code around AcquireExternalFD().

It seems we should find a place to put backend specific libpq wrapper code
somewhere. Unless we want to relax the rule about not linking libpq into the
backend we can't just put it in the backend directly, though.

The only alternative way to provide a wrapper that I can think of are to
a) introduce a new static library that can be linked to by libpqwalreceiver,
postgres_fdw, dblink
b) add a header with static inline functions implementing interrupt-processing
connection establishment for libpq

Neither really has precedent.

The attached quick patch just adds and uses libpq_connect_interruptible() in
postgres_fdw. If we wanted to move this somewhere generic, at least part of
the external FD handling should also be moved into it.

dblink.c uses a lot of other blocking libpq functions, which obviously also
isn't ok.

Perhaps we could somehow make this easier from within libpq? My first thought
was a connection parameter that'd provide a different implementation of
pqSocketCheck() or such - but I don't think that'd work, because we might
throw errors when processing interrupts, and that'd not be ok from deep within
libpq.

Greetings,

Andres Freund

[1]: I eventually encountered a deadlock in REINDEX, but it didn't involve postgres_fw / ProcSignalBarrier
postgres_fw / ProcSignalBarrier

Attachments:

postgres_fdw_libpq_connect.difftext/x-diff; charset=us-asciiDownload
diff --git i/contrib/postgres_fdw/connection.c w/contrib/postgres_fdw/connection.c
index f0c45b00db8..f9da564a539 100644
--- i/contrib/postgres_fdw/connection.c
+++ w/contrib/postgres_fdw/connection.c
@@ -347,6 +347,67 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
 		 entry->conn, server->servername, user->umid, user->userid);
 }
 
+/*
+ * PQconnectStartParams() wrapper that processes interrupts. Backend code
+ * should *never* enter blocking libpq code as that would prevent
+ * cancellations, global barriers etc from being processed.
+ */
+static PGconn *
+libpq_connect_interruptible(const char *const *keywords,
+							const char *const *values,
+							int expand_dbname,
+							uint32 wait_event_info)
+{
+	PGconn *conn;
+	PostgresPollingStatusType status;
+
+	conn = PQconnectStartParams(keywords, values, false);
+
+	if (!conn)
+		return NULL;
+
+	/*
+	 * Poll connection until we have OK or FAILED status.
+	 *
+	 * Per spec for PQconnectPoll, first wait till socket is write-ready.
+	 */
+	status = PGRES_POLLING_WRITING;
+	while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED)
+	{
+		int			io_flag;
+		int			rc;
+
+		if (status == PGRES_POLLING_READING)
+			io_flag = WL_SOCKET_READABLE;
+#ifdef WIN32
+		/* Windows needs a different test while waiting for connection-made */
+		else if (PQstatus(conn->streamConn) == CONNECTION_STARTED)
+			io_flag = WL_SOCKET_CONNECTED;
+#endif
+		else
+			io_flag = WL_SOCKET_WRITEABLE;
+
+		rc = WaitLatchOrSocket(MyLatch,
+							   WL_EXIT_ON_PM_DEATH | WL_LATCH_SET | io_flag,
+							   PQsocket(conn),
+							   0,
+							   wait_event_info);
+
+		/* Interrupted? */
+		if (rc & WL_LATCH_SET)
+		{
+			ResetLatch(MyLatch);
+			CHECK_FOR_INTERRUPTS();
+		}
+
+		/* If socket is ready, advance the libpq state machine */
+		if (rc & io_flag)
+			status = PQconnectPoll(conn);
+	}
+
+	return conn;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
@@ -471,7 +532,8 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		}
 
 		/* OK to make connection */
-		conn = PQconnectdbParams(keywords, values, false);
+		conn = libpq_connect_interruptible(keywords, values,
+										   false, PG_WAIT_EXTENSION);
 
 		if (!conn)
 			ReleaseExternalFD();	/* because the PG_CATCH block won't */
#3Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

On 2022-12-08 18:08:15 -0800, Andres Freund wrote:

I just re-discovered this issue, via
/messages/by-id/20221209003607.bz2zdznvfnkq4zz3@awork3.anarazel.de

On 2022-09-25 16:22:37 -0700, Andres Freund wrote:

Maybe I am missing something, but I don't think it's OK for
connect_pg_server() to connect in a blocking manner, without accepting
interrupts?

It's definitely not. This basically means network issues or such can lead to
connections being unkillable...

We know how to do better, c.f. libpqrcv_connect(). I hacked that up for
postgres_fdw, and got through quite a few runs without related issues ([1]).

The same problem is present in two places in dblink.c. Obviously we can copy
and paste the code to dblink.c as well. But that means we have the same not
quite trivial code in three different c files. There's already a fair bit of
duplicated code around AcquireExternalFD().

It seems we should find a place to put backend specific libpq wrapper code
somewhere. Unless we want to relax the rule about not linking libpq into the
backend we can't just put it in the backend directly, though.

The only alternative way to provide a wrapper that I can think of are to
a) introduce a new static library that can be linked to by libpqwalreceiver,
postgres_fdw, dblink
b) add a header with static inline functions implementing interrupt-processing
connection establishment for libpq

Neither really has precedent.

The attached quick patch just adds and uses libpq_connect_interruptible() in
postgres_fdw. If we wanted to move this somewhere generic, at least part of
the external FD handling should also be moved into it.

dblink.c uses a lot of other blocking libpq functions, which obviously also
isn't ok.

Perhaps we could somehow make this easier from within libpq? My first thought
was a connection parameter that'd provide a different implementation of
pqSocketCheck() or such - but I don't think that'd work, because we might
throw errors when processing interrupts, and that'd not be ok from deep within
libpq.

Any opinions? Due to the simplicity I'm currently leaning to a header-only
helper, but I don't feel confident about it.

The rate of cfbot failures is high enough that it'd be good to do something
about this.

Greetings,

Andres Freund

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#3)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-12-08 18:08:15 -0800, Andres Freund wrote:

On 2022-09-25 16:22:37 -0700, Andres Freund wrote:
The only alternative way to provide a wrapper that I can think of are to
a) introduce a new static library that can be linked to by libpqwalreceiver,
postgres_fdw, dblink
b) add a header with static inline functions implementing interrupt-processing
connection establishment for libpq

Neither really has precedent.

Any opinions? Due to the simplicity I'm currently leaning to a header-only
helper, but I don't feel confident about it.

The header idea is a little bit sneaky (IIUC: a header that is part of
the core tree, but can't be used by core and possibly needs special
treatment in 'headercheck' to get the right include search path, can
only be used by libpqwalreceiver et al which are allowed to link to
libpq), but I think it is compatible with other goals we have
discussed in other threads. I think in the near future we'll probably
remove the concept of non-threaded server builds (as proposed before
in the post HP-UX 10 cleanup thread, with patches, but not quite over
the line yet). Then I think the server could be allowed to link libpq
directly? And at that point this code wouldn't be sneaky anymore and
could optionally move into a .c. Does that makes sense?

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#4)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Thomas Munro <thomas.munro@gmail.com> writes:

The header idea is a little bit sneaky (IIUC: a header that is part of
the core tree, but can't be used by core and possibly needs special
treatment in 'headercheck' to get the right include search path, can
only be used by libpqwalreceiver et al which are allowed to link to
libpq), but I think it is compatible with other goals we have
discussed in other threads. I think in the near future we'll probably
remove the concept of non-threaded server builds (as proposed before
in the post HP-UX 10 cleanup thread, with patches, but not quite over
the line yet). Then I think the server could be allowed to link libpq
directly? And at that point this code wouldn't be sneaky anymore and
could optionally move into a .c. Does that makes sense?

I don't like the idea of linking libpq directly into the backend.
It should remain a dynamically-loaded library to avoid problems
during software updates.

regards, tom lane

#6Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#4)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

On 2022-12-30 10:31:22 +1300, Thomas Munro wrote:

On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-12-08 18:08:15 -0800, Andres Freund wrote:

On 2022-09-25 16:22:37 -0700, Andres Freund wrote:
The only alternative way to provide a wrapper that I can think of are to
a) introduce a new static library that can be linked to by libpqwalreceiver,
postgres_fdw, dblink
b) add a header with static inline functions implementing interrupt-processing
connection establishment for libpq

Neither really has precedent.

Any opinions? Due to the simplicity I'm currently leaning to a header-only
helper, but I don't feel confident about it.

The header idea is a little bit sneaky (IIUC: a header that is part of
the core tree, but can't be used by core and possibly needs special
treatment in 'headercheck' to get the right include search path, can
only be used by libpqwalreceiver et al which are allowed to link to
libpq), but I think it is compatible with other goals we have
discussed in other threads.

Hm, what special search path / headerscheck magic are you thinking of? I think
something like src/include/libpq/libpq-be-fe-helpers.h defining a bunch of
static inlines should "just" work?

We likely could guard against that header being included from code ending up
in the postgres binary directly by #error'ing if BUILDING_DLL is
defined. That's a very badly named define, but it IIRC has to be iff building
code ending up in postgres directly.

I think in the near future we'll probably remove the concept of non-threaded
server builds (as proposed before in the post HP-UX 10 cleanup thread, with
patches, but not quite over the line yet). Then I think the server could be
allowed to link libpq directly? And at that point this code wouldn't be
sneaky anymore and could optionally move into a .c. Does that makes sense?

I was wondering about linking in libpq directly as well. But I am not sure
it's a good idea. I suspect we'd run into some issues around libraries
(including extensions) linking to different versions of libpq etc - if we
directly link to libpq that'd end up in tears.

It might be a different story if we had a version of libpq built with
different symbol names etc. But that's not exactly trivial either.

Greetings,

Andres Freund

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#6)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

On Fri, Dec 30, 2022 at 10:54 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-12-30 10:31:22 +1300, Thomas Munro wrote:

On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-12-08 18:08:15 -0800, Andres Freund wrote:

On 2022-09-25 16:22:37 -0700, Andres Freund wrote:
The only alternative way to provide a wrapper that I can think of are to
a) introduce a new static library that can be linked to by libpqwalreceiver,
postgres_fdw, dblink
b) add a header with static inline functions implementing interrupt-processing
connection establishment for libpq

Neither really has precedent.

Any opinions? Due to the simplicity I'm currently leaning to a header-only
helper, but I don't feel confident about it.

The header idea is a little bit sneaky (IIUC: a header that is part of
the core tree, but can't be used by core and possibly needs special
treatment in 'headercheck' to get the right include search path, can
only be used by libpqwalreceiver et al which are allowed to link to
libpq), but I think it is compatible with other goals we have
discussed in other threads.

Hm, what special search path / headerscheck magic are you thinking of? I think
something like src/include/libpq/libpq-be-fe-helpers.h defining a bunch of
static inlines should "just" work?

Oh, I was imagining something slightly different. Not something under
src/include/libpq, but conceptually a separate header-only library
that is above both the backend and libpq. Maybe something like
src/include/febe_util/libpq_connect_interruptible.h. In other words,
I thought your idea b was a header-only version of your idea a. I
think that might be a bit nicer than putting it under libpq?
Superficial difference, perhaps...

And then I assumed that headerscheck would need to be told to add
libpq's header location in -I for that header, but on closer
inspection it already adds that unconditionally so I retract that
comment.

I think in the near future we'll probably remove the concept of non-threaded
server builds (as proposed before in the post HP-UX 10 cleanup thread, with
patches, but not quite over the line yet). Then I think the server could be
allowed to link libpq directly? And at that point this code wouldn't be
sneaky anymore and could optionally move into a .c. Does that makes sense?

I was wondering about linking in libpq directly as well. But I am not sure
it's a good idea. I suspect we'd run into some issues around libraries
(including extensions) linking to different versions of libpq etc - if we
directly link to libpq that'd end up in tears.

It might be a different story if we had a version of libpq built with
different symbol names etc. But that's not exactly trivial either.

Hmm, yeah. Some interesting things to think about. Whether it's a
feature or an accident that new backends can pick up new libpq minor
updates without restarting the postmaster, and how we'd manage a
future libpq major version/ABI break. Getting a bit off topic for
this thread I suppose.

#8Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#7)
1 attachment(s)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

On 2022-12-30 14:14:55 +1300, Thomas Munro wrote:

Oh, I was imagining something slightly different. Not something under
src/include/libpq, but conceptually a separate header-only library
that is above both the backend and libpq. Maybe something like
src/include/febe_util/libpq_connect_interruptible.h. In other words,
I thought your idea b was a header-only version of your idea a. I
think that might be a bit nicer than putting it under libpq?
Superficial difference, perhaps...

It doesn't seem entirely right to introduce a top-level "module" for
libpq-in-extensions to me - we don't do that for other APIs used for
extensions. But a header only library also doesn't quite seem right. So ...

Looking at turning my patch upthread into something slightly less prototype-y,
I noticed that libpqwalreceiver doesn't do AcquireExternalFD(), added to other
backend uses of libpq in 3d475515a15. It's unlikely to matter for
walreceiver.c itelf, but it seems problematic for the logical replication
cases?

It's annoying to introduce PG_TRY/CATCH blocks, just to deal with the
potential for errors inside WaitLatchOrSocket(), which should never happen. I
wonder if we should consider making latch.c error out fatally, instead of
elog(ERROR). If latches are broken, things are bad.

The PG_CATCH() logic in postgres_fdw's GetConnection() looks quite suspicious
to me. It looks like 32a9c0bdf493 took entirely the wrong path. Instead of
optionally not throwing or directly re-establishing connections in
begin_remote_xact(), the PG_CATCH() hammer was applied.

The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver,
dblink, postgres_fdw.

As I made libpq-be-fe-helpers.h handle reserving external fds,
libpqwalreceiver now does so. I briefly looked through its users without
seeing cases of leaking in case of errors - which would already have been bad,
since we'd already have leaked a libpq connection/socket.

Given the lack of field complaints and the size of the required changes, I
don't think we should backpatch this, even though it's pretty clearly buggy
as-is.

Some time back Thomas had a patch to introduce a wrapper around
libpq-in-extensions that fixed issues cause by some events being
edge-triggered on windows. It's possible that combining these two efforts
would yield something better. I resisted the urge to create a wrapper around
each connection in this patch, as it'd have ended up being a whole lot more
invasive. But... Thomas, do you have a reference to that code handy?

Greetings,

Andres Freund

Attachments:

v2-0001-wip-don-t-block-inside-PQconnectdb-et-al.patchtext/x-diff; charset=us-asciiDownload
From 34e5ea311322b3bc3bb0f8c925f9ade1a59c6f09 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 3 Jan 2023 11:31:04 -0800
Subject: [PATCH v2] wip: don't block inside PQconnectdb et al

---
 src/include/libpq/libpq-be-fe-helpers.h       | 233 ++++++++++++++++++
 .../libpqwalreceiver/libpqwalreceiver.c       |  53 +---
 contrib/dblink/dblink.c                       |  80 +-----
 contrib/postgres_fdw/connection.c             |  42 +---
 4 files changed, 256 insertions(+), 152 deletions(-)
 create mode 100644 src/include/libpq/libpq-be-fe-helpers.h

diff --git a/src/include/libpq/libpq-be-fe-helpers.h b/src/include/libpq/libpq-be-fe-helpers.h
new file mode 100644
index 00000000000..4f3d3b821f0
--- /dev/null
+++ b/src/include/libpq/libpq-be-fe-helpers.h
@@ -0,0 +1,233 @@
+/*-------------------------------------------------------------------------
+ *
+ * libpq-be-fe-helpers.h
+ *	  Helper functions for using libpq in extensions
+ *
+ * Code built directly into the backend is not allowed to link to libpq
+ * directly. Extension code is allowed to use libpq however. However, libpq
+ * used in extensions has to be careful to block inside libpq, otherwise
+ * interrupts will not be processed, leading to issues like unresolvable
+ * deadlocks. Backend code also needs to take care to acquire/release an
+ * external fd for the connection, otherwise fd.c's accounting of fd's is
+ * broken.
+ *
+ * This file provides helper functions to make it easier to comply with these
+ * rules. It is a header only library as it needs to be linked into each
+ * extension using libpq, and it seems too small to be worth adding a
+ * dedicated static library for.
+ *
+ * TODO: For historical reasons the connections established here are not put
+ * into non-blocking mode. That can lead to blocking even when only the async
+ * libpq functions are used. This should be fixed.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/libpq/libpq-be-fe-helpers.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LIBPQ_BE_FE_HELPERS_H
+#define LIBPQ_BE_FE_HELPERS_H
+
+/*
+ * Despite the name, BUILDING_DLL is set only when building code directly part
+ * of the backend. Which also is where libpq isn't allowed to be
+ * used. Obviously this doesn't protect against libpq-fe.h getting included
+ * otherwise, but perhaps still protects against a few mistakes...
+ */
+#ifdef BUILDING_DLL
+#error "libpq may not be used code directly built into the backend"
+#endif
+
+#include "libpq-fe.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/latch.h"
+
+
+static inline void libpqsrv_connect_prepare(void);
+static inline void libpqsrv_connect_internal(PGconn *conn, uint32 wait_event_info);
+
+/*
+ * PQfinish() wrapper that additionally releases the reserved file descriptor.
+ *
+ * It is allowed to call this with a NULL pgconn iff returned by
+ * libpqsrv_connect*.
+ */
+static inline void
+libpqsrv_disconnect(PGconn *conn)
+{
+	/*
+	 * If no connection was established, we haven't reserved an FD for it (or
+	 * already released it). This rule makes it easier to write PG_CATCH()
+	 * handlers for this facility's users.
+	 *
+	 * See also libpqsrv_connect_internal().
+	 */
+	if (conn == NULL)
+		return;
+
+	ReleaseExternalFD();
+	PQfinish(conn);
+}
+
+/*
+ * PQconnectdb() wrapper that reserves a file descriptor and processes
+ * interrupts.
+ */
+static inline PGconn *
+libpqsrv_connect(const char *conninfo, uint32 wait_event_info)
+{
+	PGconn	   *conn = NULL;
+
+	libpqsrv_connect_prepare();
+
+	conn = PQconnectStart(conninfo);
+
+	libpqsrv_connect_internal(conn, wait_event_info);
+
+	return conn;
+}
+
+/*
+ * PQconnectdbParams() wrapper that reserves a file descriptor and processes
+ * interrupts.
+ */
+static inline PGconn *
+libpqsrv_connect_params(const char *const *keywords,
+						const char *const *values,
+						int expand_dbname,
+						uint32 wait_event_info)
+{
+	PGconn	   *conn = NULL;
+
+	libpqsrv_connect_prepare();
+
+	conn = PQconnectStartParams(keywords, values, expand_dbname);
+
+	libpqsrv_connect_internal(conn, wait_event_info);
+
+	return conn;
+}
+
+/*
+ * Helper function for all connection establishment functions.
+ */
+static inline void
+libpqsrv_connect_prepare(void)
+{
+	/*
+	 * We must obey fd.c's limit on non-virtual file descriptors.  Assume that
+	 * a PGconn represents one long-lived FD.  (Doing this here also ensures
+	 * that VFDs are closed if needed to make room.)
+	 */
+	if (!AcquireExternalFD())
+	{
+#ifndef WIN32					/* can't write #if within ereport() macro */
+		ereport(ERROR,
+				(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+				 errmsg("could not establish connection"),
+				 errdetail("There are too many open files on the local server."),
+				 errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")));
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+				 errmsg("could not establish connection"),
+				 errdetail("There are too many open files on the local server."),
+				 errhint("Raise the server's max_files_per_process setting.")));
+#endif
+	}
+
+}
+
+/*
+ * Helper function for all connection establishment functions.
+ */
+static inline void
+libpqsrv_connect_internal(PGconn *conn, uint32 wait_event_info)
+{
+	/*
+	 * With conn == NULL libpqsrv_disconnect() wouldn't release the FD. So do
+	 * that here.
+	 */
+	if (conn == NULL)
+	{
+		ReleaseExternalFD();
+		return;
+	}
+
+	/*
+	 * Can't wait without a socket. Note that we don't want to close the libpq
+	 * connection yet, so callers can emit a useful error.
+	 */
+	if (PQstatus(conn) == CONNECTION_BAD)
+		return;
+
+	/*
+	 * WaitLatchOrSocket() can conceivably fail, handle this case here instead
+	 * of requiring all callers to do so.
+	 */
+	PG_TRY();
+	{
+		PostgresPollingStatusType status;
+
+		/*
+		 * Poll connection until we have OK or FAILED status.
+		 *
+		 * Per spec for PQconnectPoll, first wait till socket is write-ready.
+		 */
+		status = PGRES_POLLING_WRITING;
+		while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED)
+		{
+			int			io_flag;
+			int			rc;
+
+			if (status == PGRES_POLLING_READING)
+				io_flag = WL_SOCKET_READABLE;
+#ifdef WIN32
+
+			/*
+			 * Windows needs a different test while waiting for
+			 * connection-made
+			 */
+			else if (PQstatus(conn) == CONNECTION_STARTED)
+				io_flag = WL_SOCKET_CONNECTED;
+#endif
+			else
+				io_flag = WL_SOCKET_WRITEABLE;
+
+			rc = WaitLatchOrSocket(MyLatch,
+								   WL_EXIT_ON_PM_DEATH | WL_LATCH_SET | io_flag,
+								   PQsocket(conn),
+								   0,
+								   wait_event_info);
+
+			/* Interrupted? */
+			if (rc & WL_LATCH_SET)
+			{
+				ResetLatch(MyLatch);
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/* If socket is ready, advance the libpq state machine */
+			if (rc & io_flag)
+				status = PQconnectPoll(conn);
+		}
+	}
+	PG_CATCH();
+	{
+		/*
+		 * If an error is thrown here, the callers won't call
+		 * libpqsrv_disconnect() with a conn, so release resources
+		 * immediately.
+		 */
+		ReleaseExternalFD();
+		PQfinish(conn);
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+#endif							/* LIBPQ_BE_FE_HELPERS_H */
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 219cd73b7fc..bab2fe29562 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -24,6 +24,7 @@
 #include "common/connect.h"
 #include "funcapi.h"
 #include "libpq-fe.h"
+#include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -125,7 +126,6 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 				 char **err)
 {
 	WalReceiverConn *conn;
-	PostgresPollingStatusType status;
 	const char *keys[6];
 	const char *vals[6];
 	int			i = 0;
@@ -172,52 +172,9 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	Assert(i < sizeof(keys));
 
 	conn = palloc0(sizeof(WalReceiverConn));
-	conn->streamConn = PQconnectStartParams(keys, vals,
-											 /* expand_dbname = */ true);
-	if (PQstatus(conn->streamConn) == CONNECTION_BAD)
-	{
-		*err = pchomp(PQerrorMessage(conn->streamConn));
-		return NULL;
-	}
-
-	/*
-	 * Poll connection until we have OK or FAILED status.
-	 *
-	 * Per spec for PQconnectPoll, first wait till socket is write-ready.
-	 */
-	status = PGRES_POLLING_WRITING;
-	do
-	{
-		int			io_flag;
-		int			rc;
-
-		if (status == PGRES_POLLING_READING)
-			io_flag = WL_SOCKET_READABLE;
-#ifdef WIN32
-		/* Windows needs a different test while waiting for connection-made */
-		else if (PQstatus(conn->streamConn) == CONNECTION_STARTED)
-			io_flag = WL_SOCKET_CONNECTED;
-#endif
-		else
-			io_flag = WL_SOCKET_WRITEABLE;
-
-		rc = WaitLatchOrSocket(MyLatch,
-							   WL_EXIT_ON_PM_DEATH | WL_LATCH_SET | io_flag,
-							   PQsocket(conn->streamConn),
-							   0,
-							   WAIT_EVENT_LIBPQWALRECEIVER_CONNECT);
-
-		/* Interrupted? */
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			ProcessWalRcvInterrupts();
-		}
-
-		/* If socket is ready, advance the libpq state machine */
-		if (rc & io_flag)
-			status = PQconnectPoll(conn->streamConn);
-	} while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED);
+	conn->streamConn = libpqsrv_connect_params(keys, vals,
+											    /* expand_dbname = */ true,
+											   WAIT_EVENT_LIBPQWALRECEIVER_CONNECT);
 
 	if (PQstatus(conn->streamConn) != CONNECTION_OK)
 	{
@@ -740,7 +697,7 @@ libpqrcv_PQgetResult(PGconn *streamConn)
 static void
 libpqrcv_disconnect(WalReceiverConn *conn)
 {
-	PQfinish(conn->streamConn);
+	libpqsrv_disconnect(conn->streamConn);
 	PQfreemem(conn->recvBuf);
 	pfree(conn);
 }
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 8dd122042b4..130320db571 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -48,6 +48,7 @@
 #include "funcapi.h"
 #include "lib/stringinfo.h"
 #include "libpq-fe.h"
+#include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "parser/scansup.h"
@@ -59,6 +60,7 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
+#include "utils/wait_event.h"
 
 PG_MODULE_MAGIC;
 
@@ -199,37 +201,14 @@ dblink_get_conn(char *conname_or_str,
 			connstr = conname_or_str;
 		dblink_connstr_check(connstr);
 
-		/*
-		 * We must obey fd.c's limit on non-virtual file descriptors.  Assume
-		 * that a PGconn represents one long-lived FD.  (Doing this here also
-		 * ensures that VFDs are closed if needed to make room.)
-		 */
-		if (!AcquireExternalFD())
-		{
-#ifndef WIN32					/* can't write #if within ereport() macro */
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not establish connection"),
-					 errdetail("There are too many open files on the local server."),
-					 errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")));
-#else
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not establish connection"),
-					 errdetail("There are too many open files on the local server."),
-					 errhint("Raise the server's max_files_per_process setting.")));
-#endif
-		}
-
 		/* OK to make connection */
-		conn = PQconnectdb(connstr);
+		conn = libpqsrv_connect(connstr, PG_WAIT_EXTENSION);
 
 		if (PQstatus(conn) == CONNECTION_BAD)
 		{
 			char	   *msg = pchomp(PQerrorMessage(conn));
 
-			PQfinish(conn);
-			ReleaseExternalFD();
+			libpqsrv_disconnect(conn);
 			ereport(ERROR,
 					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
 					 errmsg("could not establish connection"),
@@ -312,36 +291,13 @@ dblink_connect(PG_FUNCTION_ARGS)
 	/* check password in connection string if not superuser */
 	dblink_connstr_check(connstr);
 
-	/*
-	 * We must obey fd.c's limit on non-virtual file descriptors.  Assume that
-	 * a PGconn represents one long-lived FD.  (Doing this here also ensures
-	 * that VFDs are closed if needed to make room.)
-	 */
-	if (!AcquireExternalFD())
-	{
-#ifndef WIN32					/* can't write #if within ereport() macro */
-		ereport(ERROR,
-				(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				 errmsg("could not establish connection"),
-				 errdetail("There are too many open files on the local server."),
-				 errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")));
-#else
-		ereport(ERROR,
-				(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				 errmsg("could not establish connection"),
-				 errdetail("There are too many open files on the local server."),
-				 errhint("Raise the server's max_files_per_process setting.")));
-#endif
-	}
-
 	/* OK to make connection */
-	conn = PQconnectdb(connstr);
+	conn = libpqsrv_connect(connstr, PG_WAIT_EXTENSION);
 
 	if (PQstatus(conn) == CONNECTION_BAD)
 	{
 		msg = pchomp(PQerrorMessage(conn));
-		PQfinish(conn);
-		ReleaseExternalFD();
+		libpqsrv_disconnect(conn);
 		if (rconn)
 			pfree(rconn);
 
@@ -366,10 +322,7 @@ dblink_connect(PG_FUNCTION_ARGS)
 	else
 	{
 		if (pconn->conn)
-		{
-			PQfinish(pconn->conn);
-			ReleaseExternalFD();
-		}
+			libpqsrv_disconnect(conn);
 		pconn->conn = conn;
 	}
 
@@ -402,8 +355,7 @@ dblink_disconnect(PG_FUNCTION_ARGS)
 	if (!conn)
 		dblink_conn_not_avail(conname);
 
-	PQfinish(conn);
-	ReleaseExternalFD();
+	libpqsrv_disconnect(conn);
 	if (rconn)
 	{
 		deleteConnection(conname);
@@ -838,10 +790,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 	{
 		/* if needed, close the connection to the database */
 		if (freeconn)
-		{
-			PQfinish(conn);
-			ReleaseExternalFD();
-		}
+			libpqsrv_disconnect(conn);
 	}
 	PG_END_TRY();
 
@@ -1516,10 +1465,7 @@ dblink_exec(PG_FUNCTION_ARGS)
 	{
 		/* if needed, close the connection to the database */
 		if (freeconn)
-		{
-			PQfinish(conn);
-			ReleaseExternalFD();
-		}
+			libpqsrv_disconnect(conn);
 	}
 	PG_END_TRY();
 
@@ -2606,8 +2552,7 @@ createNewConnection(const char *name, remoteConn *rconn)
 
 	if (found)
 	{
-		PQfinish(rconn->conn);
-		ReleaseExternalFD();
+		libpqsrv_disconnect(rconn->conn);
 		pfree(rconn);
 
 		ereport(ERROR,
@@ -2647,8 +2592,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn)
 	{
 		if (!PQconnectionUsedPassword(conn))
 		{
-			PQfinish(conn);
-			ReleaseExternalFD();
+			libpqsrv_disconnect(conn);
 			if (rconn)
 				pfree(rconn);
 
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ed75ce3f79c..7760380f00d 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -17,6 +17,7 @@
 #include "catalog/pg_user_mapping.h"
 #include "commands/defrem.h"
 #include "funcapi.h"
+#include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -446,35 +447,10 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		/* verify the set of connection parameters */
 		check_conn_params(keywords, values, user);
 
-		/*
-		 * We must obey fd.c's limit on non-virtual file descriptors.  Assume
-		 * that a PGconn represents one long-lived FD.  (Doing this here also
-		 * ensures that VFDs are closed if needed to make room.)
-		 */
-		if (!AcquireExternalFD())
-		{
-#ifndef WIN32					/* can't write #if within ereport() macro */
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not connect to server \"%s\"",
-							server->servername),
-					 errdetail("There are too many open files on the local server."),
-					 errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")));
-#else
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not connect to server \"%s\"",
-							server->servername),
-					 errdetail("There are too many open files on the local server."),
-					 errhint("Raise the server's max_files_per_process setting.")));
-#endif
-		}
-
 		/* OK to make connection */
-		conn = PQconnectdbParams(keywords, values, false);
-
-		if (!conn)
-			ReleaseExternalFD();	/* because the PG_CATCH block won't */
+		conn = libpqsrv_connect_params(keywords, values,
+									    /* expand_dbname = */ false,
+									   PG_WAIT_EXTENSION);
 
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 			ereport(ERROR,
@@ -507,12 +483,7 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 	}
 	PG_CATCH();
 	{
-		/* Release PGconn data structure if we managed to create one */
-		if (conn)
-		{
-			PQfinish(conn);
-			ReleaseExternalFD();
-		}
+		libpqsrv_disconnect(conn);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
@@ -528,9 +499,8 @@ disconnect_pg_server(ConnCacheEntry *entry)
 {
 	if (entry->conn != NULL)
 	{
-		PQfinish(entry->conn);
+		libpqsrv_disconnect(entry->conn);
 		entry->conn = NULL;
-		ReleaseExternalFD();
 	}
 }
 
-- 
2.38.0

#9Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

On Sun, Sep 25, 2022 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:

Maybe I am missing something, but I don't think it's OK for
connect_pg_server() to connect in a blocking manner, without accepting
interrupts?

I remember noticing various problems in this area years ago. I'm not
sure whether I noticed this particular one, but the commit message for
ae9bfc5d65123aaa0d1cca9988037489760bdeae mentions a few others that I
did notice.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#8)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

On 2023-01-03 12:05:20 -0800, Andres Freund wrote:

The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver,
dblink, postgres_fdw.

As I made libpq-be-fe-helpers.h handle reserving external fds,
libpqwalreceiver now does so. I briefly looked through its users without
seeing cases of leaking in case of errors - which would already have been bad,
since we'd already have leaked a libpq connection/socket.

Given the lack of field complaints and the size of the required changes, I
don't think we should backpatch this, even though it's pretty clearly buggy
as-is.

Any comments on this? Otherwise I think I'll go with this approach.

Greetings,

Andres Freund

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#10)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

On Tue, Jan 10, 2023 at 3:05 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-01-03 12:05:20 -0800, Andres Freund wrote:

The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver,
dblink, postgres_fdw.

As I made libpq-be-fe-helpers.h handle reserving external fds,
libpqwalreceiver now does so. I briefly looked through its users without
seeing cases of leaking in case of errors - which would already have been bad,
since we'd already have leaked a libpq connection/socket.

Given the lack of field complaints and the size of the required changes, I
don't think we should backpatch this, even though it's pretty clearly buggy
as-is.

Any comments on this? Otherwise I think I'll go with this approach.

+1. Not totally convinced about the location but we are free to
re-organise it any time, and the random CI failures are bad.

#12Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#11)
4 attachment(s)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

On 2023-01-14 14:45:06 +1300, Thomas Munro wrote:

On Tue, Jan 10, 2023 at 3:05 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-01-03 12:05:20 -0800, Andres Freund wrote:

The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver,
dblink, postgres_fdw.

As I made libpq-be-fe-helpers.h handle reserving external fds,
libpqwalreceiver now does so. I briefly looked through its users without
seeing cases of leaking in case of errors - which would already have been bad,
since we'd already have leaked a libpq connection/socket.

Given the lack of field complaints and the size of the required changes, I
don't think we should backpatch this, even though it's pretty clearly buggy
as-is.

Any comments on this? Otherwise I think I'll go with this approach.

+1. Not totally convinced about the location but we are free to
re-organise it any time, and the random CI failures are bad.

Cool.

Updated patch attached. I split it into multiple pieces.
1) A fix for [1]/messages/by-id/20230121011237.q52apbvlarfv6jm6@awork3.anarazel.de, included here because I encountered it while testing
2) Introduction of libpq-be-fe-helpers.h
3) Convert dblink and postgres_fdw to the helper
4) Convert libpqwalreceiver.c to the helper

Even if we eventually decide to backpatch 3), we'd likely not backpatch 4), as
there's no bug (although perhaps the lack of FD handling could be called a
bug?).

There's also some light polishing, improving commit message, comments and
moving some internal helper functions to later in the file.

Greetings,

Andres Freund

[1]: /messages/by-id/20230121011237.q52apbvlarfv6jm6@awork3.anarazel.de

Attachments:

v3-0001-Fix-error-handling-in-libpqrcv_connect.patchtext/x-diff; charset=us-asciiDownload
From da9345151423180732d2928fe7fe6ff0b6a11d4d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 18:27:05 -0800
Subject: [PATCH v3 1/4] Fix error handling in libpqrcv_connect()

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230121011237.q52apbvlarfv6jm6@awork3.anarazel.de
Backpatch:
---
 .../libpqwalreceiver/libpqwalreceiver.c       | 26 +++++++++++--------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index c40c6220db8..fefc8660259 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -175,10 +175,7 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	conn->streamConn = PQconnectStartParams(keys, vals,
 											 /* expand_dbname = */ true);
 	if (PQstatus(conn->streamConn) == CONNECTION_BAD)
-	{
-		*err = pchomp(PQerrorMessage(conn->streamConn));
-		return NULL;
-	}
+		goto bad_connection_errmsg;
 
 	/*
 	 * Poll connection until we have OK or FAILED status.
@@ -220,10 +217,7 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	} while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED);
 
 	if (PQstatus(conn->streamConn) != CONNECTION_OK)
-	{
-		*err = pchomp(PQerrorMessage(conn->streamConn));
-		return NULL;
-	}
+		goto bad_connection_errmsg;
 
 	if (logical)
 	{
@@ -234,9 +228,9 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 		{
 			PQclear(res);
-			ereport(ERROR,
-					(errmsg("could not clear search path: %s",
-							pchomp(PQerrorMessage(conn->streamConn)))));
+			*err = psprintf(_("could not clear search path: %s"),
+							pchomp(PQerrorMessage(conn->streamConn)));
+			goto bad_connection;
 		}
 		PQclear(res);
 	}
@@ -244,6 +238,16 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	conn->logical = logical;
 
 	return conn;
+
+	/* error path, using libpq's error message */
+bad_connection_errmsg:
+	*err = pchomp(PQerrorMessage(conn->streamConn));
+
+	/* error path, error already set */
+bad_connection:
+	PQfinish(conn->streamConn);
+	pfree(conn);
+	return NULL;
 }
 
 /*
-- 
2.38.0

v3-0002-Add-helper-library-for-use-of-libpq-inside-the-se.patchtext/x-diff; charset=us-asciiDownload
From 61531d075fc385bd22f7ec4cdb38e87da8aaec6d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 17:15:49 -0800
Subject: [PATCH v3 2/4] Add helper library for use of libpq inside the server
 environment

Currently dblink and postgres_fdw don't process interrupts during connection
establishment. Besides preventing query cancellations etc, this can lead to
undetected deadlocks, as global barriers are not processed.

Libpqwalreceiver in contrast, processes interrupts during connection
establishment. The required code is not trivial, so duplicating it into
additional places does not seem like a good option.

These aforementioned undetected deadlocks are the reason for the spate of CI
test failures in the FreeBSD 'test_running' step.

For now the helper library is just a header, as it needs to be linked into
each extension using libpq, and it seems too small to be worth adding a
dedicated static library for.

The conversion to the helper are done in subsequent commits.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20220925232237.p6uskba2dw6fnwj2@awork3.anarazel.de
---
 src/include/libpq/libpq-be-fe-helpers.h | 242 ++++++++++++++++++++++++
 1 file changed, 242 insertions(+)
 create mode 100644 src/include/libpq/libpq-be-fe-helpers.h

diff --git a/src/include/libpq/libpq-be-fe-helpers.h b/src/include/libpq/libpq-be-fe-helpers.h
new file mode 100644
index 00000000000..41e3bb4376a
--- /dev/null
+++ b/src/include/libpq/libpq-be-fe-helpers.h
@@ -0,0 +1,242 @@
+/*-------------------------------------------------------------------------
+ *
+ * libpq-be-fe-helpers.h
+ *	  Helper functions for using libpq in extensions
+ *
+ * Code built directly into the backend is not allowed to link to libpq
+ * directly. Extension code is allowed to use libpq however. However, libpq
+ * used in extensions has to be careful to block inside libpq, otherwise
+ * interrupts will not be processed, leading to issues like unresolvable
+ * deadlocks. Backend code also needs to take care to acquire/release an
+ * external fd for the connection, otherwise fd.c's accounting of fd's is
+ * broken.
+ *
+ * This file provides helper functions to make it easier to comply with these
+ * rules. It is a header only library as it needs to be linked into each
+ * extension using libpq, and it seems too small to be worth adding a
+ * dedicated static library for.
+ *
+ * TODO: For historical reasons the connections established here are not put
+ * into non-blocking mode. That can lead to blocking even when only the async
+ * libpq functions are used. This should be fixed.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/libpq/libpq-be-fe-helpers.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LIBPQ_BE_FE_HELPERS_H
+#define LIBPQ_BE_FE_HELPERS_H
+
+/*
+ * Despite the name, BUILDING_DLL is set only when building code directly part
+ * of the backend. Which also is where libpq isn't allowed to be
+ * used. Obviously this doesn't protect against libpq-fe.h getting included
+ * otherwise, but perhaps still protects against a few mistakes...
+ */
+#ifdef BUILDING_DLL
+#error "libpq may not be used code directly built into the backend"
+#endif
+
+#include "libpq-fe.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/latch.h"
+#include "utils/wait_event.h"
+
+
+static inline void libpqsrv_connect_prepare(void);
+static inline void libpqsrv_connect_internal(PGconn *conn, uint32 wait_event_info);
+
+
+/*
+ * PQconnectdb() wrapper that reserves a file descriptor and processes
+ * interrupts during connection establishment.
+ *
+ * Throws an error if AcquireExternalFD() fails, but does not throw if
+ * connection establishment itself fails. Callers need to use PQstatus() to
+ * check if connection establishment succeeded.
+ */
+static inline PGconn *
+libpqsrv_connect(const char *conninfo, uint32 wait_event_info)
+{
+	PGconn	   *conn = NULL;
+
+	libpqsrv_connect_prepare();
+
+	conn = PQconnectStart(conninfo);
+
+	libpqsrv_connect_internal(conn, wait_event_info);
+
+	return conn;
+}
+
+/*
+ * Like libpqsrv_connect(), except that this is a wrapper for
+ * PQconnectdbParams().
+  */
+static inline PGconn *
+libpqsrv_connect_params(const char *const *keywords,
+						const char *const *values,
+						int expand_dbname,
+						uint32 wait_event_info)
+{
+	PGconn	   *conn = NULL;
+
+	libpqsrv_connect_prepare();
+
+	conn = PQconnectStartParams(keywords, values, expand_dbname);
+
+	libpqsrv_connect_internal(conn, wait_event_info);
+
+	return conn;
+}
+
+/*
+ * PQfinish() wrapper that additionally releases the reserved file descriptor.
+ *
+ * It is allowed to call this with a NULL pgconn iff NULL was returned by
+ * libpqsrv_connect*.
+ */
+static inline void
+libpqsrv_disconnect(PGconn *conn)
+{
+	/*
+	 * If no connection was established, we haven't reserved an FD for it (or
+	 * already released it). This rule makes it easier to write PG_CATCH()
+	 * handlers for this facility's users.
+	 *
+	 * See also libpqsrv_connect_internal().
+	 */
+	if (conn == NULL)
+		return;
+
+	ReleaseExternalFD();
+	PQfinish(conn);
+}
+
+
+/* internal helper functions follow */
+
+
+/*
+ * Helper function for all connection establishment functions.
+ */
+static inline void
+libpqsrv_connect_prepare(void)
+{
+	/*
+	 * We must obey fd.c's limit on non-virtual file descriptors.  Assume that
+	 * a PGconn represents one long-lived FD.  (Doing this here also ensures
+	 * that VFDs are closed if needed to make room.)
+	 */
+	if (!AcquireExternalFD())
+	{
+#ifndef WIN32					/* can't write #if within ereport() macro */
+		ereport(ERROR,
+				(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+				 errmsg("could not establish connection"),
+				 errdetail("There are too many open files on the local server."),
+				 errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")));
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+				 errmsg("could not establish connection"),
+				 errdetail("There are too many open files on the local server."),
+				 errhint("Raise the server's max_files_per_process setting.")));
+#endif
+	}
+}
+
+/*
+ * Helper function for all connection establishment functions.
+ */
+static inline void
+libpqsrv_connect_internal(PGconn *conn, uint32 wait_event_info)
+{
+	/*
+	 * With conn == NULL libpqsrv_disconnect() wouldn't release the FD. So do
+	 * that here.
+	 */
+	if (conn == NULL)
+	{
+		ReleaseExternalFD();
+		return;
+	}
+
+	/*
+	 * Can't wait without a socket. Note that we don't want to close the libpq
+	 * connection yet, so callers can emit a useful error.
+	 */
+	if (PQstatus(conn) == CONNECTION_BAD)
+		return;
+
+	/*
+	 * WaitLatchOrSocket() can conceivably fail, handle that case here instead
+	 * of requiring all callers to do so.
+	 */
+	PG_TRY();
+	{
+		PostgresPollingStatusType status;
+
+		/*
+		 * Poll connection until we have OK or FAILED status.
+		 *
+		 * Per spec for PQconnectPoll, first wait till socket is write-ready.
+		 */
+		status = PGRES_POLLING_WRITING;
+		while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED)
+		{
+			int			io_flag;
+			int			rc;
+
+			if (status == PGRES_POLLING_READING)
+				io_flag = WL_SOCKET_READABLE;
+#ifdef WIN32
+
+			/*
+			 * Windows needs a different test while waiting for
+			 * connection-made
+			 */
+			else if (PQstatus(conn) == CONNECTION_STARTED)
+				io_flag = WL_SOCKET_CONNECTED;
+#endif
+			else
+				io_flag = WL_SOCKET_WRITEABLE;
+
+			rc = WaitLatchOrSocket(MyLatch,
+								   WL_EXIT_ON_PM_DEATH | WL_LATCH_SET | io_flag,
+								   PQsocket(conn),
+								   0,
+								   wait_event_info);
+
+			/* Interrupted? */
+			if (rc & WL_LATCH_SET)
+			{
+				ResetLatch(MyLatch);
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/* If socket is ready, advance the libpq state machine */
+			if (rc & io_flag)
+				status = PQconnectPoll(conn);
+		}
+	}
+	PG_CATCH();
+	{
+		/*
+		 * If an error is thrown here, the callers won't call
+		 * libpqsrv_disconnect() with a conn, so release resources
+		 * immediately.
+		 */
+		ReleaseExternalFD();
+		PQfinish(conn);
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+#endif							/* LIBPQ_BE_FE_HELPERS_H */
-- 
2.38.0

v3-0003-dblink-postgres_fdw-Handle-interrupts-during-conn.patchtext/x-diff; charset=us-asciiDownload
From 149399a0de2f388451cfeea22d37842a1f7494ec Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 17:33:21 -0800
Subject: [PATCH v3 3/4] dblink, postgres_fdw: Handle interrupts during
 connection establishment

Until now dblink and postgres_fdw did not process interrupts during connection
establishment. Besides preventing query cancellations etc, this can lead to
undetected deadlocks, as global barriers are not processed.

These aforementioned undetected deadlocks are the reason for the spate of CI
test failures in the FreeBSD 'test_running' step.

Fix the bug by using the helper from libpq-be-fe-helpers.h, introduced in a
prior commit. Besides fixing the bug, this also removes duplicated code
around reserving file descriptors.

As the change is relatively large and there are no field reports of the
problem, don't backpatch for now.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20220925232237.p6uskba2dw6fnwj2@awork3.anarazel.de
Backpatch:
---
 contrib/dblink/dblink.c           | 79 +++++--------------------------
 contrib/postgres_fdw/connection.c | 42 +++-------------
 2 files changed, 17 insertions(+), 104 deletions(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 8dd122042b4..8982d623d3b 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -48,6 +48,7 @@
 #include "funcapi.h"
 #include "lib/stringinfo.h"
 #include "libpq-fe.h"
+#include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "parser/scansup.h"
@@ -199,37 +200,14 @@ dblink_get_conn(char *conname_or_str,
 			connstr = conname_or_str;
 		dblink_connstr_check(connstr);
 
-		/*
-		 * We must obey fd.c's limit on non-virtual file descriptors.  Assume
-		 * that a PGconn represents one long-lived FD.  (Doing this here also
-		 * ensures that VFDs are closed if needed to make room.)
-		 */
-		if (!AcquireExternalFD())
-		{
-#ifndef WIN32					/* can't write #if within ereport() macro */
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not establish connection"),
-					 errdetail("There are too many open files on the local server."),
-					 errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")));
-#else
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not establish connection"),
-					 errdetail("There are too many open files on the local server."),
-					 errhint("Raise the server's max_files_per_process setting.")));
-#endif
-		}
-
 		/* OK to make connection */
-		conn = PQconnectdb(connstr);
+		conn = libpqsrv_connect(connstr, PG_WAIT_EXTENSION);
 
 		if (PQstatus(conn) == CONNECTION_BAD)
 		{
 			char	   *msg = pchomp(PQerrorMessage(conn));
 
-			PQfinish(conn);
-			ReleaseExternalFD();
+			libpqsrv_disconnect(conn);
 			ereport(ERROR,
 					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
 					 errmsg("could not establish connection"),
@@ -312,36 +290,13 @@ dblink_connect(PG_FUNCTION_ARGS)
 	/* check password in connection string if not superuser */
 	dblink_connstr_check(connstr);
 
-	/*
-	 * We must obey fd.c's limit on non-virtual file descriptors.  Assume that
-	 * a PGconn represents one long-lived FD.  (Doing this here also ensures
-	 * that VFDs are closed if needed to make room.)
-	 */
-	if (!AcquireExternalFD())
-	{
-#ifndef WIN32					/* can't write #if within ereport() macro */
-		ereport(ERROR,
-				(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				 errmsg("could not establish connection"),
-				 errdetail("There are too many open files on the local server."),
-				 errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")));
-#else
-		ereport(ERROR,
-				(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				 errmsg("could not establish connection"),
-				 errdetail("There are too many open files on the local server."),
-				 errhint("Raise the server's max_files_per_process setting.")));
-#endif
-	}
-
 	/* OK to make connection */
-	conn = PQconnectdb(connstr);
+	conn = libpqsrv_connect(connstr, PG_WAIT_EXTENSION);
 
 	if (PQstatus(conn) == CONNECTION_BAD)
 	{
 		msg = pchomp(PQerrorMessage(conn));
-		PQfinish(conn);
-		ReleaseExternalFD();
+		libpqsrv_disconnect(conn);
 		if (rconn)
 			pfree(rconn);
 
@@ -366,10 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS)
 	else
 	{
 		if (pconn->conn)
-		{
-			PQfinish(pconn->conn);
-			ReleaseExternalFD();
-		}
+			libpqsrv_disconnect(conn);
 		pconn->conn = conn;
 	}
 
@@ -402,8 +354,7 @@ dblink_disconnect(PG_FUNCTION_ARGS)
 	if (!conn)
 		dblink_conn_not_avail(conname);
 
-	PQfinish(conn);
-	ReleaseExternalFD();
+	libpqsrv_disconnect(conn);
 	if (rconn)
 	{
 		deleteConnection(conname);
@@ -838,10 +789,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 	{
 		/* if needed, close the connection to the database */
 		if (freeconn)
-		{
-			PQfinish(conn);
-			ReleaseExternalFD();
-		}
+			libpqsrv_disconnect(conn);
 	}
 	PG_END_TRY();
 
@@ -1516,10 +1464,7 @@ dblink_exec(PG_FUNCTION_ARGS)
 	{
 		/* if needed, close the connection to the database */
 		if (freeconn)
-		{
-			PQfinish(conn);
-			ReleaseExternalFD();
-		}
+			libpqsrv_disconnect(conn);
 	}
 	PG_END_TRY();
 
@@ -2606,8 +2551,7 @@ createNewConnection(const char *name, remoteConn *rconn)
 
 	if (found)
 	{
-		PQfinish(rconn->conn);
-		ReleaseExternalFD();
+		libpqsrv_disconnect(rconn->conn);
 		pfree(rconn);
 
 		ereport(ERROR,
@@ -2647,8 +2591,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn)
 	{
 		if (!PQconnectionUsedPassword(conn))
 		{
-			PQfinish(conn);
-			ReleaseExternalFD();
+			libpqsrv_disconnect(conn);
 			if (rconn)
 				pfree(rconn);
 
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ed75ce3f79c..7760380f00d 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -17,6 +17,7 @@
 #include "catalog/pg_user_mapping.h"
 #include "commands/defrem.h"
 #include "funcapi.h"
+#include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -446,35 +447,10 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		/* verify the set of connection parameters */
 		check_conn_params(keywords, values, user);
 
-		/*
-		 * We must obey fd.c's limit on non-virtual file descriptors.  Assume
-		 * that a PGconn represents one long-lived FD.  (Doing this here also
-		 * ensures that VFDs are closed if needed to make room.)
-		 */
-		if (!AcquireExternalFD())
-		{
-#ifndef WIN32					/* can't write #if within ereport() macro */
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not connect to server \"%s\"",
-							server->servername),
-					 errdetail("There are too many open files on the local server."),
-					 errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")));
-#else
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not connect to server \"%s\"",
-							server->servername),
-					 errdetail("There are too many open files on the local server."),
-					 errhint("Raise the server's max_files_per_process setting.")));
-#endif
-		}
-
 		/* OK to make connection */
-		conn = PQconnectdbParams(keywords, values, false);
-
-		if (!conn)
-			ReleaseExternalFD();	/* because the PG_CATCH block won't */
+		conn = libpqsrv_connect_params(keywords, values,
+									    /* expand_dbname = */ false,
+									   PG_WAIT_EXTENSION);
 
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 			ereport(ERROR,
@@ -507,12 +483,7 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 	}
 	PG_CATCH();
 	{
-		/* Release PGconn data structure if we managed to create one */
-		if (conn)
-		{
-			PQfinish(conn);
-			ReleaseExternalFD();
-		}
+		libpqsrv_disconnect(conn);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
@@ -528,9 +499,8 @@ disconnect_pg_server(ConnCacheEntry *entry)
 {
 	if (entry->conn != NULL)
 	{
-		PQfinish(entry->conn);
+		libpqsrv_disconnect(entry->conn);
 		entry->conn = NULL;
-		ReleaseExternalFD();
 	}
 }
 
-- 
2.38.0

v3-0004-libpqwalreceiver-Convert-to-libpq-be-fe-helpers.h.patchtext/x-diff; charset=us-asciiDownload
From 70db02b9f7525f290f583351b240ef9c1bf6dd5d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 18:37:18 -0800
Subject: [PATCH v3 4/4] libpqwalreceiver: Convert to libpq-be-fe-helpers.h

In contrast to the changes to dblink and postgres_fdw, this does not fix a
bug, as libpqwalreceiver did already process interrupts.

Besides reducing code duplication, the conversion leads to libpqwalreceiver
now using reserving file descriptors for libpq connections. While not strictly
required for the use in walreceiver, we are also using libpqwalreceiver for
logical replication, where it does seem more important.

Even if we eventually decide to backpatch the prior commits, there'd be no
need to backpatch this commit, due to not fixing an active bug.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20220925232237.p6uskba2dw6fnwj2@awork3.anarazel.de
---
 .../libpqwalreceiver/libpqwalreceiver.c       | 53 +++----------------
 1 file changed, 7 insertions(+), 46 deletions(-)

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index fefc8660259..560ec974fa7 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -24,6 +24,7 @@
 #include "common/connect.h"
 #include "funcapi.h"
 #include "libpq-fe.h"
+#include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -125,7 +126,6 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 				 char **err)
 {
 	WalReceiverConn *conn;
-	PostgresPollingStatusType status;
 	const char *keys[6];
 	const char *vals[6];
 	int			i = 0;
@@ -172,49 +172,10 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	Assert(i < sizeof(keys));
 
 	conn = palloc0(sizeof(WalReceiverConn));
-	conn->streamConn = PQconnectStartParams(keys, vals,
-											 /* expand_dbname = */ true);
-	if (PQstatus(conn->streamConn) == CONNECTION_BAD)
-		goto bad_connection_errmsg;
-
-	/*
-	 * Poll connection until we have OK or FAILED status.
-	 *
-	 * Per spec for PQconnectPoll, first wait till socket is write-ready.
-	 */
-	status = PGRES_POLLING_WRITING;
-	do
-	{
-		int			io_flag;
-		int			rc;
-
-		if (status == PGRES_POLLING_READING)
-			io_flag = WL_SOCKET_READABLE;
-#ifdef WIN32
-		/* Windows needs a different test while waiting for connection-made */
-		else if (PQstatus(conn->streamConn) == CONNECTION_STARTED)
-			io_flag = WL_SOCKET_CONNECTED;
-#endif
-		else
-			io_flag = WL_SOCKET_WRITEABLE;
-
-		rc = WaitLatchOrSocket(MyLatch,
-							   WL_EXIT_ON_PM_DEATH | WL_LATCH_SET | io_flag,
-							   PQsocket(conn->streamConn),
-							   0,
-							   WAIT_EVENT_LIBPQWALRECEIVER_CONNECT);
-
-		/* Interrupted? */
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			ProcessWalRcvInterrupts();
-		}
-
-		/* If socket is ready, advance the libpq state machine */
-		if (rc & io_flag)
-			status = PQconnectPoll(conn->streamConn);
-	} while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED);
+	conn->streamConn =
+		libpqsrv_connect_params(keys, vals,
+								 /* expand_dbname = */ true,
+								WAIT_EVENT_LIBPQWALRECEIVER_CONNECT);
 
 	if (PQstatus(conn->streamConn) != CONNECTION_OK)
 		goto bad_connection_errmsg;
@@ -245,7 +206,7 @@ bad_connection_errmsg:
 
 	/* error path, error already set */
 bad_connection:
-	PQfinish(conn->streamConn);
+	libpqsrv_disconnect(conn->streamConn);
 	pfree(conn);
 	return NULL;
 }
@@ -744,7 +705,7 @@ libpqrcv_PQgetResult(PGconn *streamConn)
 static void
 libpqrcv_disconnect(WalReceiverConn *conn)
 {
-	PQfinish(conn->streamConn);
+	libpqsrv_disconnect(conn->streamConn);
 	PQfreemem(conn->recvBuf);
 	pfree(conn);
 }
-- 
2.38.0

#13Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#12)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

On 2023-01-20 19:00:08 -0800, Andres Freund wrote:

Updated patch attached. I split it into multiple pieces.
1) A fix for [1], included here because I encountered it while testing
2) Introduction of libpq-be-fe-helpers.h
3) Convert dblink and postgres_fdw to the helper
4) Convert libpqwalreceiver.c to the helper

Even if we eventually decide to backpatch 3), we'd likely not backpatch 4), as
there's no bug (although perhaps the lack of FD handling could be called a
bug?).

There's also some light polishing, improving commit message, comments and
moving some internal helper functions to later in the file.

After a tiny bit further polishing, and after separately pushing a resource
leak fix for walrcv_connect(), I pushed this.

Greetings,

Andres Freund

#14Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#13)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

On Mon, Jan 23, 2023 at 07:28:06PM -0800, Andres Freund wrote:

After a tiny bit further polishing, and after separately pushing a resource
leak fix for walrcv_connect(), I pushed this.

My colleague Robins Tharakan (CC'd) noticed crashes when testing recent
commits, and he traced it back to e460248. From this information, I
discovered the following typo:

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 8982d623d3..78a8bcee6e 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS)
     else
     {
         if (pconn->conn)
-            libpqsrv_disconnect(conn);
+            libpqsrv_disconnect(pconn->conn);
         pconn->conn = conn;
     }

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#15Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#14)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

On 2023-01-30 11:30:08 -0800, Nathan Bossart wrote:

On Mon, Jan 23, 2023 at 07:28:06PM -0800, Andres Freund wrote:

After a tiny bit further polishing, and after separately pushing a resource
leak fix for walrcv_connect(), I pushed this.

My colleague Robins Tharakan (CC'd) noticed crashes when testing recent
commits, and he traced it back to e460248. From this information, I
discovered the following typo:

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 8982d623d3..78a8bcee6e 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS)
else
{
if (pconn->conn)
-            libpqsrv_disconnect(conn);
+            libpqsrv_disconnect(pconn->conn);
pconn->conn = conn;
}

Ugh. Good catch.

Why don't the dblink tests catch this? Any chance you or Robins could prepare
a patch with fix and test, given that you know how to trigger this?

Greetings,

Andres Freund

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#15)
1 attachment(s)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote:

Why don't the dblink tests catch this? Any chance you or Robins could prepare
a patch with fix and test, given that you know how to trigger this?

It's trivially reproducible by calling 1-argument dblink_connect() multiple
times and then calling dblink_disconnect(). Here's a patch.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

fix_dblink_typo.patchtext/x-diff; charset=us-asciiDownload
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 8982d623d3..78a8bcee6e 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS)
 	else
 	{
 		if (pconn->conn)
-			libpqsrv_disconnect(conn);
+			libpqsrv_disconnect(pconn->conn);
 		pconn->conn = conn;
 	}
 
diff --git a/contrib/dblink/expected/dblink.out b/contrib/dblink/expected/dblink.out
index 14d015e4d5..0f5050b409 100644
--- a/contrib/dblink/expected/dblink.out
+++ b/contrib/dblink/expected/dblink.out
@@ -938,6 +938,25 @@ DROP SERVER fdtest;
 ALTER FOREIGN DATA WRAPPER dblink_fdw OPTIONS (nonexistent 'fdw');
 ERROR:  invalid option "nonexistent"
 HINT:  There are no valid options in this context.
+-- test repeated calls to dblink_connect
+SELECT dblink_connect(connection_parameters());
+ dblink_connect 
+----------------
+ OK
+(1 row)
+
+SELECT dblink_connect(connection_parameters());
+ dblink_connect 
+----------------
+ OK
+(1 row)
+
+SELECT dblink_disconnect();
+ dblink_disconnect 
+-------------------
+ OK
+(1 row)
+
 -- test asynchronous notifications
 SELECT dblink_connect(connection_parameters());
  dblink_connect 
diff --git a/contrib/dblink/sql/dblink.sql b/contrib/dblink/sql/dblink.sql
index e560260bfc..7870ce5d5a 100644
--- a/contrib/dblink/sql/dblink.sql
+++ b/contrib/dblink/sql/dblink.sql
@@ -489,6 +489,11 @@ DROP SERVER fdtest;
 -- should fail
 ALTER FOREIGN DATA WRAPPER dblink_fdw OPTIONS (nonexistent 'fdw');
 
+-- test repeated calls to dblink_connect
+SELECT dblink_connect(connection_parameters());
+SELECT dblink_connect(connection_parameters());
+SELECT dblink_disconnect();
+
 -- test asynchronous notifications
 SELECT dblink_connect(connection_parameters());
 
#17Robins Tharakan
tharakan@gmail.com
In reply to: Nathan Bossart (#16)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi Andres,

On Tue, 31 Jan 2023 at 06:31, Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote:

Why don't the dblink tests catch this? Any chance you or Robins could prepare
a patch with fix and test, given that you know how to trigger this?

It's trivially reproducible by calling 1-argument dblink_connect() multiple
times and then calling dblink_disconnect(). Here's a patch.

My test instance has been running Nathan's patch for the past
few hours, and it looks like this should resolve the issue.

A little bit of background, since the past few days I was
noticing frequent crashes on a test instance. They're not simple
to reproduce manually, but if left running I can reliably see
crashes on an ~hourly basis.

In trying to trace the origin, I ran a multiple-hour test for
each commit going back a few commits and noticed that the
crashes stopped prior to commit e4602483e9, which is when the
issue became visible.

The point is now moot, but let me know if full backtraces still
help (I was concerned looking at the excerpts from some
of the crashes):

"double free or corruption (out)"
"munmap_chunk(): invalid pointer"
"free(): invalid pointer"
str->data[0] = '\0';

Some backtraces
###############

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 6c6d6ba3ee@master@sqith: u21 postgres
127.0.0.1(45334) SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102
#0 __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102
#1 0x00007fc0062dfefd in pqDropConnection (conn=0x564bb3e1a080,
flushInput=true) at fe-connect.c:495
#2 0x00007fc0062e5cb3 in closePGconn (conn=0x564bb3e1a080)
at fe-connect.c:4112
#3 0x00007fc0062e5d55 in PQfinish (conn=0x564bb3e1a080) at fe-connect.c:4134
#4 0x00007fc0061a442b in libpqsrv_disconnect (conn=0x564bb3e1a080)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#5 0x00007fc0061a4df1 in dblink_disconnect (fcinfo=0x564bb3d980f0)
at dblink.c:357
#6 0x0000564bb0e70aa7 in ExecInterpExpr (state=0x564bb3d98018,
econtext=0x564bb3d979a0, isnull=0x7ffd60824b0f) at execExprInterp.c:728
#7 0x0000564bb0e72f36 in ExecInterpExprStillValid (state=0x564bb3d98018,
econtext=0x564bb3d979a0, isNull=0x7ffd60824b0f) at execExprInterp.c:1838

============

Core was generated by `postgres: 6c6d6ba3ee@master@sqith: u52 postgres
127.0.0.1(58778) SELECT '.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fc021792859 in __GI_abort () at abort.c:79
#2 0x00007fc0217fd26e in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7fc021927298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007fc0218052fc in malloc_printerr (
str=str@entry=0x7fc021929670 "double free or corruption (out)")
at malloc.c:5347
#4 0x00007fc021806fa0 in _int_free (av=0x7fc02195cb80 <main_arena>,
p=0x7fc02195cbf0 <main_arena+112>, have_lock=<optimized out>)
at malloc.c:4314
#5 0x00007fc0062e16ed in freePGconn (conn=0x564bb6e36b80)
at fe-connect.c:3977
#6 0x00007fc0062e1d61 in PQfinish (conn=0x564bb6e36b80) at fe-connect.c:4135
#7 0x00007fc0062a142b in libpqsrv_disconnect (conn=0x564bb6e36b80)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#8 0x00007fc0062a1df1 in dblink_disconnect (fcinfo=0x564bb5647b58)
at dblink.c:357

============

Program terminated with signal SIGSEGV, Segmentation fault.
#0 resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153
153 str->data[0] = '\0';
#0 resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153
#1 0x00007f2240b0a876 in PQexecStart (conn=0x559d3af0e410) at fe-exec.c:2320
#2 0x00007f2240b0a688 in PQexec (conn=0x559d3af0e410,
query=0x559d56fb8ee8 "p3") at fe-exec.c:2227
#3 0x00007f223ba8c7e4 in dblink_exec (fcinfo=0x559d3b101f58) at dblink.c:1432
#4 0x0000559d2f003c82 in ExecInterpExpr (state=0x559d3b101ac0,
econtext=0x559d34e76578, isnull=0x7ffe3d590fdf) at execExprInterp.c:752

============

Core was generated by `postgres: 728f86fec6@(HEAD detached at
728f86fec6)@sqith: u19 postgres 127.0.0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f96f5fc6e99 in SSL_shutdown ()
from /lib/x86_64-linux-gnu/libssl.so.1.1
#0 0x00007f96f5fc6e99 in SSL_shutdown ()
from /lib/x86_64-linux-gnu/libssl.so.1.1
#1 0x00007f96da56027a in pgtls_close (conn=0x55d919752fb0)
at fe-secure-openssl.c:1555
#2 0x00007f96da558e41 in pqsecure_close (conn=0x55d919752fb0)
at fe-secure.c:192
#3 0x00007f96da53dd12 in pqDropConnection (conn=0x55d919752fb0,
flushInput=true) at fe-connect.c:449
#4 0x00007f96da543cb3 in closePGconn (conn=0x55d919752fb0)
at fe-connect.c:4112
#5 0x00007f96da543d55 in PQfinish (conn=0x55d919752fb0) at fe-connect.c:4134
#6 0x00007f96d9ebd42b in libpqsrv_disconnect (conn=0x55d919752fb0)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7 0x00007f96d9ebddf1 in dblink_disconnect (fcinfo=0x55d91f2692a8)
at dblink.c:357

============

Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f5f6b632859 in __GI_abort () at abort.c:79
#2 0x00007f5f6b69d26e in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007f5f6b6a52fc in malloc_printerr (
str=str@entry=0x7f5f6b7c91e0 "munmap_chunk(): invalid pointer")
at malloc.c:5347
#4 0x00007f5f6b6a554c in munmap_chunk (p=<optimized out>) at malloc.c:2830
#5 0x00007f5f50085efd in pqDropConnection (conn=0x55d12ebcd100,
flushInput=true) at fe-connect.c:495
#6 0x00007f5f5008bcb3 in closePGconn (conn=0x55d12ebcd100)
at fe-connect.c:4112
#7 0x00007f5f5008bd55 in PQfinish (conn=0x55d12ebcd100) at fe-connect.c:4134
#8 0x00007f5f5006c42b in libpqsrv_disconnect (conn=0x55d12ebcd100)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117

============

Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f5f6b632859 in __GI_abort () at abort.c:79
#2 0x00007f5f6b69d26e in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007f5f6b6a52fc in malloc_printerr (
str=str@entry=0x7f5f6b7c54c1 "free(): invalid pointer") at malloc.c:5347
#4 0x00007f5f6b6a6b2c in _int_free (av=<optimized out>, p=<optimized out>,
have_lock=0) at malloc.c:4173
#5 0x00007f5f500fe6ed in freePGconn (conn=0x55d142273000)
at fe-connect.c:3977
#6 0x00007f5f500fed61 in PQfinish (conn=0x55d142273000) at fe-connect.c:4135
#7 0x00007f5f501de42b in libpqsrv_disconnect (conn=0x55d142273000)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#8 0x00007f5f501dedf1 in dblink_disconnect (fcinfo=0x55d1527998f8)
at dblink.c:357

============

Core was generated by `postgres: e4602483e9@(HEAD detached at
e4602483e9)@sqith: u73 postgres 127.0.0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335)
at malloc.c:3154
#0 __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335)
at malloc.c:3154
#1 0x00007fb7bc0a580a in pqCheckOutBufferSpace (bytes_needed=2139062148,
conn=0x55b191aa9380) at fe-misc.c:329
#2 0x00007fb7bc0a5b1c in pqPutMsgStart (msg_type=88 'X', conn=0x55b191aa9380)
at fe-misc.c:476
#3 0x00007fb7bc097c60 in sendTerminateConn (conn=0x55b191aa9380)
at fe-connect.c:4076
#4 0x00007fb7bc097c97 in closePGconn (conn=0x55b191aa9380)
at fe-connect.c:4096
#5 0x00007fb7bc097d55 in PQfinish (conn=0x55b191aa9380) at fe-connect.c:4134
#6 0x00007fb7bc14a42b in libpqsrv_disconnect (conn=0x55b191aa9380)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7 0x00007fb7bc14adf1 in dblink_disconnect (fcinfo=0x55b193894f00)
at dblink.c:357

============

Thanks to SQLSmith for helping with this find.

-
Robins Tharakan
Amazon Web Services

#18Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#16)
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Hi,

On 2023-01-30 12:00:55 -0800, Nathan Bossart wrote:

On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote:

Why don't the dblink tests catch this? Any chance you or Robins could prepare
a patch with fix and test, given that you know how to trigger this?

It's trivially reproducible by calling 1-argument dblink_connect() multiple
times and then calling dblink_disconnect(). Here's a patch.

Thanks for the quick patch and for the find. Pushed.

Greetings,

Andres