PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

Started by Daniel Schreiber2 months ago8 messageshackersbugs

daniel.schreiber@hrz.tu-chemnitz.de

2 months ago

bugshackers

Dear PostgreSQL developers,

my colleagues and I probably found a bug in libpq when libpq is dlopened
and closed multiple times during the lifetime of a process. In our setup
we use a PAM module which links to libpq. The process using PAM is
linked against openssl, so openssl is loaded during the complete
lifetime of the process whereas libpq is loaded only during PAM
authentication (and unloaded when PAM has finished).

We observed the bug on a Debian 13 system using libpq from Debian. To
reproduce the bug, compile the attached c file using the following gcc
command line:

gcc libpq1-dlopen.c -Wall -Wextra -o libpq1-dlopen -ldl -lssl -lcrypto

Then run the binary with a postgresql connection string as an argument.
The connection string has to include 'sslmode=require'. The program will
in a loop try to dlopen libpq, then connect to the server, finish the
connection and unload libpq.

According to our findings every time a connection is established after
dlopening libpq one of the 127 available BIO_METHOD structures in
OpenSSL is consumed:
https://github.com/postgres/postgres/blob/REL_17_9/src/interfaces/libpq/fe-secure-openssl.c#L1987

So after 127 cycles registering the callbacks fails and in our use case
the application is no longer able to authenticate using PAM. As a
workaround we LD_PRELOAD libpq in the application.

I am not subscribed yet to the mailing list, so please CC me.

Thank you,

Daniel
--
Daniel Schreiber
Facharbeitsgruppe Systemsoftware
Universitaetsrechenzentrum

Technische Universität Chemnitz
Straße der Nationen 62 (Raum B303)
09111 Chemnitz
Germany

Tel: +49 371 531 35444

Jacob Champion

jacob.champion@enterprisedb.com

2 months ago

In reply to: Daniel Schreiber (#1)

bugshackers

Re: PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

On Fri, Apr 17, 2026 at 7:33 AM Daniel Schreiber
<daniel.schreiber@hrz.tu-chemnitz.de> wrote:

my colleagues and I probably found a bug in libpq when libpq is dlopened
and closed multiple times during the lifetime of a process. In our setup
we use a PAM module which links to libpq. The process using PAM is
linked against openssl, so openssl is loaded during the complete
lifetime of the process whereas libpq is loaded only during PAM
authentication (and unloaded when PAM has finished).

[snip]

According to our findings every time a connection is established after
dlopening libpq one of the 127 available BIO_METHOD structures in
OpenSSL is consumed:
https://github.com/postgres/postgres/blob/REL_17_9/src/interfaces/libpq/fe-secure-openssl.c#L1987

Right. I think in this *particular* case, we should simply skip the
call to BIO_get_new_index(). We don't need it, IIUC.

But I think we may also need to set expectations on whether or not
infinite dlopen/dlclose loops are supported in general. If we ever
come across a situation in which a call to BIO_get_new_index() is
necessary, that leak just fundamentally can't be plugged. The same is
true for any third-party libraries (or their dependencies, or
theirs...) that require "one-time", irreversible calls which can't be
tracked after we're unloaded. And we can't push these concerns up to
the top level application developer, because they don't know we exist.

(I'd be surprised if this were the only such resource leak across all
supported versions and combinations of Kerberos, OpenSSL, OpenLDAP,
Curl, etc. etc. From a quick search, you're the first to report this
in the ten years since the leak was introduced, so there may be more
dragons where you're headed.)

--Jacob

Jacob Champion

jacob.champion@enterprisedb.com

2 months ago

In reply to: Jacob Champion (#2)

hackersbugs

Re: PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

[moving to -hackers]

On Fri, Apr 17, 2026 at 12:14 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

On Fri, Apr 17, 2026 at 7:33 AM Daniel Schreiber
<daniel.schreiber@hrz.tu-chemnitz.de> wrote:

my colleagues and I probably found a bug in libpq when libpq is dlopened
and closed multiple times during the lifetime of a process. In our setup
we use a PAM module which links to libpq. The process using PAM is
linked against openssl, so openssl is loaded during the complete
lifetime of the process whereas libpq is loaded only during PAM
authentication (and unloaded when PAM has finished).

[snip]

According to our findings every time a connection is established after
dlopening libpq one of the 127 available BIO_METHOD structures in
OpenSSL is consumed:
https://github.com/postgres/postgres/blob/REL_17_9/src/interfaces/libpq/fe-secure-openssl.c#L1987

Right. I think in this *particular* case, we should simply skip the
call to BIO_get_new_index(). We don't need it, IIUC.

Attached is a proposal to do that.

But I think we may also need to set expectations on whether or not
infinite dlopen/dlclose loops are supported in general. If we ever
come across a situation in which a call to BIO_get_new_index() is
necessary, that leak just fundamentally can't be plugged. The same is
true for any third-party libraries (or their dependencies, or
theirs...) that require "one-time", irreversible calls which can't be
tracked after we're unloaded. And we can't push these concerns up to
the top level application developer, because they don't know we exist.

(I'd be surprised if this were the only such resource leak across all
supported versions and combinations of Kerberos, OpenSSL, OpenLDAP,
Curl, etc. etc. From a quick search, you're the first to report this
in the ten years since the leak was introduced, so there may be more
dragons where you're headed.)

If anyone has thoughts on that, I'd love to hear them. I don't mind
removing this unnecessary code in HEAD, or even backpatching as a
courtesy -- but if it were up to me, I would not guarantee zero global
resource leaks across libpq and its entire dependency graph. (Even if
we magically had control over all those dependencies, I think it'd
still be reasonable for libpq devs to use "allocate once and move on"
patterns... and I want to continue using those in my new code.)

Thanks,
--Jacob

Nico Williams

nico@cryptonector.com

2 months ago

In reply to: Jacob Champion (#3)

hackersbugs

Re: PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

On Wed, Apr 22, 2026 at 11:29:04AM -0700, Jacob Champion wrote:

(I'd be surprised if this were the only such resource leak across all
supported versions and combinations of Kerberos, OpenSSL, OpenLDAP,
Curl, etc. etc. From a quick search, you're the first to report this
in the ten years since the leak was introduced, so there may be more
dragons where you're headed.)

If anyone has thoughts on that, I'd love to hear them. I don't mind
removing this unnecessary code in HEAD, or even backpatching as a
courtesy -- but if it were up to me, I would not guarantee zero global
resource leaks across libpq and its entire dependency graph. (Even if
we magically had control over all those dependencies, I think it'd
still be reasonable for libpq devs to use "allocate once and move on"
patterns... and I want to continue using those in my new code.)

Leaking a dl handle is a way to prevent unloading. Not saying that's a
great answer, just that it's a workaround.

Tom Lane

tgl@sss.pgh.pa.us

2 months ago

In reply to: Jacob Champion (#3)

hackersbugs

Re: PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

Jacob Champion <jacob.champion@enterprisedb.com> writes:

If anyone has thoughts on that, I'd love to hear them. I don't mind
removing this unnecessary code in HEAD, or even backpatching as a
courtesy -- but if it were up to me, I would not guarantee zero global
resource leaks across libpq and its entire dependency graph.

I agree that we have no real ability to guarantee that.
Still, as far as the presented patch goes, it seems like a clear
win so I'd vote for fix-and-backpatch.

Should we write the arguments as BIO_TYPE_NONE | BIO_TYPE_SOURCE_SINK
rather than just BIO_TYPE_SOURCE_SINK?

regards, tom lane

Jacob Champion

jacob.champion@enterprisedb.com

2 months ago

In reply to: Nico Williams (#4)

hackersbugs

Re: PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

On Wed, Apr 22, 2026 at 12:22 PM Nico Williams <nico@cryptonector.com> wrote:

Leaking a dl handle is a way to prevent unloading. Not saying that's a
great answer, just that it's a workaround.

Hmm, I did that for our handle to libpq-oauth, but I imagine that
leaking a handle to _ourselves_ may make someone very unhappy with us
at some point? Plus, it might kick off the tiniest, most pointless
arms race:

// why does libpq do this
dlclose(libpq_handle);
dlclose(libpq_handle);

I guess we could play around with RTLD_NODELETE... Something to keep
in the back pocket, maybe?

--Jacob

Jacob Champion

jacob.champion@enterprisedb.com

2 months ago

In reply to: Tom Lane (#5)

hackersbugs

Re: PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

On Wed, Apr 22, 2026 at 12:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I agree that we have no real ability to guarantee that.
Still, as far as the presented patch goes, it seems like a clear
win so I'd vote for fix-and-backpatch.

Sounds good to me.

Should we write the arguments as BIO_TYPE_NONE | BIO_TYPE_SOURCE_SINK
rather than just BIO_TYPE_SOURCE_SINK?

Good question... Popularity-wise, the shorter spelling shows up across
quite a few projects on GitHub, but the only spelling of
`BIO_meth_new(BIO_TYPE_NONE | ...)` that I can find is a single place
inside OpenSSL's own test suite -- which also uses the shorter
alternative, in two places. So my vote is BIO_TYPE_SOURCE_SINK; we'll
be in good company.

Thanks,
--Jacob

Tom Lane

tgl@sss.pgh.pa.us

2 months ago

In reply to: Jacob Champion (#7)

hackersbugs

Re: PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

Jacob Champion <jacob.champion@enterprisedb.com> writes:

On Wed, Apr 22, 2026 at 12:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Should we write the arguments as BIO_TYPE_NONE | BIO_TYPE_SOURCE_SINK
rather than just BIO_TYPE_SOURCE_SINK?

Good question... Popularity-wise, the shorter spelling shows up across
quite a few projects on GitHub, but the only spelling of
`BIO_meth_new(BIO_TYPE_NONE | ...)` that I can find is a single place
inside OpenSSL's own test suite -- which also uses the shorter
alternative, in two places. So my vote is BIO_TYPE_SOURCE_SINK; we'll
be in good company.

Fair enough.

regards, tom lane

PostgreSQL 17: Bug in libpq when libpq is dlopened/closed multiple times

Attachments:

Attachments: