libpq: Process buffered SSL read bytes to support records >8kB on async API
Dear hackers,
I'm the maintainer of ruby-pg, the Ruby interface to the PostgreSQL
database. This binding uses the asynchronous API of libpq by default to
facilitate Ruby's IO wait and scheduling mechanisms.
This works well with the vanilla PostgreSQL server, but it leads to
starvation with other types of servers using the PostgreSQL wire
protocol 3. This is because the current functioning of the libpq async
interface depends on a maximum SSL record size of 8kB.
The following servers were reported to starve with ruby-pg:
* AWS RDS Aurora Serverless [1]
* YugabyteDb [2]
* CockroachDB [3]
They block indefinitely on certain message sizes sent from the backend to
the libpq frontend. It is best described in [4]. A repro Docker
composition is provided by YugabyteDB at [2].
To fix this issue, the attached patch calls pqReadData() repeatedly in
PQconsumeInput() until there is no buffered SSL data left to be read.
Another solution could be to process buffered SSL read bytes in
PQisBusy() instead of PQconsumeInput().
The synchronous libpq API isn't affected, since it already supports
arbitrary SSL record sizes. That's why I think the asynchronous API
should also support bigger SSL record sizes.
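For illustration, a minimal async consumer of this kind looks roughly like
the following sketch (not the actual ruby-pg code; error handling trimmed):

#include <sys/select.h>
#include <libpq-fe.h>

/*
 * Sketch: wait for a query result by polling the raw socket. If an SSL
 * record larger than 8kB leaves decrypted bytes buffered inside the SSL
 * library, the socket itself can be empty while PQisBusy() still returns
 * true -- and then this select() never wakes up.
 */
static PGresult *
await_result(PGconn *conn)
{
	int			sock = PQsocket(conn);

	while (PQisBusy(conn))
	{
		fd_set		readable;

		FD_ZERO(&readable);
		FD_SET(sock, &readable);
		if (select(sock + 1, &readable, NULL, NULL, NULL) < 0)
			return NULL;

		if (!PQconsumeInput(conn))	/* reads whatever the socket has */
			return NULL;
	}
	return PQgetResult(conn);
}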
Regards, Lars
[1]: https://github.com/ged/ruby-pg/issues/325
[2]: https://github.com/ged/ruby-pg/issues/588
[3]: https://github.com/ged/ruby-pg/issues/583
[4]: https://github.com/ged/ruby-pg/issues/325#issuecomment-737561270
Attachments:
0001-libpq-Process-buffered-SSL-read-bytes-to-support-rec.patch (text/x-patch)
From ab793829a4ce473f1cc2bbc0e2a6f6753553255d Mon Sep 17 00:00:00 2001
From: Lars Kanis <lars@greiz-reinsdorf.de>
Date: Sun, 8 Sep 2024 13:59:05 +0200
Subject: [PATCH] libpq: Process buffered SSL read bytes to support records
>8kB on async API
The async API of libpq doesn't support SSL record sizes >8kB so far.
This size isn't exceeded by vanilla PostgreSQL, but it is by other
products using the postgres wire protocol 3.
PQconsumeInput() reads all data readable from the socket, so that the read
condition is cleared.
But it doesn't process all the data that is pending on the SSL layer.
A subsequent call to PQisBusy() doesn't process it either, so the client
is told to wait for more readable data on the socket.
But that data never arrives, so the connection blocks indefinitely.
To fix this issue, call pqReadData() repeatedly until there is no buffered
SSL data left to be read.
The synchronous libpq API isn't affected, since it already supports
arbitrary SSL record sizes.
---
src/interfaces/libpq/fe-exec.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 0d224a852..637894ee1 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2006,6 +2006,19 @@ PQconsumeInput(PGconn *conn)
if (pqReadData(conn) < 0)
return 0;
+ #ifdef USE_SSL
+ /*
+ * Ensure all buffered read bytes in the SSL library are processed,
+ * which might not be the case if the SSL record size exceeds 8kB.
+ * Otherwise parseInput can't process the data.
+ */
+ while (conn->ssl_in_use && pgtls_read_pending(conn))
+ {
+ if (pqReadData(conn) < 0)
+ return 0;
+ }
+ #endif
+
/* Parsing of the data waits till later. */
return 1;
}
--
2.43.0
On Sun, Sep 8, 2024 at 1:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
I'm the maintainer of ruby-pg, the Ruby interface to the PostgreSQL
database. This binding uses the asynchronous API of libpq by default to
facilitate Ruby's IO wait and scheduling mechanisms.
This works well with the vanilla PostgreSQL server, but it leads to
starvation with other types of servers using the PostgreSQL wire
protocol 3. This is because the current functioning of the libpq async
interface depends on a maximum SSL record size of 8kB.
Thanks for the report! I wanted evidence that this wasn't a
ruby-pg-specific problem, so I set up a test case with
Python/psycopg2.
I was able to reproduce a hang when all of the following were true:
- psycopg2's async mode was enabled
- the client performs a PQconsumeInput/PQisBusy loop, waiting on
socket read events when the connection is busy (I used
psycopg2.extras.wait_select() for this)
- the server splits a large message over many large TLS records
- the server packs the final ReadyForQuery message into the same
record as the split message's final fragment
Gory details of the packet sizes, if it's helpful:
- max TLS record size is 12k, because it made the math easier
- server sends DataRow of 32006 bytes, followed by DataRow of 806
bytes, followed by CommandComplete/ReadyForQuery
- so there are three TLS records on the wire containing
1) DataRow 1 fragment 1 (12k bytes)
2) DataRow 1 fragment 2 (12k bytes)
3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
+ CommandComplete + ReadyForQuery
To fix this issue, the attached patch calls pqReadData() repeatedly in
PQconsumeInput() until there is no buffered SSL data left to be read.
Another solution could be to process buffered SSL read bytes in
PQisBusy() instead of PQconsumeInput().
I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?
I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?
https://commitfest.postgresql.org/50/
Thanks,
--Jacob
Thank you Jacob for verifying this issue!
Gory details of the packet sizes, if it's helpful:
- max TLS record size is 12k, because it made the math easier
- server sends DataRow of 32006 bytes, followed by DataRow of 806 bytes,
followed by CommandComplete/ReadyForQuery
- so there are three TLS records on the wire containing
1) DataRow 1 fragment 1 (12k bytes)
2) DataRow 1 fragment 2 (12k bytes)
3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
+ CommandComplete + ReadyForQuery
How did you verify the issue on the server side - with YugabyteDB or
with a modified Postgres server? I'd like to verify the GSSAPI part and
I'm familiar with the Postgres server only.
I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?
Which other sites do you mean? The synchronous transfer already works,
since the select() is short-circuited in case of pending bytes: [1]
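For reference, this is the check in pqSocketCheck() that the link points at
(quoted from fe-misc.c):

#ifdef USE_SSL
	/* Check for SSL library buffering read bytes */
	if (forRead && conn->ssl_in_use && pgtls_read_pending(conn))
	{
		/* short-circuit the select */
		return 1;
	}
#endif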
I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?
No problem; registered: https://commitfest.postgresql.org/50/5251/
--
Regards, Lars
[1]: https://github.com/postgres/postgres/blob/77761ee5dddc0518235a51c533893e81e5f375b9/src/interfaces/libpq/fe-misc.c#L1070
On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
How did you verify the issue on the server side - with YugabyteDB or
with a modified Postgres server? I'd like to verify the GSSAPI part and
I'm familiar with the Postgres server only.
Neither, unfortunately -- I have a protocol testbed that I use for
this kind of stuff. I'm happy to share once I get it cleaned up, but
it's unlikely to help you in this case since I haven't implemented
gssenc support. Patching the Postgres server itself seems like a good
way to go.
And are there any other sites that
need to make the same guarantee before returning?
Which other sites do you mean?
I'm mostly worried that other parts of libpq might assume that a
single call to pqReadData will drain the buffers. If not, great! --
but I haven't had time to check all the call sites.
I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?
No problem; registered: https://commitfest.postgresql.org/50/5251/
Thank you!
--Jacob
On Thu, 12 Sept 2024 at 04:30, Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
How did you verify the issue on the server side - with YugabyteDB or
with a modified Postgres server? I'd like to verify the GSSAPI part and
I'm familiar with the Postgres server only.
Neither, unfortunately -- I have a protocol testbed that I use for
this kind of stuff. I'm happy to share once I get it cleaned up, but
it's unlikely to help you in this case since I haven't implemented
gssenc support. Patching the Postgres server itself seems like a good
way to go.
And are there any other sites that
need to make the same guarantee before returning?
Which other sites do you mean?
I'm mostly worried that other parts of libpq might assume that a
single call to pqReadData will drain the buffers. If not, great! --
but I haven't had time to check all the call sites.
@Jacob, could you find some time to wrap this up? This will help us
assess whether it can be refined into a committable state soon.
Regards,
Vignesh
On Sun, Mar 16, 2025 at 6:14 AM vignesh C <vignesh21@gmail.com> wrote:
@Jacob, could you find some time to wrap this up? This will help us
assess whether it can be refined into a committable state soon.
This is a bug report that likely requires backpatching; feature freeze
should not be an issue here.
Thanks,
--Jacob
On Tue, Sep 10, 2024 at 11:49 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
I need to switch away from this for a bit.
"a bit"
In an effort to get this unstuck (and ideally solved during this
commitfest) here are my most recent thoughts:
I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?
So, Postgres internals cheat: calls to
pqWait/pqWaitTimed/pqSocketCheck() all reach down into the SSL buffer
to see if there are any pending bytes before polling the socket. I do
not yet understand why this protection is not extended to
GSS-encrypted connections.
In any case, our clients can't make use of that protection either, so
it's unfortunate that we're relying on it ourselves, because it makes
reasoning about a fix more difficult. I'm not about to propose
removing that case in a backport, but as long as it's there covering
up the problem, it's hard to say whether we've truly fixed this bug.
I think the hang could happen in any situation where a third-party
client calls into a function which calls pqReadData(), and then
immediately polls the socket for a read-ready condition. So: what are
the APIs that could be called that way?
Clearly PQconsumeInput() is one. PQconnectPoll() seems like it should
be exposed as well. postgres_fdw's use of PQgetResult() might qualify
(or if it doesn't, I don't understand why not -- there is at least one
path in PQgetResult() where the buffer is not fully consumed). And
nonblocking mode seems like it might sprinkle this hazard into more
places, as pqSendSome() will then start reading data and returning
before a pqWait().
So, to fix this once and for all, do we have to make sure that
pqReadData() always fully drains the SSL/GSS transport buffer instead
of eagerly returning?
--Jacob
On Tue, Jul 1, 2025 at 1:42 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
I do
not yet understand why this protection is not extended to
GSS-encrypted connections.
After repurposing some of my test code for d98cefe11, I'm able to
reproduce the hang with gssencmode when the server uses a
smaller-than-standard (12k) send buffer. Same reproduction case as the
original (32006-byte DataRow, 806-byte DataRow, CommandComplete,
ReadyForQuery).
So a complete patch for this has to consider GSS as well.
So, to fix this once and for all, do we have to make sure that
pqReadData() always fully drains the SSL/GSS transport buffer instead
of eagerly returning?
I'll work on proving that code paths other than PQconsumeInput() are
affected. If they are, I'll start a patch for pqReadData().
--Jacob
On Sun, Sep 8, 2024 at 2:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
To fix this issue, the attached patch calls pqReadData() repeatedly in
PQconsumeInput() until there is no buffered SSL data left to be read.
Another solution could be to process buffered SSL read bytes in
PQisBusy() instead of PQconsumeInput().
I've definitely seen this problem in AWS RDS, or something related to
this. I'm using dblink to set up lots of connections and using
dblink_is_busy (which calls PQconsumeInput). If there is lots of data
being returned, say, with large results or log information, the server
would stall for unknown reasons (sometimes for several seconds or more)
while peeling off the 8kB fragments. Point being, PQconsumeInput() not
draining the buffers can lead to major performance problems that are hard
to troubleshoot. I would argue this is a bug, really.
merlin
On Wed, Jul 2, 2025 at 4:12 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
I'll work on proving that code paths other than PQconsumeInput() are
affected. If they are, I'll start a patch for pqReadData().
Here's one way for a server implementation to hang the client during
PQconnectPoll(): accept the SSLRequest and immediately send a protocol
2.0 error message of (16k + 1) bytes, but split the message across two
TLS records of 8k and (8k + 1) bytes respectively, and wait for the client to
close the connection.
This case works fine with synchronous connect, but it fails with
async. libpq will fill the first 8k of its 16k input buffer with the
first record. It fills the second half with all but one byte of the
second record, and since it doesn't find a null terminator anywhere in
the buffer, it tells the client to poll on the socket again. Since the
server in this case is waiting for the client to close first, no
additional socket activity is coming, and the connection deadlocks
with one byte pending in the SSL buffer.
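Spelled out as a tiny standalone sketch of the byte accounting (the sizes
match the scenario above):

#include <assert.h>

int
main(void)
{
	int			inbuffer = 16 * 1024;	/* libpq's initial input buffer */
	int			record1 = 8 * 1024;		/* first TLS record */
	int			record2 = 8 * 1024 + 1; /* second TLS record */
	int			message = record1 + record2;	/* the 16k+1-byte error */

	/* record 1 fits entirely; record 2 all but its last byte */
	int			stranded = message - inbuffer;

	assert(stranded == 1);		/* one decrypted byte stuck in OpenSSL */
	return 0;
}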
I use 2.0 as an example; libpq avoids getting stuck in 3.0 by
enlarging the input buffer to the size of the incoming message. But
that's supposed to be a performance optimization, not anti-deadlock
protection. Even if we were to enshrine the buffer sizes we use in the
protocol specification itself [1], someone's still likely to trip over
this hazard if they start adjusting those performance optimizations.
(Even our use of a 16k initial buffer is itself just an optimization.)
So I think pqReadData() needs to make the same guarantees for SSL/GSS
that it does for plain TCP: when it returns, either 1) there are no
bytes left pending or 2) the socket itself is marked readable.
Otherwise I think we'll continue to chase weird corner cases.
What I'm not sure about is unforeseen consequences -- could fixing
this expose other latent bugs? If we're concerned about that, we might
have to tier the backports to make sure everything remains stable.
Here are the three bugs discussed so far, with the fix for 3 possibly
subsuming 1 and 2 but also affecting more code:
1) pqSocketCheck() checks for pending SSL bytes but not GSS bytes
2) PQconsumeInput() does not drain all pending bytes
3) pqReadData() does not drain all pending bytes
Thanks,
--Jacob
[1]: We should not do that.
On Tue, Jul 15, 2025 at 4:31 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
Otherwise I think we'll continue to chase weird corner cases.
Agreed. Here's a little more detail on the case I noticed:
* postgres backend thread managing several libpq connections, with polling
is_busy loop
* when client pushed a lot of log messages (say, with 'RAISE NOTICE'), the
server would stall for significant periods of time, sometimes minutes --
during that time, there would be no activity in the log (the server doesn't
do anything interesting outside of the polling loop)
* checking backend server log, which relogs the client raised logs via
dblink, each pause happened after roughly 16k bytes
* reducing the log verbosity made the problem go away.
merlin
Hi,
On 2025-07-15 15:31:25 -0700, Jacob Champion wrote:
So I think pqReadData() needs to make the same guarantees for SSL/GSS
that it does for plain TCP: when it returns, either 1) there are no
bytes left pending or 2) the socket itself is marked readable.
Otherwise I think we'll continue to chase weird corner cases.
It's not really the same issue, as at least my reproducer requires modifying
libpq (and thus clearly isn't a bug), but I thought it might still be
interesting. For one, I'm not sure this can only be triggered with the
modifications.
If one modifies libpq to use openssl readahead (which does result in speedups,
because otherwise openssl thinks it's useful to do lots of 5 byte reads from
the socket), I see occasional hangs in libpq.
diff --git i/src/interfaces/libpq/fe-secure-openssl.c w/src/interfaces/libpq/fe-secure-openssl.c
index 51dd7b9fec0..a8deb28594c 100644
--- i/src/interfaces/libpq/fe-secure-openssl.c
+++ w/src/interfaces/libpq/fe-secure-openssl.c
@@ -864,6 +864,9 @@ initialize_SSL(PGconn *conn)
*/
SSL_CTX_set_mode(SSL_context, SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER);
+ SSL_CTX_set_read_ahead(SSL_context, 1);
+ SSL_CTX_set_default_read_buffer_len(SSL_context, 64*1024);
+
/*
* If the root cert file exists, load it so we can perform certificate
* verification. If sslmode is "verify-full" we will also do further
ninja src/bin/psql/psql && src/bin/psql/psql -Xq -h localhost -c '\copy pg_class to stdout'
hangs reliably for me (if it's going to a file it's less reliable, so there's
a timing aspect here).
At that point psql is not interruptible, which isn't great either...
Backtrace:
#0 __internal_syscall_cancel (a1=a1@entry=140726731206072, a2=a2@entry=1, a3=-1, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=7)
at ./nptl/cancellation.c:44
#1 0x00007f1e25e3f6ad in __syscall_cancel (a1=a1@entry=140726731206072, a2=a2@entry=1, a3=<optimized out>, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=7)
at ./nptl/cancellation.c:75
#2 0x00007f1e25eb39c6 in __GI___poll (fds=fds@entry=0x7ffd7ed2edb8, nfds=nfds@entry=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#3 0x00007f1e26248dea in poll (__fds=0x7ffd7ed2edb8, __nfds=1, __timeout=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:44
#4 PQsocketPoll (sock=<optimized out>, forRead=<optimized out>, forWrite=<optimized out>, end_time=<optimized out>)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1175
#5 0x00007f1e262491a8 in pqSocketCheck (conn=conn@entry=0x55f69b7e68f0, forRead=1, forWrite=0, end_time=-1)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1114
#6 0x00007f1e262492b3 in pqWaitTimed (forRead=<optimized out>, forWrite=<optimized out>, conn=0x55f69b7e68f0, end_time=<optimized out>)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1039
#7 0x00007f1e262492f1 in pqWait (forRead=forRead@entry=1, forWrite=forWrite@entry=0, conn=conn@entry=0x55f69b7e68f0)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1021
#8 0x00007f1e26234c46 in pqGetCopyData3 (conn=conn@entry=0x55f69b7e68f0, buffer=buffer@entry=0x7ffd7ed2efb8, async=async@entry=0)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-protocol3.c:1846
#9 0x00007f1e26243d98 in PQgetCopyData (conn=conn@entry=0x55f69b7e68f0, buffer=buffer@entry=0x7ffd7ed2efb8, async=async@entry=0)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-exec.c:2840
#10 0x000055f65d28f9c6 in handleCopyOut (conn=0x55f69b7e68f0, copystream=copystream@entry=0x7f1e25f985c0 <_IO_2_1_stdout_>, res=res@entry=0x7ffd7ed2f008)
at ../../../../../home/andres/src/postgresql/src/bin/psql/copy.c:442
#11 0x000055f65d27b07d in HandleCopyResult (resultp=resultp@entry=0x7ffd7ed2f158, copystream=0x7f1e25f985c0 <_IO_2_1_stdout_>)
at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:950
#12 0x000055f65d27dab8 in ExecQueryAndProcessResults (query=query@entry=0x55f69b8b3470 "COPY pg_attribute TO STDOUT ",
elapsed_msec=elapsed_msec@entry=0x7ffd7ed2f1b8, svpt_gone_p=svpt_gone_p@entry=0x7ffd7ed2f1b7, is_watch=is_watch@entry=false, min_rows=min_rows@entry=0,
opt=opt@entry=0x0, printQueryFout=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:1934
#13 0x000055f65d27cdc9 in SendQuery (query=0x55f69b8b3470 "COPY pg_attribute TO STDOUT ")
at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:1212
#14 0x000055f65d28f6dd in do_copy (args=args@entry=0x55f69b8b1f10 "pg_attribute to stdout")
at ../../../../../home/andres/src/postgresql/src/bin/psql/copy.c:370
#15 0x000055f65d299f24 in exec_command_copy (scan_state=scan_state@entry=0x55f69b8a71f0, active_branch=active_branch@entry=true)
at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:951
#16 0x000055f65d2a2065 in exec_command (cmd=cmd@entry=0x55f69b8b3310 "copy", scan_state=scan_state@entry=0x55f69b8a71f0, cstack=cstack@entry=0x55f69b887ba0,
query_buf=query_buf@entry=0x0, previous_buf=previous_buf@entry=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:338
#17 0x000055f65d2a2b59 in HandleSlashCmds (scan_state=scan_state@entry=0x55f69b8a71f0, cstack=cstack@entry=0x55f69b887ba0, query_buf=query_buf@entry=0x0,
previous_buf=previous_buf@entry=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:241
#18 0x000055f65d29305f in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../home/andres/src/postgresql/src/bin/psql/startup.c:409
It does seem ... weird ... that pqGetCopyData3() just processes whatever is
already in the libpq buffer and then waits for the socket to become readable,
before ever calling pqReadData(). What guarantees, even without the change
above, that a) there's no pending data inside openssl b) that we're not
waiting for a write?
If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
if it returns 1, the hang vanishes.
The same
if (pqWait(true, false, conn) ||
pqReadData(conn) < 0)
return -2;
pattern (without a preceding pqReadData()) exists in several other places :(
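Sketched against that pattern, the change I tested looks roughly like this
(illustrative only; declarations and the surrounding loop are as in
pqGetCopyData3()):

		int			nread;

		/* drain anything already decrypted before blocking on the socket */
		nread = pqReadData(conn);
		if (nread > 0)
			continue;			/* made progress without waiting; re-parse */
		if (nread < 0)
			return -2;

		if (pqWait(true, false, conn) ||
			pqReadData(conn) < 0)
			return -2;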
What I'm not sure about is unforeseen consequences -- could fixing
this expose other latent bugs? If we're concerned about that, we might
have to tier the backports to make sure everything remains stable.
Here are the three bugs discussed so far, with the fix for 3 possibly
subsuming 1 and 2 but also affecting more code:
1) pqSocketCheck() checks for pending SSL bytes but not GSS bytes
2) PQconsumeInput() does not drain all pending bytes
3) pqReadData() does not drain all pending bytes
What are the limits for the maximum amount of data this could make us buffer
in addition to what we are buffering right now? It's not entirely obvious to
me that a loop around pqReadData() as long as there is pending data couldn't
make us buffer a lot of data.
Do you have a WIP patch?
Greetings,
Andres Freund
On Wed, Jul 16, 2025 at 7:36 AM Merlin Moncure <mmoncure@gmail.com> wrote:
Agreed. Here's a little more detail on the case I noticed:
* postgres backend thread managing several libpq connections, with polling is_busy loop
* when client pushed a lot of log messages (say, with 'RAISE NOTICE'), the server would stall for significant periods of time, sometimes minutes -- during that time, there would be no activity in the log (the server doesn't do anything interesting outside of the polling loop)
* checking backend server log, which relogs the client raised logs via dblink, each pause happened after roughly 16k bytes
* reducing the log verbosity made the problem go away.
Thanks for this! I'm not sure I'll be able to prove that this is
related, as opposed to some other potential problem, but it's good to
have the information.
--Jacob
On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
If one modifies libpq to use openssl readahead (which does result in speedups,
because otherwise openssl think it's useful to do lots of 5 byte reads from
the socket), I see occasional hangs in libpq.
Now that is a very interesting coincidence, because the patch I'm
working on is about to add an assertion that readahead is turned
*off*. :D
Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them. It would even hide those bytes from the pending
count:
Therefore, it is possible for no more bytes to be readable from the
underlying BIO (because OpenSSL has already read them) and for
SSL_pending() to return 0, even though readable application data bytes
are available (because the data is in unprocessed buffered records).
That seems like the polar opposite of what we want. If clients are
polling on the raw socket, those pending bytes must be accounted for.
It does seem ... weird ... that pqGetCopyData3() just processes whatever is
already in the libpq buffer and then waits for the socket to become readable,
before ever calling pqReadData(). What guarantees, even without the change
above, that a) there's no pending data inside openssl b) that we're not
waiting for a write?
(a) is the bug discussed here, AFAICT? I have not considered (b) yet,
but that seems like a separate problem.
If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
if it returns 1, the hang vanishes.
Sure, but that's cheating. I think I'd expect that adding extra
"unnecessary" reads would help cover up the problem that readahead has
caused.
What are the limits for the maximum amount of data this could make us buffer
in addition to what we are buffering right now?
I still need to make sure, but I believe it should be an additional
16k due to the current limits on GSS token sizes and TLS record sizes.
In normal operation, I think we basically immediately double the input
buffer sizes (which start at 16k) as soon as a long message comes in.
So that order of magnitude didn't immediately seem worrisome to me.
It's not entirely obvious to
me that a loop around pqReadData() as long as there is pending data couldn't
make us buffer a lot of data.
FWIW, I'm not planning to loop around it; I'm planning to add a
pgsecure_* primitive that drains whatever the transport is holding
onto.
Do you have a WIP patch?
I'm working on one now.
--Jacob
[1]: https://docs.openssl.org/3.5/man3/SSL_pending/#synopsis
Hi,
On 2025-07-16 11:50:46 -0700, Jacob Champion wrote:
On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
If one modifies libpq to use openssl readahead (which does result in speedups,
because otherwise openssl thinks it's useful to do lots of 5 byte reads from
the socket), I see occasional hangs in libpq.
Now that is a very interesting coincidence, because the patch I'm
working on is about to add an assertion that readahead is turned
*off*. :D
Heh.
Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them.
Right - but it also substantially reduces the number of syscalls :(. I'm not
sure that it's wise to close that door permanently. Clearly it's not something
we need to care about in the back branches.
Without it openssl will do one read for the record length, and another read
for the record contents. I.e. it'll often double the number of syscalls.
Without readahead:
perf stat -e raw_syscalls:sys_enter,cycles,context-switches src/bin/pgbench/pgbench -c $c -j$j -M prepared -n -P 1 -f <(echo 'select 1') -h localhost -t1000
5,561 raw_syscalls:sys_enter
89,125,976 cycles
1,009 context-switches
With readahead:
4,556 raw_syscalls:sys_enter
83,582,571 cycles
1,007 context-switches
It would even hide those bytes from the pending
count:
Therefore, it is possible for no more bytes to be readable from the
underlying BIO (because OpenSSL has already read them) and for
SSL_pending() to return 0, even though readable application data bytes
are available (because the data is in unprocessed buffered records).
It's rather depressing, like, how did this API behaviour come to be?
Luckily there seems to be SSL_has_pending():
SSL_has_pending() returns 1 if s has buffered data (whether processed or
unprocessed) and 0 otherwise
And it seems old enough:
The SSL_has_pending() function was added in OpenSSL 1.1.0.
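Side by side, the semantics per those docs (a sketch; ssl stands for the
connection's SSL handle):

	/*
	 * SSL_pending(): bytes of already-processed application data that the
	 * next SSL_read() can return without touching the BIO. With read_ahead
	 * enabled this can be 0 while whole records sit unprocessed in openssl's
	 * buffer.
	 */
	int			ready = SSL_pending(ssl);

	/*
	 * SSL_has_pending(): 1 if any data is buffered, processed or not
	 * (OpenSSL >= 1.1.0). It says a read might make progress, not how many
	 * bytes it could return.
	 */
	int			buffered = SSL_has_pending(ssl);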
It does seem ... weird ... that pqGetCopyData3() just processes whatever is
already in the libpq buffer and then waits for the socket to become readable,
before ever calling pqReadData(). What guarantees, even without the change
above, that a) there's no pending data inside openssl b) that we're not
waiting for a write?
(a) is the bug discussed here, AFAICT?
I'm not entirely sure, tbh.
If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
if it returns 1, the hang vanishes.
Sure, but that's cheating. I think I'd expect that adding extra
"unnecessary" reads would help cover up the problem that readahead has
caused.
To me the pattern in our code seems bogus, even without readahead, no?
getCopyDataMessage() can return 0 because our internal buffer is empty,
without ever reading from the socket or openssl. But if the SSL record
contained two messages (or parts of two messages), all the required data may
be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
forever. That's a problem unrelated to readahead, no?
I.e. the extra read actually ensures that we're not doing a pqWait() without
knowing whether we need one.
Greetings,
Andres Freund
On Wed, Jul 16, 2025 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them.
Right - but it also substantially reduces the number of syscalls :(. I'm not
sure that it's wise to close that door permanently.
I think it's very likely that OpenSSL's implementation of readahead is
fundamentally incompatible with our async implementation. For one, the
documented way to call SSL_read() without accidentally hitting the
socket is via SSL_pending(), which readahead is documented to break.
For two, if you're worried about how much data we could potentially
have to hold during a "drain all pending" operation, I think readahead
changes the upper bound from "the size of one TLS record" to
"SO_RCVBUF", doesn't it?
Without it openssl will do one read for the record length, and another read
for the record contents. I.e. it'll often double the number of syscalls.
That is unfortunate... but if we're talking about a
correctness/performance tradeoff then I'm somewhat less sympathetic to
the performance side.
Luckily there seems to be SSL_has_pending():
SSL_has_pending() returns 1 if s has buffered data (whether processed or
unprocessed) and 0 otherwise
IIUC, we can't use SSL_has_pending() either. I need to know exactly
how many bytes to drain out of the userspace buffer without
introducing more bytes into it.
To me the pattern in our code seems bogus, even without readahead, no?
getCopyDataMessage() can return 0 because our internal buffer is empty,
without ever reading from the socket or openssl. But if the SSL record
contained two messages (or parts of two messages), all the required data may
be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
forever.
I think the "cheating" in pqSocketCheck() that I mentioned upthread
ensures that pqWait() will see the OpenSSL data and return
immediately. (That's a major architectural smell IMO, but it is there.
Just not for GSS, which might explain some hangs discussed on the list
a while back.)
I.e. the extra read actually ensures that we're not doing a pqWait() without
knowing whether we need one.
I don't think so, because (without my WIP patch) you still can't
guarantee that the single call to pqReadData() got all of the
readahead data. And with my WIP patch, you don't need the extra call,
because the previous call to pqReadData() will have ensured that
there's no leftover SSL buffer to begin with. Am I missing something?
Thanks!
--Jacob
Hi,
On 2025-07-16 15:25:14 -0700, Jacob Champion wrote:
On Wed, Jul 16, 2025 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them.
Right - but it also substantially reduces the number of syscalls :(. I'm not
sure that it's wise to close that door permanently.
I think it's very likely that OpenSSL's implementation of readahead is
fundamentally incompatible with our async implementation. For one, the
documented way to call SSL_read() without accidentally hitting the
socket is via SSL_pending(), which readahead is documented to break.
Why do we care about not hitting the socket? We always operate the socket in
non-blocking mode anyway?
FWIW, I have seen too many folks move away from encrypted connections for
backups due to openssl being the bottleneck to just accept that we'll never
go beyond the worst performance. Any actually decently performing networking
will need to divorce when socket IO happens from when openssl sees the data
much more heavily. Just relying more and more on extremely tight coupling is a
dead end.
For two, if you're worried about how much data we could potentially
have to hold during a "drain all pending" operation, I think readahead
changes the upper bound from "the size of one TLS record" to
"SO_RCVBUF", doesn't it?
It seems to be a configurable limit, with
SSL_set_default_read_buffer_len(). But the default is to just read up to a
whole frame's worth of data, instead of reading the length
separately. I.e. an irrelevant amount of memory.
Without it openssl will do one read for the record length, and another read
for the record contents. I.e. it'll often double the number of syscalls.
That is unfortunate... but if we're talking about a
correctness/performance tradeoff then I'm somewhat less sympathetic to
the performance side.
I'm obviously not advocating for incorrectness.
Luckily there seems to be SSL_has_pending():
SSL_has_pending() returns 1 if s has buffered data (whether processed or
unprocessed) and 0 otherwise
IIUC, we can't use SSL_has_pending() either. I need to know exactly
how many bytes to drain out of the userspace buffer without
introducing more bytes into it.
Why? The only case where we ought to care about whether pending data exists
inside openssl is if our internal buffer is either empty or doesn't contain
the entire message we're trying to consume. In either of those two cases we
can just consume some data from openssl, without caring how much it precisely
is.
To me the pattern in our code seems bogus, even without readahead, no?
getCopyDataMessage() can return 0 because our internal buffer is empty,
without ever reading from the socket or openssl. But if the SSL record
contained two messages (or parts of two messages), all the required data may
be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
forever.
I think the "cheating" in pqSocketCheck() that I mentioned upthread
ensures that pqWait() will see the OpenSSL data and return
immediately. (That's a major architectural smell IMO, but it is there.
Just not for GSS, which might explain some hangs discussed on the list
a while back.)
I guess so, but it's hard to know without a patch.
I.e. the extra read actually ensures that we're not doing a pqWait() without
knowing whether we need one.
I don't think so, because (without my WIP patch) you still can't
guarantee that the single call to pqReadData() got all of the
readahead data.
It wouldn't need to get all the data in one pqReadData() in this case, as we'd
just loop around and do the whole thing again, until the entire message is
read in (regardless of whether there's remaining data in a partially consumed
record, or in the "unprocessed buffer").
Greetings,
Andres Freund
On Wed, Jul 16, 2025 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
Why do we care about not hitting the socket? We always operate the socket in
non-blocking mode anyway?
IIUC, that would change pqReadData() from a bounded read to an
unbounded one. Then we have to somehow protect against a server that
can send data faster than we can process it.
FWIW, I have seen too many folks move away from encrypted connections for
backups due to openssl being the bottleneck to just accept that we'll never
go beyond the worst performance. Any actually decently performing networking
will need to divorce when socket IO happens from when openssl sees the data
much more heavily.
Can't we make our custom BIO do that?
Just relying more and more on extremely tight coupling is a
dead end.
Why? This abstraction _must_ leak. Because of PQsocket() at minimum,
but also there's a bunch of code complexity around the read/write
behavior and EOF differences for a TLS connection. When I look at the
code in pqReadData() I am very sad.
TLS does not behave like raw TCP. So if there's no way to get rid of
the coupling, IMO we might as well use it.
For two, if you're worried about how much data we could potentially
have to hold during a "drain all pending" operation, I think readahead
changes the upper bound from "the size of one TLS record" to
"SO_RCVBUF", doesn't it?It seems to be a configurable limit, with
SSL_set_default_read_buffer_len(). But the default is to just read up to a
whole frame's worth of data, instead of reading the length
separately. I.e. an irrelevant amount of memory.
Hmm, okay.
IIUC, we can't use SSL_has_pending() either. I need to know exactly
how many bytes to drain out of the userspace buffer without
introducing more bytes into it.
Why? The only case where we ought to care about whether pending data exists
inside openssl is if our internal buffer is either empty or doesn't contain
the entire message we're trying to consume. In either of those two cases we
can just consume some data from openssl, without caring how much it precisely
is.
I can't tell if you're discussing the fix for this bug or a
hypothetical future architecture that makes readahead safe. I don't
think PQconnectPoll() behaves the way you're describing; there is no
retry-read concept.
The order of operations is:
- read some available data; don't care how much
- parse the message(s) if we have enough data
- otherwise tell the caller to poll for read
That's an architecture that's already very tightly coupled to the
behavior of un-userspace-buffered TCP/UNIX sockets. Which is why I'm
trying to have pqReadData() make compatible guarantees for SSL/GSS.
--Jacob
On Wed, Jul 16, 2025 at 11:50 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
Do you have a WIP patch?
I'm working on one now.
The attached still needs some documentation work, and closer
inspection of the new assertions and OpenSSL error handling, but I
think it's good enough to get my intent across.
0001 is a fix for pqWait+gssencmode, which is hopefully the easier of
the two to review/backport. The current behavior for GSS is, IMO, an
obvious oversight.
0002 adds drain semantics at the end of pqReadData(). If we were
writing this from scratch, it wouldn't look like this: there's no
point in having the transport layer split data off into an undersized
conn->inBuffer if we're just going to come right back and drain it all
anyway. But I'm hoping this is easier to reason about and verify,
compared to a bigger rewrite.
I've renamed the existing pgtls_read_pending() to
pgtls_read_is_pending(), to try to disambiguate it from the new
*_drain_pending() APIs, but I'm not particularly attached to my choice
of names.
Thanks,
--Jacob
Attachments:
0001-WIP-extend-TLS-read-pending-concept-to-GSS-too.patch (application/octet-stream)
From 5b459b24696419c8747c506c26a813288ef899f5 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Fri, 18 Jul 2025 10:06:13 -0700
Subject: [PATCH 1/2] WIP: extend "TLS read pending" concept to GSS too
pgtls_read_pending() is renamed to pgtls_read_is_pending(), to avoid
conflation with the forthcoming pgtls_drain_pending().
---
src/interfaces/libpq/fe-misc.c | 6 ++----
src/interfaces/libpq/fe-secure-gssapi.c | 6 ++++++
src/interfaces/libpq/fe-secure-openssl.c | 2 +-
src/interfaces/libpq/fe-secure.c | 19 +++++++++++++++++++
src/interfaces/libpq/libpq-int.h | 4 +++-
5 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index dca44fdc5d2..434216ff89f 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -1099,14 +1099,12 @@ pqSocketCheck(PGconn *conn, int forRead, int forWrite, pg_usec_time_t end_time)
return -1;
}
-#ifdef USE_SSL
- /* Check for SSL library buffering read bytes */
- if (forRead && conn->ssl_in_use && pgtls_read_pending(conn))
+ /* Check for SSL/GSS library buffering read bytes */
+ if (forRead && pqsecure_read_is_pending(conn))
{
/* short-circuit the select */
return 1;
}
-#endif
}
/* We will retry as long as we get EINTR */
diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
index bc9e1ce06fa..7c88e64cfd2 100644
--- a/src/interfaces/libpq/fe-secure-gssapi.c
+++ b/src/interfaces/libpq/fe-secure-gssapi.c
@@ -469,6 +469,12 @@ gss_read(PGconn *conn, void *recv_buffer, size_t length, ssize_t *ret)
return PGRES_POLLING_OK;
}
+bool
+pg_GSS_read_is_pending(PGconn *conn)
+{
+ return PqGSSResultLength > PqGSSResultNext;
+}
+
/*
* Negotiate GSSAPI transport for a connection. When complete, returns
* PGRES_POLLING_OK. Will return PGRES_POLLING_READING or
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index 51dd7b9fec0..8f975561e51 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -231,7 +231,7 @@ rloop:
}
bool
-pgtls_read_pending(PGconn *conn)
+pgtls_read_is_pending(PGconn *conn)
{
return SSL_pending(conn->ssl) > 0;
}
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index e686681ba15..94c97ec26fb 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -243,6 +243,25 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
return n;
}
+/*
+ * Returns true if there are any bytes available in the transport buffer.
+ */
+bool
+pqsecure_read_is_pending(PGconn *conn)
+{
+#ifdef USE_SSL
+ if (conn->ssl_in_use)
+ return pgtls_read_is_pending(conn);
+#endif
+#ifdef ENABLE_GSS
+ if (conn->gssenc)
+ return pg_GSS_read_is_pending(conn);
+#endif
+
+ /* Plaintext connections have no transport buffer. */
+ return 0;
+}
+
/*
* Write data to a secure connection.
*
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index 70c28f2ffca..51bc2d3b740 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -809,6 +809,7 @@ extern int pqWriteReady(PGconn *conn);
extern PostgresPollingStatusType pqsecure_open_client(PGconn *);
extern void pqsecure_close(PGconn *);
extern ssize_t pqsecure_read(PGconn *, void *ptr, size_t len);
+extern bool pqsecure_read_is_pending(PGconn *);
extern ssize_t pqsecure_write(PGconn *, const void *ptr, size_t len);
extern ssize_t pqsecure_raw_read(PGconn *, void *ptr, size_t len);
extern ssize_t pqsecure_raw_write(PGconn *, const void *ptr, size_t len);
@@ -847,7 +848,7 @@ extern ssize_t pgtls_read(PGconn *conn, void *ptr, size_t len);
/*
* Is there unread data waiting in the SSL read buffer?
*/
-extern bool pgtls_read_pending(PGconn *conn);
+extern bool pgtls_read_is_pending(PGconn *conn);
/*
* Write data to a secure connection.
@@ -895,6 +896,7 @@ extern PostgresPollingStatusType pqsecure_open_gss(PGconn *conn);
*/
extern ssize_t pg_GSS_write(PGconn *conn, const void *ptr, size_t len);
extern ssize_t pg_GSS_read(PGconn *conn, void *ptr, size_t len);
+extern bool pg_GSS_read_is_pending(PGconn *conn);
#endif
/* === in fe-trace.c === */
--
2.34.1
0002-WIP-add-a-drain-pending-concept-to-SSL-GSS.patch (application/octet-stream)
From 401d97a1b116839139685ed9a3ee22291bc44ea3 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Thu, 17 Jul 2025 08:45:45 -0700
Subject: [PATCH 2/2] WIP: add a "drain pending" concept to SSL/GSS
---
src/interfaces/libpq/fe-misc.c | 51 ++++++++++-
src/interfaces/libpq/fe-secure-gssapi.c | 27 ++++++
src/interfaces/libpq/fe-secure-openssl.c | 109 +++++++++++++++++++++++
src/interfaces/libpq/fe-secure.c | 51 ++++++++++-
src/interfaces/libpq/libpq-int.h | 7 ++
5 files changed, 242 insertions(+), 3 deletions(-)
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 434216ff89f..49cf32e1bb7 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -55,6 +55,7 @@ static int pqPutMsgBytes(const void *buf, size_t len, PGconn *conn);
static int pqSendSome(PGconn *conn, int len);
static int pqSocketCheck(PGconn *conn, int forRead, int forWrite,
pg_usec_time_t end_time);
+static int pqReadData_internal(PGconn *conn);
/*
* PQlibVersion: return the libpq version number
@@ -593,6 +594,13 @@ pqPutMsgEnd(PGconn *conn)
/* ----------
* pqReadData: read more data, if any is available
+ *
+ * Upon a successful return, callers may assume that either 1) all available
+ * bytes have been consumed from the socket, or 2) the socket is still marked
+ * readable by the OS. (In other words: after a successful pqReadData, it's safe
+ * to tell a client to poll for readable bytes on the socket without any further
+ * draining of the SSL/GSS transport buffers.)
+ *
* Possible return values:
* 1: successfully loaded at least one more byte
* 0: no data is presently available, but no error detected
@@ -605,8 +613,8 @@ pqPutMsgEnd(PGconn *conn)
int
pqReadData(PGconn *conn)
{
- int someread = 0;
- int nread;
+ int available;
+ bool pending;
if (conn->sock == PGINVALID_SOCKET)
{
@@ -614,6 +622,45 @@ pqReadData(PGconn *conn)
return -1;
}
+ available = pqReadData_internal(conn);
+ if (available < 0)
+ return -1;
+
+ /*
+ * Make sure there are no bytes stuck in layers between conn->inBuffer and
+ * the socket, to make it safe for clients to poll on PQsocket(). See
+ * pqsecure_drain_pending's documentation for details.
+ */
+ pending = pqsecure_read_is_pending(conn);
+
+ if (available && pending)
+ {
+ if (pqsecure_drain_pending(conn))
+ return -1;
+ }
+ else if (!available)
+ {
+ /*
+ * If we're not returning any bytes from the underlying transport,
+ * that must imply there aren't any in the transport buffer...
+ */
+ Assert(!pending);
+ }
+
+ return available;
+}
+
+/*
+ * Workhorse for pqReadData(). It's kept separate from the
+ * pqsecure_drain_pending() logic to avoid adding to this function's goto
+ * complexity.
+ */
+static int
+pqReadData_internal(PGconn *conn)
+{
+ int someread = 0;
+ int nread;
+
/* Left-justify any data in the buffer to make room */
if (conn->inStart < conn->inEnd)
{
diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
index 7c88e64cfd2..329c2559b88 100644
--- a/src/interfaces/libpq/fe-secure-gssapi.c
+++ b/src/interfaces/libpq/fe-secure-gssapi.c
@@ -475,6 +475,33 @@ pg_GSS_read_is_pending(PGconn *conn)
return PqGSSResultLength > PqGSSResultNext;
}
+int
+pg_GSS_drain_pending(PGconn *conn)
+{
+ int pending;
+
+ /* Figure out how many bytes to take off the connection. */
+ Assert(PqGSSResultLength >= PqGSSResultNext);
+ pending = PqGSSResultLength - PqGSSResultNext;
+
+ if (!pending)
+ {
+ /* Nothing to do. */
+ return 0;
+ }
+
+ /* Expand the input buffer if necessary. */
+ if (pqCheckInBufferSpace(conn->inEnd + (size_t) pending, conn))
+ return -1; /* errorMessage already set */
+
+ /* Now read the buffered data. */
+ memcpy(conn->inBuffer + conn->inEnd, PqGSSResultBuffer + PqGSSResultNext, pending);
+ conn->inEnd += pending;
+ PqGSSResultNext += pending;
+
+ return 0;
+}
+
/*
* Negotiate GSSAPI transport for a connection. When complete, returns
* PGRES_POLLING_OK. Will return PGRES_POLLING_READING or
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index 8f975561e51..b95e229bd08 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -236,6 +236,115 @@ pgtls_read_is_pending(PGconn *conn)
return SSL_pending(conn->ssl) > 0;
}
+/*
+ * Helper callback for use with ERR_print_errors_cb(). This appends the raw
+ * error queue to the provided PQExpBuffer, one entry per line.
+ *
+ * Note that this is not pretty output; it's meant for debugging.
+ */
+static int
+append_error_queue(const char *str, size_t len, void *u)
+{
+ PQExpBuffer buf = u;
+
+ appendBinaryPQExpBuffer(buf, str, len);
+ appendPQExpBufferChar(buf, '\n');
+
+ return 0;
+}
+
+int
+pgtls_drain_pending(PGconn *conn)
+{
+ int pending;
+ size_t drained;
+
+ /*
+ * OpenSSL readahead is documented to break SSL_pending(). Plus, we can't
+ * afford to have OpenSSL take bytes off the socket without processing
+ * them; that breaks the postconditions for pqsecure_drain_pending().
+ */
+ Assert(!SSL_get_read_ahead(conn->ssl));
+
+ /* Figure out how many bytes to take off the connection. */
+ pending = SSL_pending(conn->ssl);
+
+ if (!pending)
+ {
+ /* Nothing to do. */
+ return 0;
+ }
+ else if (pending < 0)
+ {
+ /* Shouldn't be possible, but don't let it mess up the math below. */
+ Assert(false);
+ libpq_append_conn_error(conn, "OpenSSL reports negative bytes pending");
+ return -1;
+ }
+ else if (pending == INT_MAX)
+ {
+ /*
+ * If we ever found a legitimate way to hit this, we'd need to loop
+ * around to call SSL_pending() again. Throw an error rather than
+ * complicate the code in that way, because SSL_read() should be
+ * bounded to the size of a single TLS record, and conn->inBuffer
+ * can't currently go past INT_MAX in size anyway.
+ */
+ libpq_append_conn_error(conn, "OpenSSL reports INT_MAX bytes pending");
+ return -1;
+ }
+
+ /* Expand the input buffer if necessary. */
+ if (pqCheckInBufferSpace(conn->inEnd + (size_t) pending, conn))
+ return -1; /* errorMessage already set */
+
+ /*
+ * Now read the buffered data.
+ *
+ * Don't defer to pgtls_read(); OpenSSL should guarantee that pending data
+ * comes off in a single call, and we don't want to use the more
+ * complicated read-loop behavior. We still have to manage the error
+ * queue.
+ */
+ ERR_clear_error();
+ if (!SSL_read_ex(conn->ssl, conn->inBuffer + conn->inEnd, pending, &drained))
+ {
+ int err = SSL_get_error(conn->ssl, 0);
+
+ /*
+ * Something is very wrong. Report the error code and the entirety of
+ * the error queue without any attempt at interpretation. Probably not
+ * worth complicating things for the sake of translation, either.
+ */
+ appendPQExpBuffer(&conn->errorMessage,
+ "unexpected error code %d while draining SSL buffer; ",
+ err);
+
+ if (ERR_peek_error())
+ {
+ appendPQExpBufferStr(&conn->errorMessage, "error queue follows:\n");
+ ERR_print_errors_cb(append_error_queue, &conn->errorMessage);
+ }
+ else
+ appendPQExpBufferStr(&conn->errorMessage,
+ "no error queue provided\n");
+
+ return -1;
+ }
+
+ /* Final consistency check. */
+ if (drained != pending)
+ {
+ libpq_append_conn_error(conn,
+ "drained only %zu of %d pending bytes in SSL buffer",
+ drained, pending);
+ return -1;
+ }
+
+ conn->inEnd += pending;
+ return 0;
+}
+
ssize_t
pgtls_write(PGconn *conn, const void *ptr, size_t len)
{
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index 94c97ec26fb..a522724373f 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -244,7 +244,9 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
}
/*
- * Returns true if there are any bytes available in the transport buffer.
+ * Returns true if there are any bytes available in the transport buffer. See
+ * pqsecure_drain_pending() for a more complete discussion of the concepts
+ * involved.
*/
bool
pqsecure_read_is_pending(PGconn *conn)
@@ -262,6 +264,53 @@ pqsecure_read_is_pending(PGconn *conn)
return 0;
}
+/*
+ * Drains any transport data that is already buffered in userspace and adds it
+ * to conn->inBuffer, enlarging inBuffer if necessary. The drain fails if
+ * inBuffer cannot be made to hold all available transport data.
+ *
+ * This operation is necessary to prevent deadlock, due to a layering violation
+ * designed into our asynchronous client API: pqReadData() and all the parsing
+ * routines above it receive data from the SSL/GSS transport buffer, but clients
+ * poll on the raw PQsocket() handle.
+ *
+ * If the
+ *
+ * Implementations should not attempt to read any more data from the socket
+ * while draining the transport buffer. After a successful return,
+ * pqsecure_read_is_pending() must return false.
+ */
+int
+pqsecure_drain_pending(PGconn *conn)
+{
+ int ret;
+
+#ifdef USE_SSL
+ if (conn->ssl_in_use)
+ {
+ ret = pgtls_drain_pending(conn);
+ }
+ else
+#endif
+#ifdef ENABLE_GSS
+ if (conn->gssenc)
+ {
+ ret = pg_GSS_drain_pending(conn);
+ }
+ else
+#endif
+ {
+ /* Plaintext connections have no transport buffer. */
+ ret = 0;
+ }
+
+ /* Keep the implementation honest. */
+ if (ret == 0)
+ Assert(!pqsecure_read_is_pending(conn));
+
+ return ret;
+}
+
/*
* Write data to a secure connection.
*
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index 51bc2d3b740..f89db453527 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -810,6 +810,7 @@ extern PostgresPollingStatusType pqsecure_open_client(PGconn *);
extern void pqsecure_close(PGconn *);
extern ssize_t pqsecure_read(PGconn *, void *ptr, size_t len);
extern bool pqsecure_read_is_pending(PGconn *);
+extern int pqsecure_drain_pending(PGconn *);
extern ssize_t pqsecure_write(PGconn *, const void *ptr, size_t len);
extern ssize_t pqsecure_raw_read(PGconn *, void *ptr, size_t len);
extern ssize_t pqsecure_raw_write(PGconn *, const void *ptr, size_t len);
@@ -850,6 +851,11 @@ extern ssize_t pgtls_read(PGconn *conn, void *ptr, size_t len);
*/
extern bool pgtls_read_is_pending(PGconn *conn);
+/*
+ * Reads any data waiting in the SSL read buffer into the connection buffer.
+ */
+extern int pgtls_drain_pending(PGconn *conn);
+
/*
* Write data to a secure connection.
*
@@ -897,6 +903,7 @@ extern PostgresPollingStatusType pqsecure_open_gss(PGconn *conn);
extern ssize_t pg_GSS_write(PGconn *conn, const void *ptr, size_t len);
extern ssize_t pg_GSS_read(PGconn *conn, void *ptr, size_t len);
extern bool pg_GSS_read_is_pending(PGconn *conn);
+extern int pg_GSS_drain_pending(PGconn *conn);
#endif
/* === in fe-trace.c === */
--
2.34.1
On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
The current behavior for GSS is, IMO, an
obvious oversight.
(A better way to word this would have been "clearly an oversight"; I
didn't mean to imply that the correct behavior is obvious. The
opposite, in fact.)
--Jacob
Hi,
On 2025-07-17 09:48:29 -0700, Jacob Champion wrote:
On Wed, Jul 16, 2025 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
Why do we care about not hitting the socket? We always operate the socket in
non-blocking mode anyway?
IIUC, that would change pqReadData() from a bounded read to an
unbounded one. Then we have to somehow protect against a server that
can send data faster than we can process it.
I don't see why it would have to. If a connection has sufficient data in its
buffer on the libpq level, we obviously don't need to read more.
FWIW, I have seen too many folks move away from encrypted connections for
backups due to openssl being the bottleneck to just accept that we'll never
go beyond the worst performance. Any actually decently performing networking
will need to divorce when socket IO happens and when openssl sees the data
much more heavily.Can't we make our custom BIO do that?
We can, but that actually just makes this issue *more* acute, not less. We
would obviously have to take the data buffered on the BIO level into account
before signalling to the libpq user that no data is ready.
FWIW, even with BIO level buffering, it turns out that openssl readahead is
*still* a decent performance boon; going into the BIO twice as many times
isn't free, even if the BIO does buffering.
Just relying more and more on extremely tight coupling is a dead end.
Why? This abstraction _must_ leak.
I don't really buy this. We have buffering in libpq, even without TLS. Most of
the issues we need to solve already exist for non-TLS connections (except for
the annoying issue of TLS introducing writes when doing reads and vice versa).
It's just that our implementation is bad.
TLS does not behave like raw TCP. So if there's no way to get rid of
the coupling, IMO we might as well use it.
I agree that they're not the same. I don't see what that has to do with
introducing a hard requirement for no buffering within openssl or layers below
(a future BIO that does caching).
IIUC, we can't use SSL_has_pending() either. I need to know how
exactly many bytes to drain out of the userspace buffer without
introducing more bytes into it.
Why? The only case where we ought to care about whether pending data exists
inside openssl is if our internal buffer is either empty or doesn't contain
the entire message we're trying to consume. In either of those two cases we
can just consume some data from openssl, without caring how much it precisely
is.
I can't tell if you're discussing the fix for this bug or a hypothetical
future architecture that makes readahead safe.
I don't know what your fix really looks like - afaict you haven't shared it. So
it's hard to actually comment with specifics to it.
I am not saying that a to-be-backpatched-fix needs to make openssl readahead
work, that'd be absurd. But I am concerned about more fundamentally entrenching
the idea that there is never any buffering below libpq's buffering; it'll
cause trouble down the line.
I don't think PQconnectPoll() behaves the way you're describing; there is no
retry-read concept.
FWIW, I don't care about what we do during connection establishment. Doing
lots of tiny reads or such will never matter for efficiency, compared to all
the other costs.
To me the relevant cases are the use of libpq on established connections,
either in a blocking way or in a non-blocking way.
For blocking use we have that gross SSL hack in place (i.e. the
pgtls_read_pending() check in pqSocketCheck()), but we obviously *don't* for
the non-blocking case - the problem of this thread.
Our APIs around the non-blocking use of libpq are pretty
bonkers:
- PQconsumeInput() doesn't tell you whether it actually consumed input
- We kind of document PQconsumeInput() to consume all the available input, but
it does no such thing; it just fills the internal buffer. Which might be
big, or it might not be, depending on what previously happened on the
connection.
- PQisBusy() returns true even if there's unconsumed data in the socket
There's really no sane way to write an event loop with this that doesn't do
unnecessary poll()s, because PQconsumeInput() might have consumed data, but
PQisBusy() might *still* return true, because not enough input has been
consumed.
I suspect that the most minimal way to fix this, without breaking the
external API, would be for PQisBusy() to try to read more data. But that's
also pretty gross.
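For concreteness, the kind of client loop I mean (a minimal hypothetical
sketch using only the public API; error handling trimmed):

#include <poll.h>
#include <libpq-fe.h>

/* Wait for the result of a previously sent query on a nonblocking
 * connection; NULL on error. Error handling trimmed for brevity. */
static PGresult *
await_result(PGconn *conn)
{
    struct pollfd pfd = {.fd = PQsocket(conn), .events = POLLIN};

    while (PQisBusy(conn))
    {
        /*
         * If PQconsumeInput() below left decrypted-but-unparsed bytes
         * inside the SSL library, the socket never becomes readable
         * again and this poll() blocks forever -- the hang at issue.
         */
        if (poll(&pfd, 1, -1) < 0)
            return NULL;

        if (!PQconsumeInput(conn))      /* read the socket into libpq */
            return NULL;
    }

    return PQgetResult(conn);
}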
Greetings,
Andres Freund
On Fri, Jul 18, 2025 at 12:55 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
I think we're talking past each other, so let me try to focus on just
a few items here. I'm happy to go back and respond point-by-point if
needed.
I don't know what your fix really looks like - afaict you haven't shared it. So
it's hard to actually comment with specifics to it.
Just upthread [1]https://postgr.es/m/CAOYmi+mD6EbDEYfwZON0FCUAvGO+2=jR2V4KQYx+d+g0ap0Amg@mail.gmail.com. Emails probably crossed while you were typing.
I am not saying that a to-be-backpatched-fix needs to make openssl readahead
work, that'd be absurd. But I am concerned about more fundamentally entrenching
the idea that there is never any buffering below libpq's buffering; it'll
cause trouble down the line.
And I'm not saying that I'm fundamentally opposed to a future
architecture that allows readahead. But I do want to pin behavior
that's required for safety in the _current_ architecture. We can unpin
it if the order of operations is changed; assertions that have been
added can always be deleted.
FWIW, I don't care about what we do during connection establishment.
I have to care, because upthread I've shown that we can deadlock there
too. My goal in this thread is to fix the deadlock generally, on all
branches.
--Jacob
[1]: https://postgr.es/m/CAOYmi+mD6EbDEYfwZON0FCUAvGO+2=jR2V4KQYx+d+g0ap0Amg@mail.gmail.com
On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
The attached still needs some documentation work
v2 does a bunch of commit message work, but I imagine it needs a good
bit of copy-editing for clarity.
I'm not in any hurry to smash this in. I think we still need
- independent verification of the architectural issue, to make sure
it's not any deeper or shallower than pqReadData()
- independent verification that this fixes the bugs that have been described
- measurement of the performance characteristics of the new code
- verification of the maximum amount of additional buffer memory that
can be consumed during the drain
- consensus that we want to maintain this new behavior
- discussion of what we want this code to look like going forward
Andres, does this patch help clarify my thoughts upthread? Ideally the
additional code isn't getting in the way of any future
rearchitectures, since it only pins the new requirement in the code
that needs it.
Thanks,
--Jacob
Attachments:
v2-0001-libpq-Extend-read-pending-check-from-SSL-to-GSS.patchapplication/octet-stream; name=v2-0001-libpq-Extend-read-pending-check-from-SSL-to-GSS.patchDownload
From 7b0ad8c9ac1d7e10d097f869f35be87aaf2bae08 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Fri, 18 Jul 2025 10:06:13 -0700
Subject: [PATCH v2 1/2] libpq: Extend "read pending" check from SSL to GSS
An extra check for pending bytes in the SSL layer has been part of
pqReadReady() for a very long time (79ff2e96d). But when GSS transport
encryption was added, it didn't receive the same treatment. (As
79ff2e96d notes, "The bug that I fixed in this patch is exceptionally
hard to reproduce reliably.")
Without that check, it's possible to hit a hang in gssencmode, if the
server splits a large libpq message such that the final message in a
streamed response is part of the same wrapped token as the split
message:
DataRowDataRowDataRowDataRowDataRowData
-- token boundary --
RowDataRowCommandCompleteReadyForQuery
If the split message takes up enough memory to nearly fill libpq's
receive buffer, libpq may return from pqReadData() before the later
messages are pulled out of the PqGSSRecvBuffer. Without additional
socket activity from the server, pqReadReady() (via pqSocketCheck())
will never again return true, hanging the connection.
Pull the pending-bytes check into the pqsecure API layer, where both SSL
and GSS now implement it.
Note that this does not fix the root problem! Third party clients of
libpq have no way to call pqsecure_read_is_pending() in their own
polling. This just brings the GSS implementation up to par with the
existing SSL workaround; a broader fix is left to a subsequent commit.
(However, pgtls_read_pending() is renamed to pgtls_read_is_pending(), to
avoid conflation with the forthcoming pgtls_drain_pending().)
Discussion: https://postgr.es/m/CAOYmi%2BmpymrgZ76Jre2dx_PwRniS9YZojwH0rZnTuiGHCsj0rA%40mail.gmail.com
---
src/interfaces/libpq/fe-misc.c | 6 ++----
src/interfaces/libpq/fe-secure-gssapi.c | 6 ++++++
src/interfaces/libpq/fe-secure-openssl.c | 2 +-
src/interfaces/libpq/fe-secure.c | 19 +++++++++++++++++++
src/interfaces/libpq/libpq-int.h | 4 +++-
5 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index dca44fdc5d2..434216ff89f 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -1099,14 +1099,12 @@ pqSocketCheck(PGconn *conn, int forRead, int forWrite, pg_usec_time_t end_time)
return -1;
}
-#ifdef USE_SSL
- /* Check for SSL library buffering read bytes */
- if (forRead && conn->ssl_in_use && pgtls_read_pending(conn))
+ /* Check for SSL/GSS library buffering read bytes */
+ if (forRead && pqsecure_read_is_pending(conn))
{
/* short-circuit the select */
return 1;
}
-#endif
}
/* We will retry as long as we get EINTR */
diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
index bc9e1ce06fa..7c88e64cfd2 100644
--- a/src/interfaces/libpq/fe-secure-gssapi.c
+++ b/src/interfaces/libpq/fe-secure-gssapi.c
@@ -469,6 +469,12 @@ gss_read(PGconn *conn, void *recv_buffer, size_t length, ssize_t *ret)
return PGRES_POLLING_OK;
}
+bool
+pg_GSS_read_is_pending(PGconn *conn)
+{
+ return PqGSSResultLength > PqGSSResultNext;
+}
+
/*
* Negotiate GSSAPI transport for a connection. When complete, returns
* PGRES_POLLING_OK. Will return PGRES_POLLING_READING or
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index 51dd7b9fec0..8f975561e51 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -231,7 +231,7 @@ rloop:
}
bool
-pgtls_read_pending(PGconn *conn)
+pgtls_read_is_pending(PGconn *conn)
{
return SSL_pending(conn->ssl) > 0;
}
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index e686681ba15..94c97ec26fb 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -243,6 +243,25 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
return n;
}
+/*
+ * Returns true if there are any bytes available in the transport buffer.
+ */
+bool
+pqsecure_read_is_pending(PGconn *conn)
+{
+#ifdef USE_SSL
+ if (conn->ssl_in_use)
+ return pgtls_read_is_pending(conn);
+#endif
+#ifdef ENABLE_GSS
+ if (conn->gssenc)
+ return pg_GSS_read_is_pending(conn);
+#endif
+
+ /* Plaintext connections have no transport buffer. */
+ return 0;
+}
+
/*
* Write data to a secure connection.
*
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index a701c25038a..c38c1ea0086 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -810,6 +810,7 @@ extern int pqWriteReady(PGconn *conn);
extern PostgresPollingStatusType pqsecure_open_client(PGconn *);
extern void pqsecure_close(PGconn *);
extern ssize_t pqsecure_read(PGconn *, void *ptr, size_t len);
+extern bool pqsecure_read_is_pending(PGconn *);
extern ssize_t pqsecure_write(PGconn *, const void *ptr, size_t len);
extern ssize_t pqsecure_raw_read(PGconn *, void *ptr, size_t len);
extern ssize_t pqsecure_raw_write(PGconn *, const void *ptr, size_t len);
@@ -848,7 +849,7 @@ extern ssize_t pgtls_read(PGconn *conn, void *ptr, size_t len);
/*
* Is there unread data waiting in the SSL read buffer?
*/
-extern bool pgtls_read_pending(PGconn *conn);
+extern bool pgtls_read_is_pending(PGconn *conn);
/*
* Write data to a secure connection.
@@ -896,6 +897,7 @@ extern PostgresPollingStatusType pqsecure_open_gss(PGconn *conn);
*/
extern ssize_t pg_GSS_write(PGconn *conn, const void *ptr, size_t len);
extern ssize_t pg_GSS_read(PGconn *conn, void *ptr, size_t len);
+extern bool pg_GSS_read_is_pending(PGconn *conn);
#endif
/* === in fe-trace.c === */
--
2.34.1
v2-0002-libpq-Drain-all-pending-bytes-from-SSL-GSS-during.patchapplication/octet-stream; name=v2-0002-libpq-Drain-all-pending-bytes-from-SSL-GSS-during.patchDownload
From aeda2ae243c6bfb35e47154d4c9a703e15beb673 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Thu, 17 Jul 2025 08:45:45 -0700
Subject: [PATCH v2 2/2] libpq: Drain all pending bytes from SSL/GSS during
pqReadData()
The previous commit strengthened a workaround for a hang when large
messages are split across TLS records/GSS tokens. Because that
workaround is implemented in libpq internals, it can only help us when
libpq itself is polling on the socket. In nonblocking situations, where
the client above libpq is expected to poll, the same bugs can show up.
As a contrived example, consider a large protocol-2.0 error coming back
from a server during PQconnectPoll(), split in an odd way across two
records:
-- TLS record (8192-byte payload) --
EEEE[...repeated a total of 8192 times]
-- TLS record (8193-byte payload) --
EEEE[...repeated a total of 8192 times]\0
The first record will fill the first half of the libpq receive buffer,
which is 16k long by default. The second record completely fills the
last half with its first 8192 bytes, leaving the terminating NULL in the
OpenSSL buffer. Since we still haven't seen the terminator at our level,
PQconnectPoll() will return PGRES_POLLING_READING, expecting to come
back when the server has sent "the rest" of the data. But there is
nothing left to read from the socket; OpenSSL had to pull all of the
data in the 8193-byte record off of the wire to decrypt it.
(A real server would probably not split up the records this way, nor
keep the connection open after sending a fatal connection error. But
servers that regularly use larger TLS records can get the libpq receive
buffer into the same state if DataRows are big enough, as reported on
the list.)
This is a layering violation. libpq makes decisions based on data in the
application buffer, above the transport buffer (whether SSL or GSS), but
clients are polling the socket, below the transport buffer. One way to
fix this in a backportable way, without changing APIs too much, is to
ensure data never stays in the transport buffer. Then pqReadData's
postconditions will look similar for both raw sockets and SSL/GSS: any
available data is either in the application buffer, or still on the
socket.
Building on the prior commit, add a function to the pqsecure API layer
which drains all pending data from the transport layer into
conn->inBuffer, expanding the buffer as necessary. pqReadData() calls
this function before returning when pending data exists. This is not
particularly efficient from an architectural perspective (the
pqsecure_read() implementations take care to fit their packets into the
current buffer, and that effort is now completely discarded), but it's
hopefully easier to reason about than a full rewrite would be for the
back branches.
Reported-by: Lars Kanis <lars@greiz-reinsdorf.de>
Discussion: https://postgr.es/m/2039ac58-d3e0-434b-ac1a-2a987f3b4cb1%40greiz-reinsdorf.de
Backpatch-through: 13
---
src/interfaces/libpq/fe-misc.c | 51 ++++++++++-
src/interfaces/libpq/fe-secure-gssapi.c | 27 ++++++
src/interfaces/libpq/fe-secure-openssl.c | 109 +++++++++++++++++++++++
src/interfaces/libpq/fe-secure.c | 97 +++++++++++++++++++-
src/interfaces/libpq/libpq-int.h | 7 ++
5 files changed, 288 insertions(+), 3 deletions(-)
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 434216ff89f..49cf32e1bb7 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -55,6 +55,7 @@ static int pqPutMsgBytes(const void *buf, size_t len, PGconn *conn);
static int pqSendSome(PGconn *conn, int len);
static int pqSocketCheck(PGconn *conn, int forRead, int forWrite,
pg_usec_time_t end_time);
+static int pqReadData_internal(PGconn *conn);
/*
* PQlibVersion: return the libpq version number
@@ -593,6 +594,13 @@ pqPutMsgEnd(PGconn *conn)
/* ----------
* pqReadData: read more data, if any is available
+ *
+ * Upon a successful return, callers may assume that either 1) all available
+ * bytes have been consumed from the socket, or 2) the socket is still marked
+ * readable by the OS. (In other words: after a successful pqReadData, it's safe
+ * to tell a client to poll for readable bytes on the socket without any further
+ * draining of the SSL/GSS transport buffers.)
+ *
* Possible return values:
* 1: successfully loaded at least one more byte
* 0: no data is presently available, but no error detected
@@ -605,8 +613,8 @@ pqPutMsgEnd(PGconn *conn)
int
pqReadData(PGconn *conn)
{
- int someread = 0;
- int nread;
+ int available;
+ bool pending;
if (conn->sock == PGINVALID_SOCKET)
{
@@ -614,6 +622,45 @@ pqReadData(PGconn *conn)
return -1;
}
+ available = pqReadData_internal(conn);
+ if (available < 0)
+ return -1;
+
+ /*
+ * Make sure there are no bytes stuck in layers between conn->inBuffer and
+ * the socket, to make it safe for clients to poll on PQsocket(). See
+ * pqsecure_drain_pending's documentation for details.
+ */
+ pending = pqsecure_read_is_pending(conn);
+
+ if (available && pending)
+ {
+ if (pqsecure_drain_pending(conn))
+ return -1;
+ }
+ else if (!available)
+ {
+ /*
+ * If we're not returning any bytes from the underlying transport,
+ * that must imply there aren't any in the transport buffer...
+ */
+ Assert(!pending);
+ }
+
+ return available;
+}
+
+/*
+ * Workhorse for pqReadData(). It's kept separate from the
+ * pqsecure_drain_pending() logic to avoid adding to this function's goto
+ * complexity.
+ */
+static int
+pqReadData_internal(PGconn *conn)
+{
+ int someread = 0;
+ int nread;
+
/* Left-justify any data in the buffer to make room */
if (conn->inStart < conn->inEnd)
{
diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
index 7c88e64cfd2..329c2559b88 100644
--- a/src/interfaces/libpq/fe-secure-gssapi.c
+++ b/src/interfaces/libpq/fe-secure-gssapi.c
@@ -475,6 +475,33 @@ pg_GSS_read_is_pending(PGconn *conn)
return PqGSSResultLength > PqGSSResultNext;
}
+int
+pg_GSS_drain_pending(PGconn *conn)
+{
+ int pending;
+
+ /* Figure out how many bytes to take off the connection. */
+ Assert(PqGSSResultLength >= PqGSSResultNext);
+ pending = PqGSSResultLength - PqGSSResultNext;
+
+ if (!pending)
+ {
+ /* Nothing to do. */
+ return 0;
+ }
+
+ /* Expand the input buffer if necessary. */
+ if (pqCheckInBufferSpace(conn->inEnd + (size_t) pending, conn))
+ return -1; /* errorMessage already set */
+
+ /* Now read the buffered data. */
+ memcpy(conn->inBuffer + conn->inEnd, PqGSSResultBuffer + PqGSSResultNext, pending);
+ conn->inEnd += pending;
+ PqGSSResultNext += pending;
+
+ return 0;
+}
+
/*
* Negotiate GSSAPI transport for a connection. When complete, returns
* PGRES_POLLING_OK. Will return PGRES_POLLING_READING or
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index 8f975561e51..b95e229bd08 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -236,6 +236,115 @@ pgtls_read_is_pending(PGconn *conn)
return SSL_pending(conn->ssl) > 0;
}
+/*
+ * Helper callback for use with ERR_print_errors_cb(). This appends the raw
+ * error queue to the provided PQExpBuffer, one entry per line.
+ *
+ * Note that this is not pretty output; it's meant for debugging.
+ */
+static int
+append_error_queue(const char *str, size_t len, void *u)
+{
+ PQExpBuffer buf = u;
+
+ appendBinaryPQExpBuffer(buf, str, len);
+ appendPQExpBufferChar(buf, '\n');
+
+ return 0;
+}
+
+int
+pgtls_drain_pending(PGconn *conn)
+{
+ int pending;
+ size_t drained;
+
+ /*
+ * OpenSSL readahead is documented to break SSL_pending(). Plus, we can't
+ * afford to have OpenSSL take bytes off the socket without processing
+ * them; that breaks the postconditions for pqsecure_drain_pending().
+ */
+ Assert(!SSL_get_read_ahead(conn->ssl));
+
+ /* Figure out how many bytes to take off the connection. */
+ pending = SSL_pending(conn->ssl);
+
+ if (!pending)
+ {
+ /* Nothing to do. */
+ return 0;
+ }
+ else if (pending < 0)
+ {
+ /* Shouldn't be possible, but don't let it mess up the math below. */
+ Assert(false);
+ libpq_append_conn_error(conn, "OpenSSL reports negative bytes pending");
+ return -1;
+ }
+ else if (pending == INT_MAX)
+ {
+ /*
+ * If we ever found a legitimate way to hit this, we'd need to loop
+ * around to call SSL_pending() again. Throw an error rather than
+ * complicate the code in that way, because SSL_read() should be
+ * bounded to the size of a single TLS record, and conn->inBuffer
+ * can't currently go past INT_MAX in size anyway.
+ */
+ libpq_append_conn_error(conn, "OpenSSL reports INT_MAX bytes pending");
+ return -1;
+ }
+
+ /* Expand the input buffer if necessary. */
+ if (pqCheckInBufferSpace(conn->inEnd + (size_t) pending, conn))
+ return -1; /* errorMessage already set */
+
+ /*
+ * Now read the buffered data.
+ *
+ * Don't defer to pgtls_read(); OpenSSL should guarantee that pending data
+ * comes off in a single call, and we don't want to use the more
+ * complicated read-loop behavior. We still have to manage the error
+ * queue.
+ */
+ ERR_clear_error();
+ if (!SSL_read_ex(conn->ssl, conn->inBuffer + conn->inEnd, pending, &drained))
+ {
+ int err = SSL_get_error(conn->ssl, 0);
+
+ /*
+ * Something is very wrong. Report the error code and the entirety of
+ * the error queue without any attempt at interpretation. Probably not
+ * worth complicating things for the sake of translation, either.
+ */
+ appendPQExpBuffer(&conn->errorMessage,
+ "unexpected error code %d while draining SSL buffer; ",
+ err);
+
+ if (ERR_peek_error())
+ {
+ appendPQExpBufferStr(&conn->errorMessage, "error queue follows:\n");
+ ERR_print_errors_cb(append_error_queue, &conn->errorMessage);
+ }
+ else
+ appendPQExpBufferStr(&conn->errorMessage,
+ "no error queue provided\n");
+
+ return -1;
+ }
+
+ /* Final consistency check. */
+ if (drained != pending)
+ {
+ libpq_append_conn_error(conn,
+ "drained only %zu of %d pending bytes in SSL buffer",
+ drained, pending);
+ return -1;
+ }
+
+ conn->inEnd += pending;
+ return 0;
+}
+
ssize_t
pgtls_write(PGconn *conn, const void *ptr, size_t len)
{
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index 94c97ec26fb..82ae07be674 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -244,7 +244,9 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
}
/*
- * Returns true if there are any bytes available in the transport buffer.
+ * Returns true if there are any bytes available in the transport buffer. See
+ * pqsecure_drain_pending() for a more complete discussion of the concepts
+ * involved.
*/
bool
pqsecure_read_is_pending(PGconn *conn)
@@ -262,6 +264,99 @@ pqsecure_read_is_pending(PGconn *conn)
return 0;
}
+/*---
+ * Drains any transport data that is already buffered in userspace and adds it
+ * to conn->inBuffer, enlarging inBuffer if necessary. The drain fails if
+ * inBuffer cannot be made to hold all available transport data.
+ *
+ * Implementations should not attempt to read any more data from the socket
+ * while draining the transport buffer. After a successful return,
+ * pqsecure_read_is_pending() must return false.
+ *
+ * This operation is necessary to prevent deadlock, due to a layering violation
+ * designed into our asynchronous client API: pqReadData() and all the parsing
+ * routines above it receive data from the SSL/GSS transport buffer, but clients
+ * poll on the raw PQsocket() handle. So data can be "lost" in the intermediate
+ * layer if we don't take it out here.
+ *
+ * To illustrate what we're trying to prevent, say that the server is sending
+ * two messages at once in response to a query (Aaaa and Bb), the libpq buffer
+ * is five characters in size, and TLS records max out at three-character
+ * payloads.
+ *
+ * Client libpq SSL Socket
+ * | | | |
+ * | [ ] [ ] [ ] [1] Buffers are empty, client is
+ * x --------------------------> | polling on socket
+ * | | | |
+ * | [ ] [ ] [xxx] [2] First record is received; poll
+ * | <-------------------------- | signals read-ready
+ * | | | |
+ * x ---> [ ] [ ] [xxx] [3] Client calls PQconsumeInput()
+ * | | | |
+ * | [ ] -> [ ] [xxx] [4] libpq calls pqReadData() to fill
+ * | | | | the receive buffer
+ * | [ ] [Aaa] <-- [ ] [5] SSL pulls payload off the wire
+ * | | | | and decrypts it
+ * | [Aaa ] <- [ ] [ ] [6] pqsecure_read() takes all data
+ * | | | |
+ * | <--- [Aaa ] [ ] [ ] [7] PQconsumeInput() returns with a
+ * x --------------------------> | partial message, PQisBusy() is
+ * | | | | still true, client polls again
+ * | [Aaa ] [ ] [xxx] [8] Second record is received; poll
+ * | <-------------------------- | signals read-ready
+ * | | | |
+ * x ---> [Aaa ] [ ] [xxx] [9] Client calls PQconsumeInput()
+ * | | | |
+ * | [Aaa ] -> [ ] [xxx] [10] libpq calls pqReadData() to fill
+ * | | | | the receive buffer
+ * | [Aaa ] [aBb] <-- [ ] [11] SSL decrypts
+ * | | | |
+ * | [AaaaB] <- [b ] [ ] [12] pqsecure_read() fills its
+ * | | | | buffer, taking only two bytes
+ * | <--- [AaaaB] [b ] [ ] [13] PQconsumeInput() returns with a
+ * | | | | complete message buffered;
+ * | | | | PQisBusy() is false
+ * x ---> [AaaaB] [b ] [ ] [14] Client calls PQgetResult()
+ * | | | |
+ * | <--- [B ] [b ] [ ] [15] Aaaa is returned; PQisBusy() is
+ * x --------------------------> | true and client polls again
+ * . | | .
+ * . [B ] [b ] . [16] No packets, and client hangs.
+ * . | | .
+ *
+ */
+int
+pqsecure_drain_pending(PGconn *conn)
+{
+ int ret;
+
+#ifdef USE_SSL
+ if (conn->ssl_in_use)
+ {
+ ret = pgtls_drain_pending(conn);
+ }
+ else
+#endif
+#ifdef ENABLE_GSS
+ if (conn->gssenc)
+ {
+ ret = pg_GSS_drain_pending(conn);
+ }
+ else
+#endif
+ {
+ /* Plaintext connections have no transport buffer. */
+ ret = 0;
+ }
+
+ /* Keep the implementation honest. */
+ if (ret == 0)
+ Assert(!pqsecure_read_is_pending(conn));
+
+ return ret;
+}
+
/*
* Write data to a secure connection.
*
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index c38c1ea0086..a0690a96a32 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -811,6 +811,7 @@ extern PostgresPollingStatusType pqsecure_open_client(PGconn *);
extern void pqsecure_close(PGconn *);
extern ssize_t pqsecure_read(PGconn *, void *ptr, size_t len);
extern bool pqsecure_read_is_pending(PGconn *);
+extern int pqsecure_drain_pending(PGconn *);
extern ssize_t pqsecure_write(PGconn *, const void *ptr, size_t len);
extern ssize_t pqsecure_raw_read(PGconn *, void *ptr, size_t len);
extern ssize_t pqsecure_raw_write(PGconn *, const void *ptr, size_t len);
@@ -851,6 +852,11 @@ extern ssize_t pgtls_read(PGconn *conn, void *ptr, size_t len);
*/
extern bool pgtls_read_is_pending(PGconn *conn);
+/*
+ * Reads any data waiting in the SSL read buffer into the connection buffer.
+ */
+extern int pgtls_drain_pending(PGconn *conn);
+
/*
* Write data to a secure connection.
*
@@ -898,6 +904,7 @@ extern PostgresPollingStatusType pqsecure_open_gss(PGconn *conn);
extern ssize_t pg_GSS_write(PGconn *conn, const void *ptr, size_t len);
extern ssize_t pg_GSS_read(PGconn *conn, void *ptr, size_t len);
extern bool pg_GSS_read_is_pending(PGconn *conn);
+extern int pg_GSS_drain_pending(PGconn *conn);
#endif
/* === in fe-trace.c === */
--
2.34.1
On 01/08/2025 00:48, Jacob Champion wrote:
On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
The attached still needs some documentation work
v2 does a bunch of commit message work, but I imagine it needs a good
bit of copy-editing for clarity.
I'm not in any hurry to smash this in. I think we still need
- independent verification of the architectural issue, to make sure
it's not any deeper or shallower than pqReadData()
- independent verification that this fixes the bugs that have been described
- measurement of the performance characteristics of the new code
- verification of the maximum amount of additional buffer memory that
can be consumed during the drain
- consensus that we want to maintain this new behavior
- discussion of what we want this code to look like going forward
Andres, does this patch help clarify my thoughts upthread? Ideally the
additional code isn't getting in the way of any future
rearchitectures, since it only pins the new requirement in the code
that needs it.
A customer just ran into this issue, and it took the team and me a few
days to debug until I remembered this thread. We're running PostgreSQL
with no changes to the networking parts, but there's a proxy in between
that decrypts and re-encrypts the TLS traffic. So I'm now motivated to
get this fixed :-).
I'll start reviewing the patch, but in the meantime, Jacob, could you
share the reproducer and any other testing scripts you have that might
be useful here?
- Heikki
On 01/08/2025 00:48, Jacob Champion wrote:
On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
The attached still needs some documentation work
v2 does a bunch of commit message work, but I imagine it needs a good
bit of copy-editing for clarity.
I'm not in any hurry to smash this in. I think we still need
- independent verification of the architectural issue, to make sure
it's not any deeper or shallower than pqReadData()
- independent verification that this fixes the bugs that have been described
- measurement of the performance characteristics of the new code
- verification of the maximum amount of additional buffer memory that
can be consumed during the drain
- consensus that we want to maintain this new behavior
- discussion of what we want this code to look like going forward
I agree that making pqReadData() drain the transport buffer is the right
approach. I'm not sure if this can ever work with the OpenSSL readahead
that was discussed, but we don't need to solve that now.
Once we make pqReadData() always drain the buffer, does
pqSocketCheck() still need to check for pending bytes? It seems harmless
to do it in any case, though, so probably best to keep it.
gss_read() calls pqReadReady() which now calls pqsecure_bytes_pending(),
which now also checks for bytes pending in the GSS buffer. Is that a
good thing or a bad thing or does it not matter?
In pqReadData we have this:
/*
* A return value of 0 could mean just that no data is now available, or
* it could mean EOF --- that is, the server has closed the connection.
* Since we have the socket in nonblock mode, the only way to tell the
* difference is to see if select() is saying that the file is ready.
* Grumble. Fortunately, we don't expect this path to be taken much,
* since in normal practice we should not be trying to read data unless
* the file selected for reading already.
*
* In SSL mode it's even worse: SSL_read() could say WANT_READ and then
* data could arrive before we make the pqReadReady() test, but the second
* SSL_read() could still say WANT_READ because the data received was not
* a complete SSL record. So we must play dumb and assume there is more
* data, relying on the SSL layer to detect true EOF.
*/#ifdef USE_SSL
if (conn->ssl_in_use)
return 0;
#endif
Should we do the same for GSS as we do for SSL here?
Apart from the above-mentioned things, the patch looks bug-free to me.
However, it feels like a layering violation. The *_drain_pending()
functions should not write directly to conn->inBuffer or expand the buffer.
To avoid that, the attached moves the buffer-expansion logic
to the caller. The pqsecure API now has this:
/*
 * Return the number of bytes available in the transport buffer.
 *
 * If pqsecure_read() is called for this number of bytes, it's guaranteed to
 * return successfully without reading from the underlying socket. See
 * pqDrainPending() for a more complete discussion of the concepts involved.
 */
ssize_t pqsecure_bytes_pending(PGconn *conn);
pqReadData(), or its pqDrainPending() subroutine to be precise, uses
that and pqsecure_read() to drain the buffer. This is less code because
it reuses pqsecure_read(), and doesn't require the TLS/GSS
implementation to reach into conn->inBuffer.
This does add to pqsecure_read() the assumption that's mentioned in the
comment above, though. If we want to avoid that assumption, we could add
another "pqsecure_read_pending()" function that's just like
pqsecure_read(), except that it would only read pending bytes and never
from the socket.
The attached is completely untested; I just wanted to show the internal design
for now.
- Heikki
Attachments:
v3-0001-libpq-Extend-read-pending-check-from-SSL-to-GSS.patchtext/x-patch; charset=UTF-8; name=v3-0001-libpq-Extend-read-pending-check-from-SSL-to-GSS.patchDownload
From 9956f7448719c62614c2d459f5de8e1f9b9dc3f7 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Fri, 18 Jul 2025 10:06:13 -0700
Subject: [PATCH v3 1/2] libpq: Extend "read pending" check from SSL to GSS
An extra check for pending bytes in the SSL layer has been part of
pqReadReady() for a very long time (79ff2e96d). But when GSS transport
encryption was added, it didn't receive the same treatment. (As
79ff2e96d notes, "The bug that I fixed in this patch is exceptionally
hard to reproduce reliably.")
Without that check, it's possible to hit a hang in gssencmode, if the
server splits a large libpq message such that the final message in a
streamed response is part of the same wrapped token as the split
message:
DataRowDataRowDataRowDataRowDataRowData
-- token boundary --
RowDataRowCommandCompleteReadyForQuery
If the split message takes up enough memory to nearly fill libpq's
receive buffer, libpq may return from pqReadData() before the later
messages are pulled out of the PqGSSRecvBuffer. Without additional
socket activity from the server, pqReadReady() (via pqSocketCheck())
will never again return true, hanging the connection.
Pull the pending-bytes check into the pqsecure API layer, where both SSL
and GSS now implement it.
Note that this does not fix the root problem! Third party clients of
libpq have no way to call pqsecure_read_is_pending() in their own
polling. This just brings the GSS implementation up to par with the
existing SSL workaround; a broader fix is left to a subsequent commit.
(However, pgtls_read_pending() is renamed to pgtls_read_is_pending(), to
avoid conflation with the forthcoming pgtls_drain_pending().)
Discussion: https://postgr.es/m/CAOYmi%2BmpymrgZ76Jre2dx_PwRniS9YZojwH0rZnTuiGHCsj0rA%40mail.gmail.com
---
src/interfaces/libpq/fe-misc.c | 6 ++----
src/interfaces/libpq/fe-secure-gssapi.c | 6 ++++++
src/interfaces/libpq/fe-secure-openssl.c | 2 +-
src/interfaces/libpq/fe-secure.c | 19 +++++++++++++++++++
src/interfaces/libpq/libpq-int.h | 4 +++-
5 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index dca44fdc5d2..434216ff89f 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -1099,14 +1099,12 @@ pqSocketCheck(PGconn *conn, int forRead, int forWrite, pg_usec_time_t end_time)
return -1;
}
-#ifdef USE_SSL
- /* Check for SSL library buffering read bytes */
- if (forRead && conn->ssl_in_use && pgtls_read_pending(conn))
+ /* Check for SSL/GSS library buffering read bytes */
+ if (forRead && pqsecure_read_is_pending(conn))
{
/* short-circuit the select */
return 1;
}
-#endif
}
/* We will retry as long as we get EINTR */
diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
index 843b31e175f..5067b3de9f4 100644
--- a/src/interfaces/libpq/fe-secure-gssapi.c
+++ b/src/interfaces/libpq/fe-secure-gssapi.c
@@ -471,6 +471,12 @@ gss_read(PGconn *conn, void *recv_buffer, size_t length, ssize_t *ret)
return PGRES_POLLING_OK;
}
+bool
+pg_GSS_read_is_pending(PGconn *conn)
+{
+ return PqGSSResultLength > PqGSSResultNext;
+}
+
/*
* Negotiate GSSAPI transport for a connection. When complete, returns
* PGRES_POLLING_OK. Will return PGRES_POLLING_READING or
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index d2045c73ae6..69256c91cdb 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -231,7 +231,7 @@ rloop:
}
bool
-pgtls_read_pending(PGconn *conn)
+pgtls_read_is_pending(PGconn *conn)
{
return SSL_pending(conn->ssl) > 0;
}
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index e686681ba15..94c97ec26fb 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -243,6 +243,25 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
return n;
}
+/*
+ * Returns true if there are any bytes available in the transport buffer.
+ */
+bool
+pqsecure_read_is_pending(PGconn *conn)
+{
+#ifdef USE_SSL
+ if (conn->ssl_in_use)
+ return pgtls_read_is_pending(conn);
+#endif
+#ifdef ENABLE_GSS
+ if (conn->gssenc)
+ return pg_GSS_read_is_pending(conn);
+#endif
+
+ /* Plaintext connections have no transport buffer. */
+ return 0;
+}
+
/*
* Write data to a secure connection.
*
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index 0f3661b8889..e8a1f248805 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -819,6 +819,7 @@ extern int pqWriteReady(PGconn *conn);
extern PostgresPollingStatusType pqsecure_open_client(PGconn *);
extern void pqsecure_close(PGconn *);
extern ssize_t pqsecure_read(PGconn *, void *ptr, size_t len);
+extern bool pqsecure_read_is_pending(PGconn *);
extern ssize_t pqsecure_write(PGconn *, const void *ptr, size_t len);
extern ssize_t pqsecure_raw_read(PGconn *, void *ptr, size_t len);
extern ssize_t pqsecure_raw_write(PGconn *, const void *ptr, size_t len);
@@ -857,7 +858,7 @@ extern ssize_t pgtls_read(PGconn *conn, void *ptr, size_t len);
/*
* Is there unread data waiting in the SSL read buffer?
*/
-extern bool pgtls_read_pending(PGconn *conn);
+extern bool pgtls_read_is_pending(PGconn *conn);
/*
* Write data to a secure connection.
@@ -905,6 +906,7 @@ extern PostgresPollingStatusType pqsecure_open_gss(PGconn *conn);
*/
extern ssize_t pg_GSS_write(PGconn *conn, const void *ptr, size_t len);
extern ssize_t pg_GSS_read(PGconn *conn, void *ptr, size_t len);
+extern bool pg_GSS_read_is_pending(PGconn *conn);
#endif
/* === in fe-trace.c === */
--
2.47.3
v3-0002-libpq-Drain-all-pending-bytes-from-SSL-GSS-during.patchtext/x-patch; charset=UTF-8; name=v3-0002-libpq-Drain-all-pending-bytes-from-SSL-GSS-during.patchDownload
From 11a8f0a2c91b9ab81df90c6d306896f2269d82b5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 5 Dec 2025 15:11:49 +0200
Subject: [PATCH v3 2/2] libpq: Drain all pending bytes from SSL/GSS during
pqReadData()
The previous commit strengthened a workaround for a hang when large
messages are split across TLS records/GSS tokens. Because that
workaround is implemented in libpq internals, it can only help us when
libpq itself is polling on the socket. In nonblocking situations, where
the client above libpq is expected to poll, the same bugs can show up.
As a contrived example, consider a large protocol-2.0 error coming back
from a server during PQconnectPoll(), split in an odd way across two
records:
-- TLS record (8192-byte payload) --
EEEE[...repeated a total of 8192 times]
-- TLS record (8193-byte payload) --
EEEE[...repeated a total of 8192 times]\0
The first record will fill the first half of the libpq receive buffer,
which is 16k long by default. The second record completely fills the
last half with its first 8192 bytes, leaving the terminating NULL in the
OpenSSL buffer. Since we still haven't seen the terminator at our level,
PQconnectPoll() will return PGRES_POLLING_READING, expecting to come
back when the server has sent "the rest" of the data. But there is
nothing left to read from the socket; OpenSSL had to pull all of the
data in the 8193-byte record off of the wire to decrypt it.
(A real server would probably not split up the records this way, nor
keep the connection open after sending a fatal connection error. But
servers that regularly use larger TLS records can get the libpq receive
buffer into the same state if DataRows are big enough, as reported on
the list.)
This is a layering violation. libpq makes decisions based on data in the
application buffer, above the transport buffer (whether SSL or GSS), but
clients are polling the socket, below the transport buffer. One way to
fix this in a backportable way, without changing APIs too much, is to
ensure data never stays in the transport buffer. Then pqReadData's
postconditions will look similar for both raw sockets and SSL/GSS: any
available data is either in the application buffer, or still on the
socket.
Building on the prior commit, make pqReadData() drain all pending
data from the transport layer into conn->inBuffer, expanding the
buffer as necessary. This is not particularly efficient from an
architectural perspective (the pqsecure_read() implementations take
care to fit their packets into the current buffer, and that effort is
now completely discarded), but it's hopefully easier to reason about
than a full rewrite would be for the back branches.
Author: Jacob Champion <jacob.champion@enterprisedb.com>
Reported-by: Lars Kanis <lars@greiz-reinsdorf.de>
Discussion: https://postgr.es/m/2039ac58-d3e0-434b-ac1a-2a987f3b4cb1%40greiz-reinsdorf.de
Backpatch-through: 14
---
src/interfaces/libpq/fe-misc.c | 140 ++++++++++++++++++++++-
src/interfaces/libpq/fe-secure-gssapi.c | 7 +-
src/interfaces/libpq/fe-secure-openssl.c | 38 +++++-
src/interfaces/libpq/fe-secure.c | 14 ++-
src/interfaces/libpq/libpq-int.h | 8 +-
5 files changed, 189 insertions(+), 18 deletions(-)
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 434216ff89f..dd555799ae3 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -55,6 +55,8 @@ static int pqPutMsgBytes(const void *buf, size_t len, PGconn *conn);
static int pqSendSome(PGconn *conn, int len);
static int pqSocketCheck(PGconn *conn, int forRead, int forWrite,
pg_usec_time_t end_time);
+static int pqReadData_internal(PGconn *conn);
+static int pqDrainPending(PGconn *conn);
/*
* PQlibVersion: return the libpq version number
@@ -593,6 +595,13 @@ pqPutMsgEnd(PGconn *conn)
/* ----------
* pqReadData: read more data, if any is available
+ *
+ * Upon a successful return, callers may assume that either 1) all available
+ * bytes have been consumed from the socket, or 2) the socket is still marked
+ * readable by the OS. (In other words: after a successful pqReadData, it's safe
+ * to tell a client to poll for readable bytes on the socket without any further
+ * draining of the SSL/GSS transport buffers.)
+ *
* Possible return values:
* 1: successfully loaded at least one more byte
* 0: no data is presently available, but no error detected
@@ -605,8 +614,7 @@ pqPutMsgEnd(PGconn *conn)
int
pqReadData(PGconn *conn)
{
- int someread = 0;
- int nread;
+ int available;
if (conn->sock == PGINVALID_SOCKET)
{
@@ -614,6 +622,40 @@ pqReadData(PGconn *conn)
return -1;
}
+ available = pqReadData_internal(conn);
+ if (available < 0)
+ return -1;
+ else if (available > 0)
+ {
+ /*
+ * Make sure there are no bytes stuck in layers between conn->inBuffer
+ * and the socket, to make it safe for clients to poll on PQsocket().
+ */
+ if (pqDrainPending(conn))
+ return -1;
+ }
+ else
+ {
+ /*
+ * If we're not returning any bytes from the underlying transport,
+ * that must imply there aren't any in the transport buffer...
+ */
+ Assert(pqsecure_bytes_pending(conn) == 0);
+ }
+
+ return available;
+}
+
+/*
+ * Workhorse for pqReadData(). It's kept separate from the pqDrainPending()
+ * logic to avoid adding to this function's goto complexity.
+ */
+static int
+pqReadData_internal(PGconn *conn)
+{
+ int someread = 0;
+ int nread;
+
/* Left-justify any data in the buffer to make room */
if (conn->inStart < conn->inEnd)
{
@@ -724,6 +766,8 @@ retry3:
* SSL_read() could still say WANT_READ because the data received was not
* a complete SSL record. So we must play dumb and assume there is more
* data, relying on the SSL layer to detect true EOF.
+ *
+ * XXX: do we have the same issue with GSS encryption?
*/
#ifdef USE_SSL
@@ -800,6 +844,96 @@ definitelyFailed:
return -1;
}
+/*---
+ * Drains any transport data that is already buffered in userspace and adds it
+ * to conn->inBuffer, enlarging inBuffer if necessary. The drain fails if
+ * inBuffer cannot be made to hold all available transport data.
+ *
+ * We assume that the underlying secure transport implementation does not
+ * attempt to read any more data from the socket while draining the transport
+ * buffer. After a successful return, pqsecure_bytes_pending() must be zero.
+ *
+ * This operation is necessary to prevent deadlock, due to a layering violation
+ * designed into our asynchronous client API: pqReadData() and all the parsing
+ * routines above it receive data from the SSL/GSS transport buffer, but clients
+ * poll on the raw PQsocket() handle. So data can be "lost" in the intermediate
+ * layer if we don't take it out here.
+ *
+ * To illustrate what we're trying to prevent, say that the server is sending
+ * two messages at once in response to a query (Aaaa and Bb), the libpq buffer
+ * is five characters in size, and TLS records max out at three-character
+ * payloads.
+ *
+ * Client libpq SSL Socket
+ * | | | |
+ * | [ ] [ ] [ ] [1] Buffers are empty, client is
+ * x --------------------------> | polling on socket
+ * | | | |
+ * | [ ] [ ] [xxx] [2] First record is received; poll
+ * | <-------------------------- | signals read-ready
+ * | | | |
+ * x ---> [ ] [ ] [xxx] [3] Client calls PQconsumeInput()
+ * | | | |
+ * | [ ] -> [ ] [xxx] [4] libpq calls pqReadData() to fill
+ * | | | | the receive buffer
+ * | [ ] [Aaa] <-- [ ] [5] SSL pulls payload off the wire
+ * | | | | and decrypts it
+ * | [Aaa ] <- [ ] [ ] [6] pqsecure_read() takes all data
+ * | | | |
+ * | <--- [Aaa ] [ ] [ ] [7] PQconsumeInput() returns with a
+ * x --------------------------> | partial message, PQisBusy() is
+ * | | | | still true, client polls again
+ * | [Aaa ] [ ] [xxx] [8] Second record is received; poll
+ * | <-------------------------- | signals read-ready
+ * | | | |
+ * x ---> [Aaa ] [ ] [xxx] [9] Client calls PQconsumeInput()
+ * | | | |
+ * | [Aaa ] -> [ ] [xxx] [10] libpq calls pqReadData() to fill
+ * | | | | the receive buffer
+ * | [Aaa ] [aBb] <-- [ ] [11] SSL decrypts
+ * | | | |
+ * | [AaaaB] <- [b ] [ ] [12] pqsecure_read() fills its
+ * | | | | buffer, taking only two bytes
+ * | <--- [AaaaB] [b ] [ ] [13] PQconsumeInput() returns with a
+ * | | | | complete message buffered;
+ * | | | | PQisBusy() is false
+ * x ---> [AaaaB] [b ] [ ] [14] Client calls PQgetResult()
+ * | | | |
+ * | <--- [B ] [b ] [ ] [15] Aaaa is returned; PQisBusy() is
+ * x --------------------------> | true and client polls again
+ * . | | .
+ * . [B ] [b ] . [16] No packets, and client hangs.
+ * . | | .
+ *
+ */
+static int
+pqDrainPending(PGconn *conn)
+{
+ ssize_t bytes_pending;
+ ssize_t nread;
+
+ bytes_pending = pqsecure_bytes_pending(conn);
+ if (bytes_pending <= 0)
+ return bytes_pending;
+
+ /* Expand the input buffer if necessary. */
+ if (pqCheckInBufferSpace(conn->inEnd + (size_t) bytes_pending, conn))
+ return -1; /* errorMessage already set */
+
+ nread = pqsecure_read(conn, conn->inBuffer + conn->inEnd,
+ bytes_pending);
+
+ /* When there are bytes pending, the read function is not supposed to fail */
+ if (nread != bytes_pending)
+ {
+ libpq_append_conn_error(conn,
+ "drained only %zu of %zd pending bytes in transport buffer",
+ nread, bytes_pending);
+ return -1;
+ }
+ return 0;
+}
+
/*
* pqSendSome: send data waiting in the output buffer.
*
@@ -1100,7 +1234,7 @@ pqSocketCheck(PGconn *conn, int forRead, int forWrite, pg_usec_time_t end_time)
}
/* Check for SSL/GSS library buffering read bytes */
- if (forRead && pqsecure_read_is_pending(conn))
+ if (forRead && pqsecure_bytes_pending(conn) != 0)
{
/* short-circuit the select */
return 1;
diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
index 5067b3de9f4..05abdbffcc6 100644
--- a/src/interfaces/libpq/fe-secure-gssapi.c
+++ b/src/interfaces/libpq/fe-secure-gssapi.c
@@ -471,10 +471,11 @@ gss_read(PGconn *conn, void *recv_buffer, size_t length, ssize_t *ret)
return PGRES_POLLING_OK;
}
-bool
-pg_GSS_read_is_pending(PGconn *conn)
+ssize_t
+pg_GSS_bytes_pending(PGconn *conn)
{
- return PqGSSResultLength > PqGSSResultNext;
+ Assert(PqGSSResultLength >= PqGSSResultNext);
+ return (ssize_t) (PqGSSResultLength - PqGSSResultNext);
}
/*
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index 69256c91cdb..5f652486a9c 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -230,10 +230,42 @@ rloop:
return n;
}
-bool
-pgtls_read_is_pending(PGconn *conn)
+ssize_t
+pgtls_bytes_pending(PGconn *conn)
{
- return SSL_pending(conn->ssl) > 0;
+ int pending;
+
+ /*
+ * OpenSSL readahead is documented to break SSL_pending(). Plus, we can't
+ * afford to have OpenSSL take bytes off the socket without processing
+ * them; that breaks the postconditions for pqDrainPending().
+ */
+ Assert(!SSL_get_read_ahead(conn->ssl));
+
+ /* Figure out how many bytes to take off the connection. */
+ pending = SSL_pending(conn->ssl);
+
+ if (pending < 0)
+ {
+ /* shouldn't be possible */
+ Assert(false);
+ libpq_append_conn_error(conn, "OpenSSL reports negative bytes pending");
+ return -1;
+ }
+ else if (pending == INT_MAX)
+ {
+ /*
+ * If we ever found a legitimate way to hit this, we'd need to loop
+ * around in the caller to call pgtls_bytes_pending() again. Throw an
+ * error rather than complicate the code in that way, because
+ * SSL_read() should be bounded to the size of a single TLS record,
+ * and conn->inBuffer can't currently go past INT_MAX in size anyway.
+ */
+ libpq_append_conn_error(conn, "OpenSSL reports INT_MAX bytes pending");
+ return -1;
+ }
+
+ return (ssize_t) pending;
}
ssize_t
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index 94c97ec26fb..8ee4969d820 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -244,18 +244,22 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
}
/*
- * Returns true if there are any bytes available in the transport buffer.
+ * Return the number of bytes available in the transport buffer.
+ *
+ * If pqsecure_read() is called for this number of bytes, it's guaranteed to
+ * return successfully without reading from the underlying socket. See
+ * pqDrainPending() for a more complete discussion of the concepts involved.
*/
-bool
-pqsecure_read_is_pending(PGconn *conn)
+ssize_t
+pqsecure_bytes_pending(PGconn *conn)
{
#ifdef USE_SSL
if (conn->ssl_in_use)
- return pgtls_read_is_pending(conn);
+ return pgtls_bytes_pending(conn);
#endif
#ifdef ENABLE_GSS
if (conn->gssenc)
- return pg_GSS_read_is_pending(conn);
+ return pg_GSS_bytes_pending(conn);
#endif
/* Plaintext connections have no transport buffer. */
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index e8a1f248805..37fe77e9ca7 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -819,7 +819,7 @@ extern int pqWriteReady(PGconn *conn);
extern PostgresPollingStatusType pqsecure_open_client(PGconn *);
extern void pqsecure_close(PGconn *);
extern ssize_t pqsecure_read(PGconn *, void *ptr, size_t len);
-extern bool pqsecure_read_is_pending(PGconn *);
+extern ssize_t pqsecure_bytes_pending(PGconn *);
extern ssize_t pqsecure_write(PGconn *, const void *ptr, size_t len);
extern ssize_t pqsecure_raw_read(PGconn *, void *ptr, size_t len);
extern ssize_t pqsecure_raw_write(PGconn *, const void *ptr, size_t len);
@@ -856,9 +856,9 @@ extern void pgtls_close(PGconn *conn);
extern ssize_t pgtls_read(PGconn *conn, void *ptr, size_t len);
/*
- * Is there unread data waiting in the SSL read buffer?
+ * Return the number of bytes available in the transport buffer.
*/
-extern bool pgtls_read_is_pending(PGconn *conn);
+extern ssize_t pgtls_bytes_pending(PGconn *conn);
/*
* Write data to a secure connection.
@@ -906,7 +906,7 @@ extern PostgresPollingStatusType pqsecure_open_gss(PGconn *conn);
*/
extern ssize_t pg_GSS_write(PGconn *conn, const void *ptr, size_t len);
extern ssize_t pg_GSS_read(PGconn *conn, void *ptr, size_t len);
-extern bool pg_GSS_read_is_pending(PGconn *conn);
+extern ssize_t pg_GSS_bytes_pending(PGconn *conn);
#endif
/* === in fe-trace.c === */
--
2.47.3