libpq: Process buffered SSL read bytes to support records >8kB on async API

Started by Lars Kanis · over 1 year ago · 25 messages
#1Lars Kanis
lars@greiz-reinsdorf.de

Dear hackers,

I'm the maintainer of ruby-pg, the Ruby interface to the PostgreSQL
database. This binding uses the asynchronous API of libpq by default to
integrate with Ruby's IO wait and scheduling mechanisms.

This works well with the vanilla PostgreSQL server, but it leads to
starvation with other types of servers using the PostgreSQL wire
protocol 3. This is because the libpq async interface currently
depends on SSL records being at most 8kB.

The following servers were reported to starve with ruby-pg:

* AWS RDS Aurora Serverless [1]
* YugabyteDB [2]
* CockroachDB [3]

They block indefinitely on certain message sizes sent from the backend to
the libpq frontend. It is best described in [4]. A repro Docker
composition is provided by YugabyteDB at [2].

To fix this issue the attached patch calls pqReadData() repeatedly in
PQconsumeInput() until there is no buffered SSL data left to be read.
Another solution could be to process buffered SSL read bytes in
PQisBusy() instead of PQconsumeInput().

The synchronous libpq API isn't affected, since it supports arbitrary
SSL record sizes already. That's why I think that the asynchronous API
should also support bigger SSL record sizes.

Regards, Lars

[1]: https://github.com/ged/ruby-pg/issues/325
[2]: https://github.com/ged/ruby-pg/issues/588
[3]: https://github.com/ged/ruby-pg/issues/583
[4]: https://github.com/ged/ruby-pg/issues/325#issuecomment-737561270

Attachments:

0001-libpq-Process-buffered-SSL-read-bytes-to-support-rec.patch (text/x-patch, +13 -1)
#2Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Lars Kanis (#1)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Sun, Sep 8, 2024 at 1:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:

I'm the maintainer of ruby-pg, the Ruby interface to the PostgreSQL
database. This binding uses the asynchronous API of libpq by default to
integrate with Ruby's IO wait and scheduling mechanisms.

This works well with the vanilla PostgreSQL server, but it leads to
starvation with other types of servers using the PostgreSQL wire
protocol 3. This is because the libpq async interface currently
depends on SSL records being at most 8kB.

Thanks for the report! I wanted evidence that this wasn't a
ruby-pg-specific problem, so I set up a test case with
Python/psycopg2.

I was able to reproduce a hang when all of the following were true:
- psycopg2's async mode was enabled
- the client performs a PQconsumeInput/PQisBusy loop, waiting on
socket read events when the connection is busy (I used
psycopg2.extras.wait_select() for this)
- the server splits a large message over many large TLS records
- the server packs the final ReadyForQuery message into the same
record as the split message's final fragment

Gory details of the packet sizes, if it's helpful:
- max TLS record size is 12k, because it made the math easier
- server sends DataRow of 32006 bytes, followed by DataRow of 806
bytes, followed by CommandComplete/ReadyForQuery
- so there are three TLS records on the wire containing
  1) DataRow 1 fragment 1 (12k bytes)
  2) DataRow 1 fragment 2 (12k bytes)
  3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
       + CommandComplete + ReadyForQuery

To fix this issue the attached patch calls pqReadData() repeatedly in
PQconsumeInput() until there is no buffered SSL data left to be read.
Another solution could be to process buffered SSL read bytes in
PQisBusy() instead of PQconsumeInput().

I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?

I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?

https://commitfest.postgresql.org/50/

Thanks,
--Jacob

#3Lars Kanis
lars@greiz-reinsdorf.de
In reply to: Jacob Champion (#2)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

Thank you Jacob for verifying this issue!

Gory details of the packet sizes, if it's helpful:
- max TLS record size is 12k, because it made the math easier
- server sends DataRow of 32006 bytes, followed by DataRow of 806
bytes, followed by CommandComplete/ReadyForQuery
- so there are three TLS records on the wire containing
  1) DataRow 1 fragment 1 (12k bytes)
  2) DataRow 1 fragment 2 (12k bytes)
  3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
       + CommandComplete + ReadyForQuery

How did you verify the issue on the server side - with YugabyteDB or
with a modified Postgres server? I'd like to verify the GSSAPI part and
I'm familiar with the Postgres server only.

I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?

Which other sites do you mean? The synchronous transfer already works,
since the select() is short-circuited in case of pending bytes: [1]

I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?

No problem; registered: https://commitfest.postgresql.org/50/5251/

--

Regards, Lars

[1]: https://github.com/postgres/postgres/blob/77761ee5dddc0518235a51c533893e81e5f375b9/src/interfaces/libpq/fe-misc.c#L1070

#4Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Lars Kanis (#3)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:

How did you verify the issue on the server side - with YugabyteDB or
with a modified Postgres server? I'd like to verify the GSSAPI part and
I'm familiar with the Postgres server only.

Neither, unfortunately -- I have a protocol testbed that I use for
this kind of stuff. I'm happy to share once I get it cleaned up, but
it's unlikely to help you in this case since I haven't implemented
gssenc support. Patching the Postgres server itself seems like a good
way to go.

And are there any other sites that
need to make the same guarantee before returning?

Which other sites do you mean?

I'm mostly worried that other parts of libpq might assume that a
single call to pqReadData will drain the buffers. If not, great! --
but I haven't had time to check all the call sites.

I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?

No problem; registered: https://commitfest.postgresql.org/50/5251/

Thank you!

--Jacob

#5vignesh C
vignesh21@gmail.com
In reply to: Jacob Champion (#4)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Thu, 12 Sept 2024 at 04:30, Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:

How did you verify the issue on the server side - with YugabyteDB or
with a modified Postgres server? I'd like to verify the GSSAPI part and
I'm familiar with the Postgres server only.

Neither, unfortunately -- I have a protocol testbed that I use for
this kind of stuff. I'm happy to share once I get it cleaned up, but
it's unlikely to help you in this case since I haven't implemented
gssenc support. Patching the Postgres server itself seems like a good
way to go.

And are there any other sites that
need to make the same guarantee before returning?

Which other sites do you mean?

I'm mostly worried that other parts of libpq might assume that a
single call to pqReadData will drain the buffers. If not, great! --
but I haven't had time to check all the call sites.

@Jacob, could you find some time to wrap this up? This will help us
assess whether it can be refined into a committable state soon.

Regards,
Vignesh

#6Jacob Champion
jacob.champion@enterprisedb.com
In reply to: vignesh C (#5)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Sun, Mar 16, 2025 at 6:14 AM vignesh C <vignesh21@gmail.com> wrote:

@Jacob, could you find some time to wrap this up? This will help us
assess whether it can be refined into a committable state soon.

This is a bug report that likely requires backpatching; feature freeze
should not be an issue here.

Thanks,
--Jacob

#7Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#2)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Tue, Sep 10, 2024 at 11:49 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

I need to switch away from this for a bit.

"a bit"

In an effort to get this unstuck (and ideally solved during this
commitfest) here are my most recent thoughts:

I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?

So, Postgres internals cheat: calls to
pqWait/pqWaitTimed/pqSocketCheck() all reach down into the SSL buffer
to see if there are any pending bytes before polling the socket. I do
not yet understand why this protection is not extended to
GSS-encrypted connections.

In any case, our clients can't make use of that protection either, so
it's unfortunate that we're relying on it ourselves, because it makes
reasoning about a fix more difficult. I'm not about to propose
removing that case in a backport, but as long as it's there covering
up the problem, it's hard to say whether we've truly fixed this bug.

I think the hang could happen in any situation where a third-party
client calls into a function which calls pqReadData(), and then
immediately polls the socket for a read-ready condition. So: what are
the APIs that could be called that way?

Clearly PQconsumeInput() is one. PQconnectPoll() seems like it should
be exposed as well. postgres_fdw's use of PQgetResult() might qualify
(or if it doesn't, I don't understand why not -- there is at least one
path in PQgetResult() where the buffer is not fully consumed). And
nonblocking mode seems like it might sprinkle this hazard into more
places, as pqSendSome() will then start reading data and returning
before a pqWait().

So, to fix this once and for all, do we have to make sure that
pqReadData() always fully drains the SSL/GSS transport buffer instead
of eagerly returning?

--Jacob

#8Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#7)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Tue, Jul 1, 2025 at 1:42 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

I do
not yet understand why this protection is not extended to
GSS-encrypted connections.

After repurposing some of my test code for d98cefe11, I'm able to
reproduce the hang with gssencmode when the server uses a
smaller-than-standard (12k) send buffer. Same reproduction case as the
original (32006-byte DataRow, 806-byte DataRow, CommandComplete,
ReadyForQuery).

So a complete patch for this has to consider GSS as well.

So, to fix this once and for all, do we have to make sure that
pqReadData() always fully drains the SSL/GSS transport buffer instead
of eagerly returning?

I'll work on proving that code paths other than PQconsumeInput() are
affected. If they are, I'll start a patch for pqReadData().

--Jacob

#9Merlin Moncure
mmoncure@gmail.com
In reply to: Lars Kanis (#1)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Sun, Sep 8, 2024 at 2:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:

To fix this issue the attached patch calls pqReadData() repeatedly in
PQconsumeInput() until there is no buffered SSL data left to be read.
Another solution could be to process buffered SSL read bytes in
PQisBusy() instead of PQconsumeInput().

I've definitely seen this problem in AWS RDS, or something related to
this. I'm using dblink to set up lots of connections and using
dblink_is_busy (which calls PQconsumeInput). If there is lots of data
being returned, say, with large results or log information, the server
would stall for unknown reasons (sometimes for several seconds or more)
while peeling off the 8kB fragments. Point being, PQconsumeInput() not
draining the buffers can lead to major performance problems that are hard
to troubleshoot. I would argue this is a bug, really.

merlin

#10Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#8)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Wed, Jul 2, 2025 at 4:12 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

I'll work on proving that code paths other than PQconsumeInput() are
affected. If they are, I'll start a patch for pqReadData().

Here's one way for a server implementation to hang the client during
PQconnectPoll(): accept the SSLRequest and immediately send a protocol
2.0 error message of (16k + 1) bytes, but split the message across two
TLS records of 8k and (8k + 1) bytes each, and wait for the client to
close the connection.

This case works fine with synchronous connect, but it fails with
async. libpq will fill the first 8k of its 16k input buffer with the
first record. It fills the second half with all but one byte of the
second record, and since it doesn't find a null terminator anywhere in
the buffer, it tells the client to poll on the socket again. Since the
server in this case is waiting for the client to close first, no
additional socket activity is coming, and the connection deadlocks
with one byte pending in the SSL buffer.

I use 2.0 as an example; libpq avoids getting stuck in 3.0 by
enlarging the input buffer to the size of the incoming message. But
that's supposed to be a performance optimization, not anti-deadlock
protection. Even if we were to enshrine the buffer sizes we use in the
protocol specification itself [1], someone's still likely to trip over
this hazard if they start adjusting those performance optimizations.
(Even our use of a 16k initial buffer is itself just an optimization.)

So I think pqReadData() needs to make the same guarantees for SSL/GSS
that it does for plain TCP: when it returns, either 1) there are no
bytes left pending or 2) the socket itself is marked readable.
Otherwise I think we'll continue to chase weird corner cases.

What I'm not sure about is unforeseen consequences -- could fixing
this expose other latent bugs? If we're concerned about that, we might
have to tier the backports to make sure everything remains stable.
Here are the three bugs discussed so far, with the fix for 3 possibly
subsuming 1 and 2 but also affecting more code:

1) pqSocketCheck() checks for pending SSL bytes but not GSS bytes
2) PQconsumeInput() does not drain all pending bytes
3) pqReadData() does not drain all pending bytes

Thanks,
--Jacob

[1]: We should not do that.

#11Merlin Moncure
mmoncure@gmail.com
In reply to: Jacob Champion (#10)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Tue, Jul 15, 2025 at 4:31 PM Jacob Champion <
jacob.champion@enterprisedb.com> wrote

Otherwise I think we'll continue to chase weird corner cases.

Agreed. Here's a little more detail on the case I noticed:

* postgres backend thread managing several libpq connections, with polling
is_busy loop
* when the client pushed a lot of log messages (say, with 'RAISE NOTICE'), the
server would stall for significant periods of time, sometimes minutes --
during that time, there would be no activity in the log (the server doesn't
do anything interesting outside of the polling loop)
* checking the backend server log, which relogs the client-raised logs via
dblink, each pause happened after roughly 16k bytes
* reducing the log verbosity made the problem go away.

merlin

#12Andres Freund
andres@anarazel.de
In reply to: Jacob Champion (#10)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

Hi,

On 2025-07-15 15:31:25 -0700, Jacob Champion wrote:

So I think pqReadData() needs to make the same guarantees for SSL/GSS
that it does for plain TCP: when it returns, either 1) there are no
bytes left pending or 2) the socket itself is marked readable.
Otherwise I think we'll continue to chase weird corner cases.

It's not really the same issue, as at least my reproducer requires modifying
libpq (and thus clearly isn't a bug), but I thought it might still be
interesting. For one, I'm not sure this can only be triggered with the
modifications.

If one modifies libpq to use openssl readahead (which does result in speedups,
because otherwise openssl thinks it's useful to do lots of 5-byte reads from
the socket), I see occasional hangs in libpq.

diff --git i/src/interfaces/libpq/fe-secure-openssl.c w/src/interfaces/libpq/fe-secure-openssl.c
index 51dd7b9fec0..a8deb28594c 100644
--- i/src/interfaces/libpq/fe-secure-openssl.c
+++ w/src/interfaces/libpq/fe-secure-openssl.c
@@ -864,6 +864,9 @@ initialize_SSL(PGconn *conn)
      */
     SSL_CTX_set_mode(SSL_context, SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER);
+    SSL_CTX_set_read_ahead(SSL_context, 1);
+    SSL_CTX_set_default_read_buffer_len(SSL_context, 64*1024);
+
     /*
      * If the root cert file exists, load it so we can perform certificate
      * verification. If sslmode is "verify-full" we will also do further

ninja src/bin/psql/psql && src/bin/psql/psql -Xq -h localhost -c '\copy pg_class to stdout'

hangs reliably for me (if it's going to a file it's less reliable, so there's
a timing aspect here).

At that point psql is not interruptible, which isn't great either...

Backtrace:

#0 __internal_syscall_cancel (a1=a1@entry=140726731206072, a2=a2@entry=1, a3=-1, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=7)
at ./nptl/cancellation.c:44
#1 0x00007f1e25e3f6ad in __syscall_cancel (a1=a1@entry=140726731206072, a2=a2@entry=1, a3=<optimized out>, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=7)
at ./nptl/cancellation.c:75
#2 0x00007f1e25eb39c6 in __GI___poll (fds=fds@entry=0x7ffd7ed2edb8, nfds=nfds@entry=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#3 0x00007f1e26248dea in poll (__fds=0x7ffd7ed2edb8, __nfds=1, __timeout=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:44
#4 PQsocketPoll (sock=<optimized out>, forRead=<optimized out>, forWrite=<optimized out>, end_time=<optimized out>)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1175
#5 0x00007f1e262491a8 in pqSocketCheck (conn=conn@entry=0x55f69b7e68f0, forRead=1, forWrite=0, end_time=-1)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1114
#6 0x00007f1e262492b3 in pqWaitTimed (forRead=<optimized out>, forWrite=<optimized out>, conn=0x55f69b7e68f0, end_time=<optimized out>)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1039
#7 0x00007f1e262492f1 in pqWait (forRead=forRead@entry=1, forWrite=forWrite@entry=0, conn=conn@entry=0x55f69b7e68f0)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1021
#8 0x00007f1e26234c46 in pqGetCopyData3 (conn=conn@entry=0x55f69b7e68f0, buffer=buffer@entry=0x7ffd7ed2efb8, async=async@entry=0)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-protocol3.c:1846
#9 0x00007f1e26243d98 in PQgetCopyData (conn=conn@entry=0x55f69b7e68f0, buffer=buffer@entry=0x7ffd7ed2efb8, async=async@entry=0)
at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-exec.c:2840
#10 0x000055f65d28f9c6 in handleCopyOut (conn=0x55f69b7e68f0, copystream=copystream@entry=0x7f1e25f985c0 <_IO_2_1_stdout_>, res=res@entry=0x7ffd7ed2f008)
at ../../../../../home/andres/src/postgresql/src/bin/psql/copy.c:442
#11 0x000055f65d27b07d in HandleCopyResult (resultp=resultp@entry=0x7ffd7ed2f158, copystream=0x7f1e25f985c0 <_IO_2_1_stdout_>)
at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:950
#12 0x000055f65d27dab8 in ExecQueryAndProcessResults (query=query@entry=0x55f69b8b3470 "COPY pg_attribute TO STDOUT ",
elapsed_msec=elapsed_msec@entry=0x7ffd7ed2f1b8, svpt_gone_p=svpt_gone_p@entry=0x7ffd7ed2f1b7, is_watch=is_watch@entry=false, min_rows=min_rows@entry=0,
opt=opt@entry=0x0, printQueryFout=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:1934
#13 0x000055f65d27cdc9 in SendQuery (query=0x55f69b8b3470 "COPY pg_attribute TO STDOUT ")
at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:1212
#14 0x000055f65d28f6dd in do_copy (args=args@entry=0x55f69b8b1f10 "pg_attribute to stdout")
at ../../../../../home/andres/src/postgresql/src/bin/psql/copy.c:370
#15 0x000055f65d299f24 in exec_command_copy (scan_state=scan_state@entry=0x55f69b8a71f0, active_branch=active_branch@entry=true)
at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:951
#16 0x000055f65d2a2065 in exec_command (cmd=cmd@entry=0x55f69b8b3310 "copy", scan_state=scan_state@entry=0x55f69b8a71f0, cstack=cstack@entry=0x55f69b887ba0,
query_buf=query_buf@entry=0x0, previous_buf=previous_buf@entry=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:338
#17 0x000055f65d2a2b59 in HandleSlashCmds (scan_state=scan_state@entry=0x55f69b8a71f0, cstack=cstack@entry=0x55f69b887ba0, query_buf=query_buf@entry=0x0,
previous_buf=previous_buf@entry=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:241
#18 0x000055f65d29305f in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../home/andres/src/postgresql/src/bin/psql/startup.c:409

It does seem ... weird ... that pqGetCopyData3() just processes whatever is
already in the libpq buffer and then waits for the socket to become readable,
before ever calling pqReadData(). What guarantees, even without the change
above, that a) there's no pending data inside openssl b) that we're not
waiting for a write?

If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
if it returns 1, the hang vanishes.

The same

    if (pqWait(true, false, conn) ||
        pqReadData(conn) < 0)
        return -2;

pattern (without a preceding pqReadData()) exists in several other places :(

What I'm not sure about is unforeseen consequences -- could fixing
this expose other latent bugs? If we're concerned about that, we might
have to tier the backports to make sure everything remains stable.
Here are the three bugs discussed so far, with the fix for 3 possibly
subsuming 1 and 2 but also affecting more code:

1) pqSocketCheck() checks for pending SSL bytes but not GSS bytes
2) PQconsumeInput() does not drain all pending bytes
3) pqReadData() does not drain all pending bytes

What are the limits for the maximum amount of data this could make us buffer
in addition to what we are buffering right now? It's not entirely obvious to
me that a loop around pqReadData() as long as there is pending data couldn't
make us buffer a lot of data.

Do you have a WIP patch?

Greetings,

Andres Freund

#13Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Merlin Moncure (#11)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Wed, Jul 16, 2025 at 7:36 AM Merlin Moncure <mmoncure@gmail.com> wrote:

Agreed. Here's a little more detail on the case I noticed:

* postgres backend thread managing several libpq connections, with polling is_busy loop
* when client pushed a lot of log messages (say, with 'RAISE NOTICE'), the server would stall for significant periods of time, sometimes minutes -- during that time, there would be no activity in the log (the server doesn't do anything interesting outside of the polling loop)
* checking backend server log, which relogs the client raised logs via dblink, each pause happened after roughly 16k bytes
* reducing the log verbosity made the problem go away.

Thanks for this! I'm not sure I'll be able to prove that this is
related, as opposed to some other potential problem, but it's good to
have the information.

--Jacob

#14Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andres Freund (#12)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:

If one modifies libpq to use openssl readahead (which does result in speedups,
because otherwise openssl thinks it's useful to do lots of 5-byte reads from
the socket), I see occasional hangs in libpq.

Now that is a very interesting coincidence, because the patch I'm
working on is about to add an assertion that readahead is turned
*off*. :D

Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them. It would even hide those bytes from the pending
count:

Therefore, it is possible for no more bytes to be readable from the
underlying BIO (because OpenSSL has already read them) and for
SSL_pending() to return 0, even though readable application data bytes
are available (because the data is in unprocessed buffered records).

That seems like the polar opposite of what we want. If clients are
polling on the raw socket, those pending bytes must be accounted for.

It does seem ... weird ... that pqGetCopyData3() just processes whatever is
already in the libpq buffer and then waits for the socket to become readable,
before ever calling pqReadData(). What guarantees, even without the change
above, that a) there's no pending data inside openssl b) that we're not
waiting for a write?

(a) is the bug discussed here, AFAICT? I have not considered (b) yet,
but that seems like a separate problem.

If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
if it returns 1, the hang vanishes.

Sure, but that's cheating. I think I'd expect that adding extra
"unnecessary" reads would help cover up the problem that readahead has
caused.

What are the limits for the maximum amount of data this could make us buffer
in addition to what we are buffering right now?

I still need to make sure, but I believe it should be an additional
16k due to the current limits on GSS token sizes and TLS record sizes.
In normal operation, I think we basically immediately double the input
buffer sizes (which start at 16k) as soon as a long message comes in.
So that order of magnitude didn't immediately seem worrisome to me.

It's not entirely obvious to
me that a loop around pqReadData() as long as there is pending data couldn't
make us buffer a lot of data.

FWIW, I'm not planning to loop around it; I'm planning to add a
pgsecure_* primitive that drains whatever the transport is holding
onto.

Do you have a WIP patch?

I'm working on one now.

--Jacob

[1]: https://docs.openssl.org/3.5/man3/SSL_pending/#synopsis

#15Andres Freund
andres@anarazel.de
In reply to: Jacob Champion (#14)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

Hi,

On 2025-07-16 11:50:46 -0700, Jacob Champion wrote:

On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:

If one modifies libpq to use openssl readahead (which does result in speedups,
because otherwise openssl thinks it's useful to do lots of 5-byte reads from
the socket), I see occasional hangs in libpq.

Now that is a very interesting coincidence, because the patch I'm
working on is about to add an assertion that readahead is turned
*off*. :D

Heh.

Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them.

Right - but it also substantially reduces the number of syscalls :(. I'm not
sure that it's wise to close that door permanently. Clearly it's not something
we need to care about in the back branches.

Without it openssl will do one read for the record length, and another read
for the record contents. I.e. it'll often double the number of syscalls.

Without readahead:

perf stat -e raw_syscalls:sys_enter,cycles,context-switches src/bin/pgbench/pgbench -c $c -j$j -M prepared -n -P 1 -f <(echo 'select 1') -h localhost -t1000

5,561 raw_syscalls:sys_enter
89,125,976 cycles
1,009 context-switches

With readahead:

4,556 raw_syscalls:sys_enter
83,582,571 cycles
1,007 context-switches

It would even hide those bytes from the pending
count:

Therefore, it is possible for no more bytes to be readable from the
underlying BIO (because OpenSSL has already read them) and for
SSL_pending() to return 0, even though readable application data bytes
are available (because the data is in unprocessed buffered records).

It's rather depressing; how did this API behaviour even come to be?

Luckily there seems to be SSL_has_pending():

SSL_has_pending() returns 1 if s has buffered data (whether processed or
unprocessed) and 0 otherwise

And it seems old enough:

The SSL_has_pending() function was added in OpenSSL 1.1.0.

It does seem ... weird ... that pqGetCopyData3() just processes whatever is
already in the libpq buffer and then waits for the socket to become readable,
before ever calling pqReadData(). What guarantees, even without the change
above, that a) there's no pending data inside openssl b) that we're not
waiting for a write?

(a) is the bug discussed here, AFAICT?

I'm not entirely sure, tbh.

If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
if it returns 1, the hang vanishes.

Sure, but that's cheating. I think I'd expect that adding extra
"unnecessary" reads would help cover up the problem that readahead has
caused.

To me the pattern in our code seems bogus, even without readahead, no?

getCopyDataMessage() can return 0 because our internal buffer is empty,
without ever reading from the socket or openssl. But if the SSL record
contained two messages (or parts of two messages), all the required data may
be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
forever. That's a problem unrelated to readahead, no?

I.e. the extra read actually ensures that we're not doing a pqWait() without
knowing whether we need one.

Greetings,

Andres Freund

#16Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andres Freund (#15)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Wed, Jul 16, 2025 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:

Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them.

Right - but it also substantially reduces the number of syscalls :(. I'm not
sure that it's wise to close that door permanently.

I think it's very likely that OpenSSL's implementation of readahead is
fundamentally incompatible with our async implementation. For one, the
documented way to call SSL_read() without accidentally hitting the
socket is via SSL_pending(), which readahead is documented to break.
For two, if you're worried about how much data we could potentially
have to hold during a "drain all pending" operation, I think readahead
changes the upper bound from "the size of one TLS record" to
"SO_RCVBUF", doesn't it?

Without it openssl will do one read for the record length, and another read
for the record contents. I.e. it'll often double the number of syscalls.

That is unfortunate... but if we're talking about a
correctness/performance tradeoff then I'm somewhat less sympathetic to
the performance side.

Luckily there seems to be SSL_has_pending():

SSL_has_pending() returns 1 if s has buffered data (whether processed or
unprocessed) and 0 otherwise

IIUC, we can't use SSL_has_pending() either. I need to know exactly how
many bytes to drain out of the userspace buffer without introducing
more bytes into it.

To me the pattern in our code seems bogus, even without readahead, no?

getCopyDataMessage() can return 0 because our internal buffer is empty,
without ever reading from the socket or openssl. But if the SSL record
contained two messages (or parts of two messages), all the required data may
be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
forever.

I think the "cheating" in pqSocketCheck() that I mentioned upthread
ensures that pqWait() will see the OpenSSL data and return
immediately. (That's a major architectural smell IMO, but it is there.
Just not for GSS, which might explain some hangs discussed on the list
a while back.)

I.e. the extra read actually ensures that we're not doing a pqWait() without
knowing whether we need one.

I don't think so, because (without my WIP patch) you still can't
guarantee that the single call to pqReadData() got all of the
readahead data. And with my WIP patch, you don't need the extra call,
because the previous call to pqReadData() will have ensured that
there's no leftover SSL buffer to begin with. Am I missing something?

Thanks!
--Jacob

#17Andres Freund
andres@anarazel.de
In reply to: Jacob Champion (#16)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

Hi,

On 2025-07-16 15:25:14 -0700, Jacob Champion wrote:

On Wed, Jul 16, 2025 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:

Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them.

Right - but it also substantially reduces the number of syscalls :(. I'm not
sure that it's wise to close that door permanently.

I think it's very likely that OpenSSL's implementation of readahead is
fundamentally incompatible with our async implementation. For one, the
documented way to call SSL_read() without accidentally hitting the
socket is via SSL_pending(), which readahead is documented to break.

Why do we care about not hitting the socket? We always operate the socket in
non-blocking mode anyway?

FWIW, I have seen too many folks move away from encrypted connections for
backups due to openssl being the bottleneck to just accept that we'll never
go beyond the worst performance. Any actually decently performing networking
will need to divorce when socket IO happens and when openssl sees the data
much more heavily. Just relying more and more on extremely tight coupling is a
dead end.

For two, if you're worried about how much data we could potentially
have to hold during a "drain all pending" operation, I think readahead
changes the upper bound from "the size of one TLS record" to
"SO_RCVBUF", doesn't it?

It seems to be a configurable limit, with
SSL_set_default_read_buffer_len(). But the default is to just read up to a
whole frame's worth of data, instead of reading the length
separately. I.e. an irrelevant amount of memory.

Without it openssl will do one read for the record length, and another read
for the record contents. I.e. it'll often double the number of syscalls.

That is unfortunate... but if we're talking about a
correctness/performance tradeoff then I'm somewhat less sympathetic to
the performance side.

I'm obviously not advocating for incorrectness.

Luckily there seems to be SSL_has_pending():

SSL_has_pending() returns 1 if s has buffered data (whether processed or
unprocessed) and 0 otherwise

IIUC, we can't use SSL_has_pending() either. I need to know how
exactly many bytes to drain out of the userspace buffer without
introducing more bytes into it.

Why? The only case where we ought to care about whether pending data exists
inside openssl is if our internal buffer is either empty or doesn't contain
the entire message we're trying to consume. In either of those two cases we
can just consume some data from openssl, without caring how much it precisely
is.

To me the pattern in our code seems bogus, even without readahead, no?

getCopyDataMessage() can return 0 because our internal buffer is empty,
without ever reading from the socket or openssl. But if the SSL record
contained two messages (or parts of two messages), all the required data may
be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
forever.

I think the "cheating" in pqSocketCheck() that I mentioned upthread
ensures that pqWait() will see the OpenSSL data and return
immediately. (That's a major architectural smell IMO, but it is there.
Just not for GSS, which might explain some hangs discussed on the list
a while back.)

I guess so, but it's hard to know without a patch.

I.e. the extra read actually ensures that we're not doing a pqWait() without
knowing whether we need one.

I don't think so, because (without my WIP patch) you still can't
guarantee that the single call to pqReadData() got all of the
readahead data.

It wouldn't need to get all the data in one pqReadData() in this case, as we'd
just loop around and do the whole thing again, until the entire message is
read in (regardless of whether there's remaining data in a partially consumed
record, or in the "unprocessed buffer").

Greetings,

Andres Freund

#18Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andres Freund (#17)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Wed, Jul 16, 2025 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:

Why do we care about not hitting the socket? We always operate the socket in
non-blocking mode anyway?

IIUC, that would change pqReadData() from a bounded read to an
unbounded one. Then we have to somehow protect against a server that
can send data faster than we can process it.

FWIW, I have seen too many folks move away from encrypted connections for
backups due to openssl being the bottlenecks to just accept that we'll never
go beyond the worst performance. Any actually decently performing networking
will need to divorce when socket IO happens and when openssl sees the data
much more heavily.

Can't we make our custom BIO do that?

Just relying more and more on extremely tight coupling is a
dead end.

Why? This abstraction _must_ leak. Because of PQsocket() at minimum,
but also there's a bunch of code complexity around the read/write
behavior and EOF differences for a TLS connection. When I look at the
code in pqReadData() I am very sad.

TLS does not behave like raw TCP. So if there's no way to get rid of
the coupling, IMO we might as well use it.

For two, if you're worried about how much data we could potentially
have to hold during a "drain all pending" operation, I think readahead
changes the upper bound from "the size of one TLS record" to
"SO_RCVBUF", doesn't it?

It seems to be a configurable limit, with
SSL_set_default_read_buffer_len(). But the default is to just read up to a
whole frame's worth of data, instead of reading the length
separately. I.e. an irrelevant amount of memory.

Hmm, okay.

IIUC, we can't use SSL_has_pending() either. I need to know exactly how
many bytes to drain out of the userspace buffer without introducing
more bytes into it.

Why? The only case where we ought to care about whether pending data exists
inside openssl is if our internal buffer is either empty or doesn't contain
the entire message we're trying to consume. In either of those two cases we
can just consume some data from openssl, without caring how much it precisely
is.

I can't tell if you're discussing the fix for this bug or a
hypothetical future architecture that makes readahead safe. I don't
think PQconnectPoll() behaves the way you're describing; there is no
retry-read concept.

The order of operations is:
- read some available data; don't care how much
- parse the message(s) if we have enough data
- otherwise tell the caller to poll for read

That's an architecture that's already very tightly coupled to the
behavior of un-userspace-buffered TCP/UNIX sockets. Which is why I'm
trying to have pqReadData() make compatible guarantees for SSL/GSS.

--Jacob

#19Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#14)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Wed, Jul 16, 2025 at 11:50 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:

Do you have a WIP patch?

I'm working on one now.

The attached still needs some documentation work, and closer
inspection of the new assertions and OpenSSL error handling, but I
think it's good enough to get my intent across.

0001 is a fix for pqWait+gssencmode, which is hopefully the easier of
the two to review/backport. The current behavior for GSS is, IMO, an
obvious oversight.

0002 adds drain semantics at the end of pqReadData(). If we were
writing this from scratch, it wouldn't look like this: there's no
point in having the transport layer split data off into an undersized
conn->inBuffer if we're just going to come right back and drain it all
anyway. But I'm hoping this is easier to reason about and verify,
compared to a bigger rewrite.

I've renamed the existing pgtls_read_pending() to
pgtls_read_is_pending(), to try to disambiguate it from the new
*_drain_pending() APIs, but I'm not particularly attached to my choice
of names.

Thanks,
--Jacob

Attachments:

0001-WIP-extend-TLS-read-pending-concept-to-GSS-too.patchapplication/octet-stream; name=0001-WIP-extend-TLS-read-pending-concept-to-GSS-too.patchDownload+31-7
0002-WIP-add-a-drain-pending-concept-to-SSL-GSS.patchapplication/octet-stream; name=0002-WIP-add-a-drain-pending-concept-to-SSL-GSS.patchDownload+242-4
#20Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#19)
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

The current behavior for GSS is, IMO, an
obvious oversight.

(A better way to word this would have been "clearly an oversight"; I
didn't mean to imply that the correct behavior is obvious. The
opposite, in fact.)

--Jacob

#21Andres Freund
andres@anarazel.de
In reply to: Jacob Champion (#18)
#22Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andres Freund (#21)
#23Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#19)
#24Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jacob Champion (#23)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jacob Champion (#23)