BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

Started by PG Bug reporting form · 6 messages · pgsql-bugs
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18210
Logged by: Lars Kanis
Email address: lars@greiz-reinsdorf.de
PostgreSQL version: 16.1
Operating system: Ubuntu-23.10
Description:

I'm the maintainer of the Ruby binding of libpq, and this issue happens in
Ruby, but it isn't related to the Ruby environment. It is only that the
binding always runs libpq with non-blocking mode enabled internally, which
is why the issue shows up there.

The error "GSSAPI caller failed to retransmit all data needing to be
retried" is raised here:
https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110
It happens only in non-blocking mode over GSSAPI-encrypted connections. It
isn't reliably reproducible and depends on network timing. When sending a
7 MB file in alternating pieces of 27 KB and 180 bytes per PQputCopyData()
call, there is a 50% chance of hitting the failure over the local network.
It doesn't happen if TLS is used instead.

I added some debug prints to libpq to show the amounts of data
transferred, and captured the gdb backtrace of where the error is raised.
It looks like this:
```
[...lots of data blocks successfully transferred...]
PQputCopyData() size=186
PQputCopyData() size=27714
pqPutMsgEnd() conn->outCount=30162 toSend=24576
pg_GSS_write() len=24576 PqGSSSendConsumed=0
PQputCopyData() size=189
PQputCopyData() size=27405
pqPutMsgEnd() conn->outCount=33190 toSend=32768
pg_GSS_write() len=32768 PqGSSSendConsumed=0
PQputCopyData() size=185
PQputCopyData() size=27481
pqPutMsgEnd() conn->outCount=28098 toSend=24576
pg_GSS_write() len=24576 PqGSSSendConsumed=0
PQputCopyData() size=182
pqPutMsgEnd() conn->outCount=11965 toSend=8192
pg_GSS_write() len=8192 PqGSSSendConsumed=8256

Breakpoint 1, 0x00007ffff20fe811 in appendPQExpBufferStr ()
from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
(gdb) bt
#0 0x00007ffff20fe811 in appendPQExpBufferStr () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#1 0x00007ffff2103545 in pg_GSS_write () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#2 0x00007ffff20fbbad in pqsecure_write () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#3 0x00007ffff20f569d in pqSendSome () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#4 0x00007ffff20f517c in pqPutMsgEnd () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#5 0x00007ffff20f07ba in PQputCopyData () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#6 0x00007ffff1f259e2 in ?? () from
/home/kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#7 0x00007ffff7c67002 in rb_nogvl () from
/home/kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#8 0x00007ffff1f261ac in gvl_PQputCopyData ()
from
/home//kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#9 0x00007ffff1f2af90 in ?? () from
/home/kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#10 0x00007ffff7c9ebff in vm_call_cfunc_with_frame ()
from /home//kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#11 0x00007ffff7caefa0 in vm_exec_core () from
/home/kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#12 0x00007ffff7cb4a23 in rb_vm_exec () from /home/kanis/.rvm/rub
[...more ruby stack...]
```

The problem occurs when pqPutMsgEnd() sends some data via pqSendSome()
while there are more than 8192 bytes in the output buffer. The data size is
truncated to the next lower 8 KB boundary. This happens here:
https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-misc.c#L544
Due to this truncation, the new boundary can fall before the boundary of
the previous pqSendSome() call. The error is then raised by
pg_GSS_write(), which cannot transfer any data in that case.

So the error message "GSSAPI caller failed to retransmit all data needing to
be retried" is not about the application calling libpq. In fact the 7MB file
always fits into libpq's internal buffer memory, so that no retransmits are
done on the application level or at the level of the ruby binding.

The issue is fixed if I change libpq in either pqPutMsgEnd() or
pg_GSS_write().
One option is to avoid truncating the send size to the next 8 KB boundary
in pqPutMsgEnd(), like so:
```
if (conn->outCount >= 8192)
{
	int		toSend = conn->outCount;	/* was: conn->outCount - (conn->outCount % 8192) */

	if (pqSendSome(conn, toSend) < 0)
		return EOF;
```

... or to simply return 0 bytes from pg_GSS_write() instead of raising the
error:
```
if (len < PqGSSSendConsumed)
{
	return 0;
	appendPQExpBufferStr(&conn->errorMessage,
						 "GSSAPI caller failed to retransmit all data needing to be retried\n");
```

Unfortunately so far I'm unable to provide a repro script. Maybe the above
information is enough to understand the problem.

Regards,
Lars Kanis

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

PG Bug reporting form <noreply@postgresql.org> writes:

The error "GSSAPI caller failed to retransmit all data needing to be
retried" is raised here:
https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110
It happens only in non-blocking mode over GSSAPI-encrypted connections. It
isn't reliably reproducible and depends on network timing. When sending a
7 MB file in alternating pieces of 27 KB and 180 bytes per PQputCopyData()
call, there is a 50% chance of hitting the failure over the local network.
It doesn't happen if TLS is used instead.

Hmm.

The issue is fixed if I change libpq in either pqPutMsgEnd() or
pg_GSS_write().

Both of those seem like pretty obviously wrong "fixes". However,
I'm not quite sure I see the problem clearly enough to propose
a fix I'd believe in.

Unfortunately so far I'm unable to provide a repro script. Maybe the above
information is enough to understand the problem.

A repro script would be really really helpful here.

regards, tom lane

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

I wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

The error "GSSAPI caller failed to retransmit all data needing to be
retried" is raised here:
https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110
It happens only in non-blocking mode over GSSAPI-encrypted connections. It
isn't reliably reproducible and depends on network timing. When sending a
7 MB file in alternating pieces of 27 KB and 180 bytes per PQputCopyData()
call, there is a 50% chance of hitting the failure over the local network.
It doesn't happen if TLS is used instead.

A repro script would be really really helpful here.

After consuming more caffeine, I was able to repro it with the
not-intended-for-commit hack in 0001 attached. (On my machine,
the test just hangs up upon failing, because the hacked-up
logic in copy.c doesn't cope very well with the failure. I don't
think that is copy.c's fault though, it's just an incomplete hack.)

I concur with the conclusion that it's really pqPutMsgEnd's fault.
By deciding not to send the last partial block that's in the
outBuffer, it runs the risk of not presenting some data that it did
present the last time, and that can trigger the "failed to retransmit
all data" error. This happens because GSS's gss_MaxPktSize is a bit
less than 16K (it's 16320 on my machine). So if we initially present
24K of data (3 blocks), and pg_GSS_write successfully encrypts and
sends one packet of data, then it will encrypt all the rest. But if
its second pqsecure_raw_write call fails with EINTR, it will return
with bytes_sent = 16320 (and PqGSSSendConsumed = 8256), causing us to
reduce the outBuffer contents to 8256 bytes plus whatever partial
block we didn't try to send. If we don't fill outBuffer to at least
16K before trying again, we'll try to send just 8192 bytes, and
kaboom. (This is why the alternating-long-and-short-lines business
is important.)

The quick hack in 0002 attached fixes it, but I can't say that
I like this solution: it's propagating a bit of ugliness that
ought to be localized in pg_GSS_write out to callers.

I wonder if we should drop the idea of returning a positive bytecount
after a partial write, and just return the pqsecure_raw_write result,
and not reset PqGSSSendConsumed until we write everything presented.
In edge cases maybe that would result in some buffer bloat, but it
doesn't seem worse than what happens when the very first
pqsecure_raw_write returns EINTR.

In any case, the backend needs a look to see whether it requires a
similar fix. We don't do nonblock mode there, but I don't think
that means we can never get EINTR.

regards, tom lane

Attachments:

0001-hacky-test-case.patch (text/x-diff, +33/-9)
0002-fix-gssapi-chunking-failure.patch (text/x-diff, +13/-1)
#4Lars Kanis
lars@greiz-reinsdorf.de
In reply to: Tom Lane (#2)
Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

On 22.11.23 at 16:11, Tom Lane wrote:

A repro script would be really really helpful here.

After fiddling a bit longer I was able to write a reproducer in C. It
needs a Kerberos-enabled environment and reliably fails as described when
running over the Internet (via a VPN connection). The output is then:

  PQputCopyData text2 command failed: GSSAPI caller failed to
retransmit all data needing to be retried

The program is here:
https://gist.github.com/larskanis/5546a20365b0ca64e294a32b138f441e

I wasn't able to reproduce the error over a LAN connection with the above
program. Even adding a trigger on the copy table to slow the server down,
or adding PQflush/PQconsumeInput calls, didn't trigger the GSSAPI error.
In contrast, the original program sometimes fails even over a LAN
connection.

Regards,
Lars Kanis

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

I wrote:

I wonder if we should drop the idea of returning a positive bytecount
after a partial write, and just return the pqsecure_raw_write result,
and not reset PqGSSSendConsumed until we write everything presented.
In edge cases maybe that would result in some buffer bloat, but it
doesn't seem worse than what happens when the very first
pqsecure_raw_write returns EINTR.

Here's a patch that fixes it along those lines. I like this better,
I think, not least because it removes the assumption that "interesting"
pqsecure_raw_write failures will recur on the next try.

Still need to look at syncing the backend with this.

regards, tom lane

Attachments:

0003-alternative-gssapi-fix.patch (text/x-diff, +26/-30)
#6Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#5)
Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

I wrote:

I wonder if we should drop the idea of returning a positive bytecount
after a partial write, and just return the pqsecure_raw_write result,
and not reset PqGSSSendConsumed until we write everything presented.
In edge cases maybe that would result in some buffer bloat, but it
doesn't seem worse than what happens when the very first
pqsecure_raw_write returns EINTR.

Here's a patch that fixes it along those lines. I like this better,
I think, not least because it removes the assumption that "interesting"
pqsecure_raw_write failures will recur on the next try.

Thanks for looking into this. I agree with this approach and took a
look through the patch and it all makes sense to me.

Still need to look at syncing the backend with this.

I see you made similar changes to the backend and those also look good.

Thanks again,

Stephen