[PATCH] Fix hostaddr crash during non-blocking cancellation
Hi all,
A connection with only a hostaddr (no host) can't be cancelled via
PQcreateCancel(), because we'll crash in emitHostIdentityInfo(). The
problem is that the synthetic connhost entry we've created for
cancellation has an incorrect type field, which causes the following
code to make bad decisions if connhost[0].host is NULL:
emitHostIdentifyInfo(...)
{
...
if (conn->connhost[conn->whichhost].type == CHT_HOST_ADDRESS)
displayed_host = conn->connhost[conn->whichhost].hostaddr;
else
displayed_host = conn->connhost[conn->whichhost].host;
...
if (conn->connhost[conn->whichhost].type != CHT_HOST_ADDRESS &&
host_addr[0] &&
strcmp(displayed_host, host_addr) != 0) <- crashes here
I think the solution is just to copy over the correct type, as done in
the attached 0001.
Putting in a regression test requires us to once again answer the
question of "where do we test TCP-only features". I'd like to have a
PG_TEST_EXTRA entry for these, so 0002 adds a `tcp` group. That's
going to need more debate, and 002_tcp.pl is very quick-and-dirty, but
I also _really_ want to stop throwing tests away just because we don't
have a nice place to put them... [1]/messages/by-id/flat/CAOYmi+kx8eOmKj01dV4vSBeq9pvqR8dt6rGw+B_pBOE2_GOj+g@mail.gmail.com
So: if 0001 looks good, I propose that I backpatch it after beta1, but
hold onto 0002 until REL_18_STABLE is split off. Then we can figure
out the "test TCP" semantics for the full 19 cycle, and maybe
eventually backpatch tests once we're happy with how they work.
WDYT?
--Jacob
[1]: /messages/by-id/flat/CAOYmi+kx8eOmKj01dV4vSBeq9pvqR8dt6rGw+B_pBOE2_GOj+g@mail.gmail.com
Attachments:
v1-0001-Fix-connhost-type-during-non-blocking-cancellatio.patchapplication/octet-stream; name=v1-0001-Fix-connhost-type-during-non-blocking-cancellatio.patchDownload
From 78190176b20392fa2396731d4693ebe6286679c8 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Mon, 5 May 2025 14:19:10 -0700
Subject: [PATCH v1 1/2] Fix connhost type during non-blocking cancellation
PQcancelCreate() creates a deep copy of the connection's pg_conn_host,
but it missed the type field, which ended up initialized to
CHT_HOST_NAME. The type communicates which pointers in the struct are
valid, so if a connection used hostaddr without a host name (which would
normally result in a type of CHT_HOST_ADDRESS), the client would later
segfault in emitHostIdentityInfo().
Backpatch to 17, where the new API was introduced.
Reviewed-by: TODO
Backpatch-through: 17
---
src/interfaces/libpq/fe-cancel.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/interfaces/libpq/fe-cancel.c b/src/interfaces/libpq/fe-cancel.c
index 25de2a337c9..9674d0a156b 100644
--- a/src/interfaces/libpq/fe-cancel.c
+++ b/src/interfaces/libpq/fe-cancel.c
@@ -137,6 +137,8 @@ PQcancelCreate(PGconn *conn)
goto oom_error;
originalHost = conn->connhost[conn->whichhost];
+ cancelConn->connhost[0].type = originalHost.type;
+
if (originalHost.host)
{
cancelConn->connhost[0].host = strdup(originalHost.host);
--
2.34.1
v1-0002-WIP-add-TCP-tests.patchapplication/octet-stream; name=v1-0002-WIP-add-TCP-tests.patchDownload
From e13d6f97cdd9b2806b4c8fe9db78766d7e7fe1e6 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Mon, 5 May 2025 14:09:54 -0700
Subject: [PATCH v1 2/2] WIP: add TCP tests
Regression test for the previous patch. Needs discussion on-list.
---
.cirrus.tasks.yml | 2 +-
src/test/modules/libpq_pipeline/meson.build | 1 +
src/test/modules/libpq_pipeline/t/002_tcp.pl | 45 ++++++++++++++++++++
3 files changed, 47 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/libpq_pipeline/t/002_tcp.pl
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 92057006c93..dc7786fff42 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -29,7 +29,7 @@ env:
MTEST_ARGS: --print-errorlogs --no-rebuild -C build
PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
- PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance oauth
+ PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance oauth tcp
# What files to preserve in case tests fail
diff --git a/src/test/modules/libpq_pipeline/meson.build b/src/test/modules/libpq_pipeline/meson.build
index 3fd70a04a38..9edf6c0a22b 100644
--- a/src/test/modules/libpq_pipeline/meson.build
+++ b/src/test/modules/libpq_pipeline/meson.build
@@ -26,6 +26,7 @@ tests += {
'tap': {
'tests': [
't/001_libpq_pipeline.pl',
+ 't/002_tcp.pl',
],
'deps': [libpq_pipeline],
},
diff --git a/src/test/modules/libpq_pipeline/t/002_tcp.pl b/src/test/modules/libpq_pipeline/t/002_tcp.pl
new file mode 100644
index 00000000000..64902a1a9c2
--- /dev/null
+++ b/src/test/modules/libpq_pipeline/t/002_tcp.pl
@@ -0,0 +1,45 @@
+
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Force test nodes to begin in TCP mode.
+# Use an INIT block so it runs after the BEGIN block in Utils.pm.
+
+INIT { $PostgreSQL::Test::Utils::use_unix_sockets = 0; }
+
+use PostgreSQL::Test::Cluster;
+
+if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\btcp\b/)
+{
+ plan skip_all =>
+ 'Potentially unsafe test TCP not enabled in PG_TEST_EXTRA';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+# Don't let PGHOST interfere with these tests.
+delete $ENV{PGHOST};
+
+my @cases = (
+ $node->connstr('postgres') . " max_protocol_version=latest",
+ $node->connstr('postgres') . " max_protocol_version=3.0",
+ "hostaddr=127.0.0.1 port=" . $node->port);
+
+foreach my $c (@cases)
+{
+ # Don't use $node->command_ok(); it overrides PGHOST too.
+ command_ok(
+ [ 'libpq_pipeline', 'cancel', $c ],
+ "libpq_pipeline cancel, connstr: " . $c);
+}
+
+$node->stop('fast');
+
+done_testing();
--
2.34.1
01 looks sensible to me.
I like 02 as well. Only quibble would be the name (tcp) as it doesn't
really describe a class of things to be tested like the other things in
PG_TEST_EXTRA. Something indicating a lack of socket? Just more verbose
somehow? "tcp_only" perhaps?
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
Jacob Champion <jacob.champion@enterprisedb.com> writes:
A connection with only a hostaddr (no host) can't be cancelled via
PQcreateCancel(), because we'll crash in emitHostIdentityInfo(). The
problem is that the synthetic connhost entry we've created for
cancellation has an incorrect type field, which causes the following
code to make bad decisions if connhost[0].host is NULL:
I hadn't noticed (or maybe I forgot) this thread, so when the
same problem was reported at [1]/messages/by-id/18974-575f02b2168b36b3@postgresql.org I just went ahead and pushed the
submitted patch, which is only cosmetically different from your 0001.
Apologies for treading on your toes.
As for the question of how to test this sort of thing, I'm not
too excited about the narrow-gauge test case your 0002 proposes.
What I did for manual testing in [1]/messages/by-id/18974-575f02b2168b36b3@postgresql.org was to hack the postgres_fdw
tests to connect using hostaddr instead of the default. I think
formalizing that sort of approach would yield much better coverage.
I don't have any specific ideas about how to do it, though.
Maybe get our tests to respond to an environment variable that
allows overriding the default choices of connection properties?
regards, tom lane
On Thu, Jul 3, 2025 at 11:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I hadn't noticed (or maybe I forgot) this thread, so when the
same problem was reported at [1] I just went ahead and pushed the
submitted patch, which is only cosmetically different from your 0001.
Apologies for treading on your toes.
No worries, as long as it's fixed I'm happy!
(And many thanks to Greg for the review; sorry for not getting to it
fast enough.)
As for the question of how to test this sort of thing, I'm not
too excited about the narrow-gauge test case your 0002 proposes.
What I did for manual testing in [1] was to hack the postgres_fdw
tests to connect using hostaddr instead of the default. I think
formalizing that sort of approach would yield much better coverage.
I agree that overriding connection defaults probably gets us better
overall coverage -- I just think I got pushback in the past for adding
"multipliers" in that way. But I won't argue against test coverage as
long as we get it in the end. :D
That said, I am planning to get noisier about the lack of "TCP suite".
The number of tests we've discarded just because we don't have a
current place to put them keeps slowly growing, and my long-term
intent with 0002 was to actually add a new place for them. Whatever
formalization we choose, let's please keep a TCP-only cluster
somewhere instead of forcing people to try to find a least-bad suite
to slot new tests into.
Thanks!
--Jacob