Don't use the deprecated and insecure PQcancel in our frontend tools anymore

Started by Jelte Fennema-Nio5 months ago12 messageshackers
Jump to latest
#1Jelte Fennema-Nio
postgres@jeltef.nl

A bunch of frontend tools, including psql, still used PQcancel to send
cancel requests to the server. That function is insecure, because it
does not use encryption to send the cancel request. This starts using
the new cancellation APIs (introduced in 61461a300) for all these
frontend tools. These APIs use the same encryption settings as the
connection that's being cancelled. Since these APIs are not signal-safe
this required a refactor to not send the cancel request in a signal
handler anymore, but instead using a dedicated thread. Similar logic was
already used for Windows anyway, so this also has the benefit that it
makes the cancel logic more uniform across our supported platforms.

Because this is fixing a security issue, it would be nice if we could
backport it. I'm not sure that's realistic though, given the
size/complexity of the change. I'm curious what others think about that.
To be clear, I'm only really talking about backporting to PG17 and PG18
because those contain the new cancellation APIs in libpq. Backporting
to even older versions would also require backporting the new
cancellation APIs in libpq, which seems like a no-go.

A possible follow up improvement to pg_dump would be to use threads for
its parallel workers on UNIX as well. Then the Windows and Unix
implementations would get even more aligned. I started looking into
that, but that quickly became quite a big refactor, touching a lot of
code unrelated to cancellation. So it seems better to do that in a
separate follow-on patch if people are interested.

Attachments:

v1-0001-Move-Windows-pthread-compatibility-functions-to-s.patchtext/x-patch; charset=utf-8; name=v1-0001-Move-Windows-pthread-compatibility-functions-to-s.patchDownload+6-5
v1-0002-Don-t-use-the-deprecated-and-insecure-PQcancel-in.patchtext/x-patch; charset=utf-8; name=v1-0002-Don-t-use-the-deprecated-and-insecure-PQcancel-in.patchDownload+415-326
#2Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#1)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On Sun Dec 14, 2025 at 3:40 PM CET, Jelte Fennema-Nio wrote:

A bunch of frontend tools, including psql, still used PQcancel to send
cancel requests to the server. That function is insecure, because it
does not use encryption to send the cancel request. This starts using
the new cancellation APIs (introduced in 61461a300) for all these
frontend tools.

Fixed conflict after Copyright update.

Attachments:

v2-0001-Move-Windows-pthread-compatibility-functions-to-s.patchtext/x-patch; charset=utf-8; name=v2-0001-Move-Windows-pthread-compatibility-functions-to-s.patchDownload+5-5
v2-0002-Don-t-use-the-deprecated-and-insecure-PQcancel-in.patchtext/x-patch; charset=utf-8; name=v2-0002-Don-t-use-the-deprecated-and-insecure-PQcancel-in.patchDownload+415-326
#3Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#1)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On Sun Dec 14, 2025 at 3:40 PM CET, Jelte Fennema-Nio wrote:

A bunch of frontend tools, including psql, still used PQcancel to send
cancel requests to the server. That function is insecure, because it
does not use encryption to send the cancel request. This starts using
the new cancellation APIs (introduced in 61461a300) for all these
frontend tools.

Small update. Split up the fe_utils and pg_dump changes into separate
commits, to make patches easier to review. Also use non-blocking writes
to the self-pipe from the signal handler to avoid potential deadlocks
(extremely unlikely for such blocks to occur, but better safe than sorry).

Attachments:

v3-0001-Move-Windows-pthread-compatibility-functions-to-s.patchtext/x-patch; charset=utf-8; name=v3-0001-Move-Windows-pthread-compatibility-functions-to-s.patchDownload+5-5
v3-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchtext/x-patch; charset=utf-8; name=v3-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchDownload+202-120
v3-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchtext/x-patch; charset=utf-8; name=v3-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchDownload+221-204
#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jelte Fennema-Nio (#3)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On 08/02/2026 21:05, Jelte Fennema-Nio wrote:

On Sun Dec 14, 2025 at 3:40 PM CET, Jelte Fennema-Nio wrote:

A bunch of frontend tools, including psql, still used PQcancel to send
cancel requests to the server. That function is insecure, because it
does not use encryption to send the cancel request. This starts using
the new cancellation APIs (introduced in 61461a300) for all these
frontend tools.

Small update. Split up the fe_utils and pg_dump changes into separate
commits, to make patches easier to review. Also use non-blocking writes
to the self-pipe from the signal handler to avoid potential deadlocks
(extremely unlikely for such blocks to occur, but better safe than sorry).

Had a brief look at this:

It took me a while to get the big picture of how this works. cancel.c
could use some high-level comments explaining how to use the facility;
it's a real mixed bag right now.

The SIGINT handler now does three things:
- Set CancelRequested global variable,
- call callback if set, and
- wake up the cancel thread to send the cancel message, if cancel
connection is set.
There's no high-level overview documentation or comments on how those
three mechanism work or interact. It took me a while to understand that
they are really separate, alternative ways to handle SIGINT, all mashed
into the same signal handler function. At first read, I thought they're
somehow part of the one same mechanism.

The cancelConn mechanism is a global variable, which means that it can
only be used with one connection in the process. That's OK with the
current callers, but seems short-sighted. What if we wanted to use it
for pgbench, for example, which uses multiple threads and connections?
Or if we changed pg_dump to use multiple threads, like you also
suggested as a possible follow-up.

The "self-pipe trick" usually refers to interrupting the main thread
from select(); this uses it to wake up the other, separate cancellation
thread. That's fine, but again it took me a while to understand that
that's what it does. Comments!

This is racy, if the cancellation thread doesn't immediately process the
wakeup. For example, because it's still busy processing a previous
wakeup, because there's a network hiccup or something. By the time the
cancellation thread runs, the main thread might already be running a
different query than it was when the user hit CTRL-C.

- Heikki

#5Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Heikki Linnakangas (#4)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On Thu Mar 5, 2026 at 7:30 PM CET, Heikki Linnakangas wrote:

It took me a while to get the big picture of how this works. cancel.c
could use some high-level comments explaining how to use the facility;
it's a real mixed bag right now.

Attached is a version with a bunch more comments. I agree this cancel
logic is hard to understand without them. It took me quite a while to
understand it myself. (I don't think the code got any harder to
understand with these changes though, the exact same complexity was
already there for Windows. But I agree more commends are good.)

The cancelConn mechanism is a global variable, which means that it can
only be used with one connection in the process. That's OK with the
current callers, but seems short-sighted. What if we wanted to use it
for pgbench, for example, which uses multiple threads and connections?
Or if we changed pg_dump to use multiple threads, like you also
suggested as a possible follow-up.

Allowing for multiple callers seems like scope-creep to me, and also
hard to do in a generic way. I'd say IF we want that in a generic way
I'd first want to see a version of that that solves the problem for
pgdench/pg_dump, before generalizing it to cancel.c.

This is racy, if the cancellation thread doesn't immediately process the
wakeup. For example, because it's still busy processing a previous
wakeup, because there's a network hiccup or something. By the time the
cancellation thread runs, the main thread might already be running a
different query than it was when the user hit CTRL-C.

I now noted this in one of the new comments. I don't think there's a way
around this race condition entirely. It's simply a limitation of our
cancel protocol (because it's impossible to specify which query on a
connection should be cancelled).

In theory we could reduce the window for the race, by having all
frontend tools use async connections and have the main thread wait for
either the self-pipe or a cancel. That way it would be more similar to
the previous signal code in behaviour. That's a much bigger lift though,
i.e. all PQexec and PQgetResult calls would need to be modified. My
proposed change doesn't require changing the callsites at all.

In interactive usage of psql it seems pretty unlikely that people will
hit this race condition. In non-interactive use, it doesn't matter
because you just want Ctrl-C to cancel the application (whichever query
it currently runs). So I'd say it's currently not worth the complexity
to do the less racy option.

Attachments:

v4-0001-Move-Windows-pthread-compatibility-functions-to-s.patchtext/x-patch; charset=utf-8; name=v4-0001-Move-Windows-pthread-compatibility-functions-to-s.patchDownload+5-5
v4-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchtext/x-patch; charset=utf-8; name=v4-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchDownload+252-125
v4-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchtext/x-patch; charset=utf-8; name=v4-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchDownload+222-204
#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jelte Fennema-Nio (#5)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On 06/03/2026 04:12, Jelte Fennema-Nio wrote:

On Thu Mar 5, 2026 at 7:30 PM CET, Heikki Linnakangas wrote:

It took me a while to get the big picture of how this works. cancel.c
could use some high-level comments explaining how to use the facility;
it's a real mixed bag right now.

Attached is a version with a bunch more comments. I agree this cancel
logic is hard to understand without them. It took me quite a while to
understand it myself. (I don't think the code got any harder to
understand with these changes though, the exact same complexity was
already there for Windows. But I agree more commends are good.)

Thanks. I agree it was complicated before these patches.

This is racy, if the cancellation thread doesn't immediately process
the wakeup. For example, because it's still busy processing a previous
wakeup, because there's a network hiccup or something. By the time the
cancellation thread runs, the main thread might already be running a
different query than it was when the user hit CTRL-C.

I now noted this in one of the new comments. I don't think there's a way
around this race condition entirely. It's simply a limitation of our
cancel protocol (because it's impossible to specify which query on a
connection should be cancelled).

That's true, but I still wonder if this could make it much worse.

In theory we could reduce the window for the race, by having all
frontend tools use async connections and have the main thread wait for
either the self-pipe or a cancel. That way it would be more similar to
the previous signal code in behaviour. That's a much bigger lift though,
i.e. all PQexec and PQgetResult calls would need to be modified. My
proposed change doesn't require changing the callsites at all.

Yeah, it does have that advantage..

One simple thing we could is to remember the "generation" in the signal
handler, and store it in another global variable ("cancelledGeneration"
or such). In the cancel thread, check that the generation matches;
otherwise the thread is about to send a cancellation to a query that
already finished, and should not send it.

I worry how this behaves if establishing the cancel connection gets
stuck for a long time. Because of a network hiccup, for example. That's
also not a new problem though; it's perhaps even worse today, if the
signal handler gets stuck for a long time, trying to establish the
connection. Still, would be good to do some testing with a bad network.

- Heikki

#7Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Heikki Linnakangas (#6)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On Fri, 6 Mar 2026 at 20:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

In theory we could reduce the window for the race, by having all
frontend tools use async connections and have the main thread wait for
either the self-pipe or a cancel. That way it would be more similar to
the previous signal code in behaviour. That's a much bigger lift though,
i.e. all PQexec and PQgetResult calls would need to be modified. My
proposed change doesn't require changing the callsites at all.

Yeah, it does have that advantage..

I let Claude Code take a stab at a POC for doing the same thread
approach. So I could get a more accurate feeling of how big this lift
would be. It's a much bigger change, but the general design is
relatively straight forward. It's attached as the nocfbot patch (it's
not built on top of any of the other patches, it's a separate one).

One simple thing we could is to remember the "generation" in the signal
handler, and store it in another global variable ("cancelledGeneration"
or such). In the cancel thread, check that the generation matches;
otherwise the thread is about to send a cancellation to a query that
already finished, and should not send it.
In the cancel thread, check that the generation matches;

otherwise the thread is about to send a cancellation to a query that
already finished, and should not send it.

I took a look at this, and attached a fixup patch that does this. It
uses C11 atomics because locks cannot be taken in the signal handler
and the signal handler needs to read/write variables from/to two
different threads. I'm not sure if it pulls it's own weight though. It
seems a really unlikely scenario where the signal handler is fired
during one generation, but then the cancel thread wakes up during
another. I'm not sure if we should care about it. And I actually think
it could make us miss cancel a SIGINT for non-interactive use cases of
psql.

I worry how this behaves if establishing the cancel connection gets
stuck for a long time. Because of a network hiccup, for example. That's
also not a new problem though; it's perhaps even worse today, if the
signal handler gets stuck for a long time, trying to establish the
connection. Still, would be good to do some testing with a bad network.

To make sure we're talking about the same situation, let me summarize
it differently: Establishing the connection for the cancel is slow,
but the actual query connection is still fast.

I think this is an interesting case to consider (much more interesting
than the kind of race the additional generation check could protect
against). First of all because I think it can definitely happen.
Especially with SSL the cancel needs several round trips, while a
query on an already established connection only needs one direction
latency. I can definitely see how this could cause out-of-order
arrival even on stable high-latency networks. Especilaly if there's
also some unfortunate packet drops.

And as you suggested the failure modes are different:
- With master, psql will become unresponsive until the client gets a
response from the server (or tcp timeouts are hit)
- With this patchset, a later query will get cancelled.

I think for interactive psql usage both are annoying, but both are not
the end of the world. I think I would personally prefer the current
master behaviour. I'm not sure preserving it is worth all the
additional code changes though to make all the applications use
non-blocking APIs. In any case SSL for cancel keys is definitely worth
the patchset behaviour to me (even though it sounds slightly worse).

For non-interactive use (i.e. running scripts in psql or other
frontend tools like vacuumdb). I don't think this situation applies.
You want whatever query to be cancelled.

#8Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#7)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On Sat Mar 7, 2026 at 1:01 AM CET, Jelte Fennema-Nio wrote:

I took a look at this, and attached a fixup patch that does this.

Now with the actual attachements...

Attachments:

nocfbot.v0-0001-POC-Use-non-blocking-apis-for-frontend-tools.patchtext/x-patch; charset=utf-8; name=nocfbot.v0-0001-POC-Use-non-blocking-apis-for-frontend-tools.patchDownload+953-207
v5-0001-Move-Windows-pthread-compatibility-functions-to-s.patchtext/x-patch; charset=utf-8; name=v5-0001-Move-Windows-pthread-compatibility-functions-to-s.patchDownload+5-5
v5-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchtext/x-patch; charset=utf-8; name=v5-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchDownload+252-125
v5-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchtext/x-patch; charset=utf-8; name=v5-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchDownload+222-204
v5-0004-fixup-Don-t-use-deprecated-and-insecure-PQcancel-.patchtext/x-patch; charset=utf-8; name=v5-0004-fixup-Don-t-use-deprecated-and-insecure-PQcancel-.patchDownload+37-10
v5-0005-fixup-Don-t-use-deprecated-and-insecure-PQcancel-.patchtext/x-patch; charset=utf-8; name=v5-0005-fixup-Don-t-use-deprecated-and-insecure-PQcancel-.patchDownload+9-2
#9Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Heikki Linnakangas (#6)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On Fri Mar 6, 2026 at 8:51 PM CET, Heikki Linnakangas wrote:

I worry how this behaves if establishing the cancel connection gets
stuck for a long time. Because of a network hiccup, for example. That's
also not a new problem though; it's perhaps even worse today, if the
signal handler gets stuck for a long time, trying to establish the
connection. Still, would be good to do some testing with a bad network.

After thinking on this again, I thought of a much easier solution to
this problem than the direction I was exploring in my previous response
to this:
We can have SetCancelConn() and ResetCancelConn() wait for any pending
cancel to complete before letting them replace/remove the cancelConn.

That way even in case of a bad network, we know that an already
in-flight cancel request will never cancel a query from a next
SetCancelConn() call. It does mean that you cannot submit a new query
before we've received a response to the in-flight cancel request (either
because the hiccup is reselved or because TCP timeouts report a
failure). That's the current behaviour too with running PQcancel in the
signal handler, and I also think that's the behaviour that makes the
most sense.

Attached is a patchset that does this.

To ensure that it worked correctly, I mimicked network issues by running
the following iptables command after already having connected with psql:

sudo iptables -A INPUT -p tcp --dport 5432 -m state --state NEW -j DROP

That command drops all new incoming connections to the server, but
allows already established connections to continue working. Which means
that any new cancel connections will not be able to connect.

You can allow the traffic again with:

sudo iptables -D INPUT -p tcp --dport 5432 -m state --state NEW -j DROP

To see what happens when the connection attempt never goes through I
used:
sudo sysctl net.ipv4.tcp_syn_retries=2
sudo sysctl net.ipv4.tcp_retries2=3

This is what happens then:

localhost jelte@postgres:5432-1484255=# select pg_sleep(5);
^CSending cancel request
Time: 5005.833 ms (00:05.006)
Could not send cancel request: connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection timed out
Is the server running on that host and accepting TCP/IP connections?
localhost jelte@postgres:5432-1484255=#

The query succeeds after 5 seconds, but the prompt does not become
interactive until a little while after that when the cancel request
error is also shown.

Attachments:

v6-0001-Move-Windows-pthread-compatibility-functions-to-s.patchtext/x-patch; charset=utf-8; name=v6-0001-Move-Windows-pthread-compatibility-functions-to-s.patchDownload+5-5
v6-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchtext/x-patch; charset=utf-8; name=v6-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchDownload+241-122
v6-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchtext/x-patch; charset=utf-8; name=v6-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchDownload+225-203
#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jelte Fennema-Nio (#9)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On 15/03/2026 17:09, Jelte Fennema-Nio wrote:

On Fri Mar 6, 2026 at 8:51 PM CET, Heikki Linnakangas wrote:

I worry how this behaves if establishing the cancel connection gets
stuck for a long time. Because of a network hiccup, for example.
That's also not a new problem though; it's perhaps even worse today,
if the signal handler gets stuck for a long time, trying to establish
the connection. Still, would be good to do some testing with a bad
network.

After thinking on this again, I thought of a much easier solution to
this problem than the direction I was exploring in my previous response
to this: We can have SetCancelConn() and ResetCancelConn() wait for any
pending
cancel to complete before letting them replace/remove the cancelConn.

That way even in case of a bad network, we know that an already
in-flight cancel request will never cancel a query from a next
SetCancelConn() call. It does mean that you cannot submit a new query
before we've received a response to the in-flight cancel request (either
because the hiccup is reselved or because TCP timeouts report a
failure). That's the current behaviour too with running PQcancel in the
signal handler, and I also think that's the behaviour that makes the
most sense.

+1. With a little extra effort, the cancellation can be made abortable
too, so that you don't need to wait for the TCP timeout. I.e when
ResetCancelConn() is called, the cancellation thread can immediately
call PQcancelReset().

One a different topic, is there any guarantee on which thread will
receive the SIGINT? It matters because psql's cancel callback sometimes
calls longjmp(), which assumes that the signal handler is executed in
the main thread.

- Heikki

#11Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Heikki Linnakangas (#10)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On Mon, 16 Mar 2026 at 10:57, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

+1. With a little extra effort, the cancellation can be made abortable
too, so that you don't need to wait for the TCP timeout. I.e when
ResetCancelConn() is called, the cancellation thread can immediately
call PQcancelReset().

I agree we could do that, but I don't think we should. Then we'd be
getting into the exact situation where psql doesn't wait for an already
in-flight cancel request to be processed by the server before sending
the next query. i.e. while this would be fine if there's a network
issue, it would be bad if the server is just slow to respond to the
cancel request (e.g. because there's a PgBouncer in the middle that
hasn't forwarded the request yet).

One a different topic, is there any guarantee on which thread will
receive the SIGINT? It matters because psql's cancel callback sometimes
calls longjmp(), which assumes that the signal handler is executed in
the main thread.

Good point, I had thought about whether this mattered, but hadn't
considered the callbacks. Attached is v7 that makes sure the signal is
always handled by the main thread by blocking SIGINT before creating the
cancel thread.

I manually tested that this works by sending signals to specific threads
using htop, and logging thread the id from the signal handler. Before
this change the thread id would be from different threads, after this
change it's always from the same one.

Attachments:

v7-0001-Move-Windows-pthread-compatibility-functions-to-s.patchtext/x-patch; charset=utf-8; name=v7-0001-Move-Windows-pthread-compatibility-functions-to-s.patchDownload+5-5
v7-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchtext/x-patch; charset=utf-8; name=v7-0002-Don-t-use-deprecated-and-insecure-PQcancel-psql-a.patchDownload+257-122
v7-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchtext/x-patch; charset=utf-8; name=v7-0003-pg_dump-Don-t-use-the-deprecated-and-insecure-PQc.patchDownload+225-203
#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jelte Fennema-Nio (#11)
Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore

On 17/03/2026 12:31, Jelte Fennema-Nio wrote:

On Mon, 16 Mar 2026 at 10:57, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

+1. With a little extra effort, the cancellation can be made abortable
too, so that you don't need to wait for the TCP timeout. I.e when
ResetCancelConn() is called, the cancellation thread can immediately
call PQcancelReset().

I agree we could do that, but I don't think we should. Then we'd be
getting into the exact situation where psql doesn't wait for an already
in-flight cancel request to be processed by the server before sending
the next query. i.e. while this would be fine if there's a network
issue, it would be bad if the server is just slow to respond to the
cancel request (e.g. because there's a PgBouncer in the middle that
hasn't forwarded the request yet).

True, it can get messy fast. Hmm, perhaps you could check
PQcancelStatus() and only cancel the cancellation if it hasn't sent the
CancelRequest packet yet.

One a different topic, is there any guarantee on which thread will
receive the SIGINT? It matters because psql's cancel callback sometimes
calls longjmp(), which assumes that the signal handler is executed in
the main thread.

Good point, I had thought about whether this mattered, but hadn't
considered the callbacks. Attached is v7 that makes sure the signal is
always handled by the main thread by blocking SIGINT before creating the
cancel thread.

The more I look into the cancel handling in psql and other client
programs, the more I want to gouge my eyes out. It was pretty messy
before this patch, too, and now also we have an extra thread to worry about.

For example, let's look at what happens at COPY IN, starting from
HandleCopyResult(). It sets SetCancelConn(pset.db), and then calls
handleCopyIn(). handleCopyIn() calls sigsetjmp(), enables the longjmp in
the signal handler with "sigint_interrupt_enabled = true". and calls
fgets().

So we rely on longjmp() to get out of the fgets(). Is that safe? I
suppose, it's worked all these years. It's scary though, I would never
do that in new code. And then you add threads to the mix... is it still
safe in the presence of threads? glibc uses internal locks on FILE
objects to make them thread-safe; if you longjmp() in the middle of
fgets(), is the lock released? I suppose it doesn't matter if only one
thread ever accesses it, but ugh.

Because the signal handler longjmp()'d out, it never calls PQcancel()
(or wakes up the thread, with the patch) in that case, even though the
cancel connection is currently armed with SetCancelConn(). Is that
intentional? If you then immediately SIGINT again, then the signal
handler will PQcancel(). I guess there's some logic to that; if the
signal arrives while you're blocked on the input, sending the CopyEnd
probably still works. Perhaps is should be documented that the callback
is called first.

The cancel handling in wait_on_slots() in parallel_slot.c is surprising
in a different way. It already uses async libpq calls and has a select()
loop, but it still relies on the signal handler to do the cancellation.
And it arbitrarily PQcancel()s only one of the connections it waits on.

The comments you added in your patches help a lot, explaining the three
different ways that cancellation can be handled, thanks for that.

psql has a global variable, 'cancel_pressed', which is set in the signal
handler callback. Is it redundant with 'CancelRequested' that's also set
in the signal handler?

Some ideas on how to make this less confusing:

- Change pg_dump to use threads on Unix too.

- Extract the cancel_pipe stuff into a helper function, and reuse it in
pg_dump and in wait_on_slots()

- Make the callback and SetCancelConn() mutually exclusive, so that when
you call SetCancelConn(), it *replaces* the callback if any was set. To
re-install it, call SetCancelCallback(). This would make them more
symmetrical, and would remove the confusion on what happens with the
active cancel connection if the callback longjmp()s. The
SetCancelCallback()/ResetCancelCallback() calls would replace the
sigint_interrupt_enabled variable.

- Instead of SetCancelConn() and ResetCancelConn(), have functions like
"StartCancelInBackground(conn)" and "WaitBackgroundCancel()". The cancel
callback would then call StartCancelInBackground()".

Not sure how much these really help, but maybe something to try out.

Then there's the idea of switching to async libpq calls and select()
everywhere...

Another more radical idea: Could we move this functionality to libpq
itself? Imagine that we had a function like PQrequestCancel(PGconn
*conn) that just set a flag on the PGconn object to indicate that
cancellation has been requested. When that flag is set, all blocking
calls like PQexec() on the original connection drive the cancellation
connection and sending the cancel request, instead of (or in addition
to) polling the main connection. The new PQrequestCancel() would be safe
to call from the signal handler, but then main thread would do all the
work. That'd be very easy to use when applicable, but there's one pretty
serious problem: it wouldn't work with async libpq calls and select()
without more API changes. The internal cancel connection that's opened
would have a different fd from the one returned by PQsocket(), so the
application wouldn't know to poll it.

- Heikki