Add non-blocking version of PQcancel

Started by Jelte Fennemaabout 4 years ago146 messages
Jump to latest
#1Jelte Fennema
Jelte.Fennema@microsoft.com

The existing PQcancel API is using blocking IO. This makes PQcancel
impossible to use in an event loop based codebase, without blocking the
event loop until the call returns.

This patch adds a new cancellation API to libpq which is called
PQcancelConnectionStart. This API can be used to send cancellations in a
non-blocking fashion. To do this it internally uses regular PGconn
connection establishment. This has as a downside that
PQcancelConnectionStart cannot be safely called from a signal handler.

Luckily, this should be fine for most usages of this API. Since most
code that's using an event loop handles signals in that event loop as
well (as opposed to calling functions from the signal handler directly).

There are also a few advantages of this approach:
1. No need to add and maintain a second non-blocking connection
establishment codepath.
2. Cancel connections benefit automatically from any improvements made
to the normal connection establishment codepath. Examples of things
that it currently gets for free currently are TLS support and
keepalive settings.

This patch also includes a test for this new API (and also the already
existing cancellation APIs). The test can be easily run like this:

cd src/test/modules/libpq_pipeline
make && ./libpq_pipeline cancel

NOTE: I have not tested this with GSS for the moment. My expectation is
that using this new API with a GSS connection will result in a
CONNECTION_BAD status when calling PQcancelStatus. The reason for this
is that GSS reads will also need to communicate back that an EOF was
found, just like I've done for TLS reads and unencrypted reads. Since in
case of a cancel connection an EOF is actually expected, and should not
be treated as an error.

Attachments:

0001-Add-non-blocking-version-of-PQcancel.patchapplication/octet-stream; name=0001-Add-non-blocking-version-of-PQcancel.patchDownload+348-10
#2Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema (#1)
Re: Add non-blocking version of PQcancel

Hi,

On 2022-01-12 15:22:18 +0000, Jelte Fennema wrote:

This patch also includes a test for this new API (and also the already
existing cancellation APIs). The test can be easily run like this:

cd src/test/modules/libpq_pipeline
make && ./libpq_pipeline cancel

Right now tests fails to build on windows with:

[15:45:10.518] src/interfaces/libpq/libpqdll.def : fatal error LNK1121: duplicate ordinal number '189' [c:\cirrus\libpq.vcxproj]
on fails tests on other platforms. See
https://cirrus-ci.com/build/4791821363576832

NOTE: I have not tested this with GSS for the moment. My expectation is
that using this new API with a GSS connection will result in a
CONNECTION_BAD status when calling PQcancelStatus. The reason for this
is that GSS reads will also need to communicate back that an EOF was
found, just like I've done for TLS reads and unencrypted reads. Since in
case of a cancel connection an EOF is actually expected, and should not
be treated as an error.

The failures do not seem related to this.

Greetings,

Andres Freund

#3Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Andres Freund (#2)
Re: Add non-blocking version of PQcancel

Attached is an updated patch which I believe fixes windows and the other test failures.
At least on my machine make check-world passes now when compiled with --enable-tap-tests

I also included a second patch which adds some basic documentation for the libpq tests.

Attachments:

0001-Add-non-blocking-version-of-PQcancel.patchapplication/octet-stream; name=0001-Add-non-blocking-version-of-PQcancel.patchDownload+347-9
0002-Add-documentation-for-libpq_pipeline-tests.patchapplication/octet-stream; name=0002-Add-documentation-for-libpq_pipeline-tests.patchDownload+20-1
#4Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jelte Fennema (#3)
Re: Add non-blocking version of PQcancel

On Thu, 2022-01-13 at 14:51 +0000, Jelte Fennema wrote:

Attached is an updated patch which I believe fixes windows and the other test failures.
At least on my machine make check-world passes now when compiled with --enable-tap-tests

I also included a second patch which adds some basic documentation for the libpq tests.

This is not a full review by any means, but here are my thoughts so
far:

NOTE: I have not tested this with GSS for the moment. My expectation is
that using this new API with a GSS connection will result in a
CONNECTION_BAD status when calling PQcancelStatus. The reason for this
is that GSS reads will also need to communicate back that an EOF was
found, just like I've done for TLS reads and unencrypted reads.

For what it's worth, I did a smoke test with a Kerberos environment via

./libpq_pipeline cancel '... gssencmode=require'

and the tests claim to pass.

2. Cancel connections benefit automatically from any improvements made
to the normal connection establishment codepath. Examples of things
that it currently gets for free currently are TLS support and
keepalive settings.

This seems like a big change compared to PQcancel(); one that's not
really hinted at elsewhere. Having the async version of an API open up
a completely different code path with new features is pretty surprising
to me.

And does the backend actually handle cancel requests via TLS (or GSS)?
It didn't look that way from a quick scan, but I may have missed
something.

@@ -1555,6 +1665,7 @@ print_test_list(void)
printf("singlerow\n");
printf("transaction\n");
printf("uniqviol\n");
+ printf("cancel\n");
}

This should probably go near the top; it looks like the existing list
is alphabetized.

The new cancel tests don't print any feedback. It'd be nice to get the
same sort of output as the other tests.

/* issue a cancel request */
extern int PQcancel(PGcancel *cancel, char *errbuf, int errbufsize);
+extern PGcancelConn * PQcancelConnectStart(PGconn *conn);
+extern PGcancelConn * PQcancelConnect(PGconn *conn);
+extern PostgresPollingStatusType PQcancelConnectPoll(PGcancelConn * cancelConn);
+extern ConnStatusType PQcancelStatus(const PGcancelConn * cancelConn);
+extern int PQcancelSocket(const PGcancelConn * cancelConn);
+extern char *PQcancelErrorMessage(const PGcancelConn * cancelConn);
+extern void PQcancelFinish(PGcancelConn * cancelConn);

That's a lot of new entry points, most of which don't do anything
except call their twin after a pointer cast. How painful would it be to
just use the existing APIs as-is, and error out when calling
unsupported functions if conn->cancelRequest is true?

--Jacob

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jacob Champion (#4)
Re: Add non-blocking version of PQcancel

Jacob Champion <pchampion@vmware.com> writes:

On Thu, 2022-01-13 at 14:51 +0000, Jelte Fennema wrote:

2. Cancel connections benefit automatically from any improvements made
to the normal connection establishment codepath. Examples of things
that it currently gets for free currently are TLS support and
keepalive settings.

This seems like a big change compared to PQcancel(); one that's not
really hinted at elsewhere. Having the async version of an API open up
a completely different code path with new features is pretty surprising
to me.

Well, the patch lacks any user-facing doco at all, so a-fortiori this
point is not covered. I trust the plan was to write docs later.

I kind of feel that this patch is going in the wrong direction.
I do see the need for a version of PQcancel that can encrypt the
transmitted cancel request (and yes, that should work on the backend
side; see recursion in ProcessStartupPacket). I have not seen
requests for a non-blocking version, and this doesn't surprise me.
I feel that the whole non-blocking aspect of libpq probably belongs
to another era when people didn't trust threads.

So what I'd do is make a version that just takes a PGconn, sends the
cancel request, and returns success or failure; never mind the
non-blocking aspect. One possible long-run advantage of this is that
it might be possible to "sync" the cancel request so that we know,
or at least can find out afterwards, exactly which query got
cancelled; something that's fundamentally impossible if the cancel
function works from a clone data structure that is disconnected
from the current connection state.

(Note that it probably makes sense to make a clone PGconn to pass
to fe-connect.c, internally to this function. I just don't want
to expose that to the app.)

regards, tom lane

#6Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: Add non-blocking version of PQcancel

Hi,

On 2022-03-24 17:41:53 -0400, Tom Lane wrote:

I kind of feel that this patch is going in the wrong direction.
I do see the need for a version of PQcancel that can encrypt the
transmitted cancel request (and yes, that should work on the backend
side; see recursion in ProcessStartupPacket). I have not seen
requests for a non-blocking version, and this doesn't surprise me.
I feel that the whole non-blocking aspect of libpq probably belongs
to another era when people didn't trust threads.

That's not a whole lot of fun if you think of cases like postgres_fdw (or
citus as in Jelte's case), which run inside the backend. Even with just a
single postgres_fdw, we don't really want to end up in an uninterruptible
PQcancel() that doesn't even react to pg_terminate_backend().

Even if using threads weren't an issue, I don't really buy the premise - most
networking code has moved *away* from using dedicated threads for each
connection. It just doesn't scale.

Leaving PQcancel aside, we use the non-blocking libpq stuff widely
ourselves. I think walreceiver, isolationtester, pgbench etc would be *much*
harder to get working equally well if there was just blocking calls. If
anything, we're getting to the point where purely blocking functionality
shouldn't be added anymore.

Greetings,

Andres Freund

#7Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#6)
Re: Add non-blocking version of PQcancel

On Thu, Mar 24, 2022 at 6:49 PM Andres Freund <andres@anarazel.de> wrote:

That's not a whole lot of fun if you think of cases like postgres_fdw (or
citus as in Jelte's case), which run inside the backend. Even with just a
single postgres_fdw, we don't really want to end up in an uninterruptible
PQcancel() that doesn't even react to pg_terminate_backend().

Even if using threads weren't an issue, I don't really buy the premise - most
networking code has moved *away* from using dedicated threads for each
connection. It just doesn't scale.

Leaving PQcancel aside, we use the non-blocking libpq stuff widely
ourselves. I think walreceiver, isolationtester, pgbench etc would be *much*
harder to get working equally well if there was just blocking calls. If
anything, we're getting to the point where purely blocking functionality
shouldn't be added anymore.

+1. I think having a non-blocking version of PQcancel() available is a
great idea, and I've wanted it myself. See commit
ae9bfc5d65123aaa0d1cca9988037489760bdeae.

That said, I don't think that this particular patch is going in the
right direction. I think Jacob's comment upthread is right on point:
"This seems like a big change compared to PQcancel(); one that's not
really hinted at elsewhere. Having the async version of an API open up
a completely different code path with new features is pretty
surprising to me." It seems to me that we want to end up with similar
code paths for PQcancel() and the non-blocking version of cancel. We
could get there in two ways. One way would be to implement the
non-blocking functionality in a manner that matches exactly what
PQcancel() does now. I imagine that the existing code from PQcancel()
would move, with some amount of change, into a new set of non-blocking
APIs. Perhaps PQcancel() would then be rewritten to use those new APIs
instead of hand-rolling the same logic. The other possible approach
would be to first change the blocking version of PQcancel() to use the
regular connection code instead of its own idiosyncratic logic, and
then as a second step, extend it with non-blocking interfaces that use
the regular non-blocking connection code. With either of these
approaches, we end up with the functionality working similarly in the
blocking and non-blocking code paths.

Leaving the question of approach aside, I think it's fairly clear that
this patch cannot be seriously considered for v15. One problem is the
lack of user-facing documentation, but there's a other stuff that just
doesn't look sufficiently well-considered. For example, it updates the
comment for pqsecure_read() to say "Returns -1 in case of failures,
except in the case of clean connection closure then it returns -2."
But that function calls any of three different implementation
functions depending on the situation and the patch only updates one of
them. And it updates that function to return -2 when the is
ECONNRESET, which seems to fly in the face of the comment's idea that
this is the "clean connection closure" case. I think it's probably a
bad sign that this function is tinkering with logic in this sort of
low-level function anyway. pqReadData() is a really general function
that manages to work with non-blocking I/O already, so why does
non-blocking query cancellation need to change its return values, or
whether or not it drops data in certain cases?

I'm also skeptical about the fact that we end up with a whole bunch of
new functions that are just wrappers around existing functions. That's
not a scalable approach. Every function that we have for a PGconn will
eventually need a variant that deals with a PGcancelConn. That seems
kind of pointless, especially considering that a PGcancelConn is
*exactly* a PGconn in disguise. If we decide to pursue the approach of
using the existing infrastructure for PGconn objects to handle query
cancellation, we ought to manipulate them using the same functions we
currently do, with some kind of mode or flag or switch or something
that you can use to turn a regular PGconn into something that cancels
a query. Maybe you create the PGconn and call
PQsprinkleMagicCancelDust() on it, and then you just proceed using the
existing functions, or something like that. Then, not only do the
existing functions not need query-cancel analogues, but any new
functions we add in the future don't either.

I'll set the target version for this patch to 16. I hope work continues.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: Add non-blocking version of PQcancel

Robert Haas <robertmhaas@gmail.com> writes:

That said, I don't think that this particular patch is going in the
right direction. I think Jacob's comment upthread is right on point:
"This seems like a big change compared to PQcancel(); one that's not
really hinted at elsewhere. Having the async version of an API open up
a completely different code path with new features is pretty
surprising to me." It seems to me that we want to end up with similar
code paths for PQcancel() and the non-blocking version of cancel. We
could get there in two ways. One way would be to implement the
non-blocking functionality in a manner that matches exactly what
PQcancel() does now. I imagine that the existing code from PQcancel()
would move, with some amount of change, into a new set of non-blocking
APIs. Perhaps PQcancel() would then be rewritten to use those new APIs
instead of hand-rolling the same logic. The other possible approach
would be to first change the blocking version of PQcancel() to use the
regular connection code instead of its own idiosyncratic logic, and
then as a second step, extend it with non-blocking interfaces that use
the regular non-blocking connection code. With either of these
approaches, we end up with the functionality working similarly in the
blocking and non-blocking code paths.

I think you misunderstand where the real pain point is. The reason
that PQcancel's functionality is so limited has little to do with
blocking vs non-blocking, and everything to do with the fact that
it's designed to be safe to call from a SIGINT handler. That makes
it quite impractical to invoke OpenSSL, and probably our GSS code
as well. If we want support for all connection-time options then
we have to make a new function that does not promise signal safety.

I'm prepared to yield on the question of whether we should provide
a non-blocking version, though I still say that (a) an easier-to-call,
one-step blocking alternative would be good too, and (b) it should
not be designed around the assumption that there's a completely
independent state object being used to perform the cancel. Even in
the non-blocking case, callers should only deal with the original
PGconn.

Leaving the question of approach aside, I think it's fairly clear that
this patch cannot be seriously considered for v15.

Yeah, I don't think it's anywhere near fully baked yet. On the other
hand, we do have a couple of weeks left.

regards, tom lane

#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: Add non-blocking version of PQcancel

On Fri, Mar 25, 2022 at 2:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you misunderstand where the real pain point is. The reason
that PQcancel's functionality is so limited has little to do with
blocking vs non-blocking, and everything to do with the fact that
it's designed to be safe to call from a SIGINT handler. That makes
it quite impractical to invoke OpenSSL, and probably our GSS code
as well. If we want support for all connection-time options then
we have to make a new function that does not promise signal safety.

Well, that's a fair point, but it's somewhat orthogonal to the one I'm
making, which is that a non-blocking version of function X might be
expected to share code or at least functionality with X itself. Having
something that is named in a way that implies asynchrony without other
differences but which is actually different in other important ways is
no good.

I'm prepared to yield on the question of whether we should provide
a non-blocking version, though I still say that (a) an easier-to-call,
one-step blocking alternative would be good too, and (b) it should
not be designed around the assumption that there's a completely
independent state object being used to perform the cancel. Even in
the non-blocking case, callers should only deal with the original
PGconn.

Well, this sounds like you're arguing for the first of the two
approaches I thought would be acceptable, rather than the second.

Leaving the question of approach aside, I think it's fairly clear that
this patch cannot be seriously considered for v15.

Yeah, I don't think it's anywhere near fully baked yet. On the other
hand, we do have a couple of weeks left.

We do?

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: Add non-blocking version of PQcancel

Robert Haas <robertmhaas@gmail.com> writes:

Well, that's a fair point, but it's somewhat orthogonal to the one I'm
making, which is that a non-blocking version of function X might be
expected to share code or at least functionality with X itself. Having
something that is named in a way that implies asynchrony without other
differences but which is actually different in other important ways is
no good.

Yeah. We need to choose a name for these new function(s) that is
sufficiently different from "PQcancel" that people won't expect them
to behave exactly the same as that does. I lack any good ideas about
that, how about you?

Yeah, I don't think it's anywhere near fully baked yet. On the other
hand, we do have a couple of weeks left.

We do?

Um, you did read the psql-release discussion about setting the feature
freeze deadline, no?

regards, tom lane

#11Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Tom Lane (#10)
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

Thanks for all the feedback everyone. I'll try to send a new patch
later this week that includes user facing docs and a simplified API.
For now a few responses:

Yeah. We need to choose a name for these new function(s) that is
sufficiently different from "PQcancel" that people won't expect them
to behave exactly the same as that does. I lack any good ideas about
that, how about you?

So I guess the names I proposed were not great, since everyone seems to be falling over them.
But I'd like to make my intention clear with the current naming. After this patch there would be
four different APIs for starting a cancelation:
1. PQrequestCancel: deprecated+old, not signal-safe function for requesting query cancellation, only uses a specific set of connection options
2. PQcancel: Cancel queries in a signal safe way, to be signal-safe it only uses a limited set of connection options
3. PQcancelConnect: Cancel queries in a non-signal safe way that uses all connection options
4. PQcancelConnectStart: Cancel queries in a non-signal safe and non-blocking way that uses all connection options

So the idea was that you should not look at PQcancelConnectStart as the non-blocking
version of PQcancel, but as the non-blocking version of PQcancelConnect. I'll try to
think of some different names too, but IMHO these names could be acceptable
when their differences are addressed sufficiently in the documentation.

One other approach to naming that comes to mind now is repurposing PQrequestCancel:
1. PQrequestCancel: Cancel queries in a non-signal safe way that uses all connection options
2. PQrequestCancelStart: Cancel queries in a non-signal safe and non-blocking way that uses all connection options
3. PQcancel: Cancel queries in a signal safe way, to be signal-safe it only uses a limited set of connection options

I think it's probably a
bad sign that this function is tinkering with logic in this sort of
low-level function anyway. pqReadData() is a really general function
that manages to work with non-blocking I/O already, so why does
non-blocking query cancellation need to change its return values, or
whether or not it drops data in certain cases?

The reason for this low level change is that the cancellation part of the
Postgres protocol is following a different, much more simplistic design
than all the other parts. The client does not expect a response message back
from the server after sending the cancellation request. The expectation
is that the server signals completion by closing the connection, i.e. sending EOF.
For all other parts of the protocol, connection termination should be initiated
client side by sending a Terminate message. So the server closing (sending
EOF) is always unexpected and is thus currently considered an error by pqReadData.

But since this is not the case for the cancellation protocol, the result is
changed to -2 in case of EOF to make it possible to distinguish between
an EOF and an actual error.

And it updates that function to return -2 when the is
ECONNRESET, which seems to fly in the face of the comment's idea that
this is the "clean connection closure" case.

The diff sadly does not include the very relevant comment right above these
lines. Pasting the whole case statement here to clear up this confusion:

case SSL_ERROR_ZERO_RETURN:

/*
* Per OpenSSL documentation, this error code is only returned for
* a clean connection closure, so we should not report it as a
* server crash.
*/
appendPQExpBufferStr(&conn->errorMessage,
libpq_gettext("SSL connection has been closed unexpectedly\n"));
result_errno = ECONNRESET;
n = -2;
break;

For example, it updates the
comment for pqsecure_read() to say "Returns -1 in case of failures,
except in the case of clean connection closure then it returns -2."
But that function calls any of three different implementation
functions depending on the situation and the patch only updates one of
them.

That comment is indeed not describing what is happening correctly and I'll
try to make it clearer. The main reason for it being incorrect is coming from
the fact that receiving EOFs is handled in different places based on the
encryption method:

1. Unencrypted TCP: EOF is not returned as an error by pqsecure_read, but detected by pqReadData (see comments related to definitelyEOF)
2. OpenSSL: EOF is returned as an error by pqsecure_read (see copied case statement above)
3. GSS: When writing the patch I was not sure how EOF handling worked here, but given that the tests passed for Jacob on GSS, I'm guessing it works the same as unencrypted TCP.

#12Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Jelte Fennema (#11)
Re: Add non-blocking version of PQcancel

I attached a new version of this patch. Which does three main things:
1. Change the PQrequestCancel implementation to use the regular
connection establishement code, to support all connection options
including encryption.
2. Add PQrequestCancelStart which is a thread-safe and non-blocking
version of this new PQrequestCancel implementation.
3. Add PQconnectComplete, which completes a connection started by
PQrequestCancelStart. This is useful if you want a thread-safe, but
blocking cancel (without having a need for signal safety).

This change un-deprecates PQrequestCancel, since now there's actually an
advantage to using it over PQcancel. It also includes user facing documentation
for all these functions.

As a API design change from the previous version, PQrequestCancelStart now
returns a regular PGconn for the cancel connection.

@Tom Lane regarding this:

Even in the non-blocking case, callers should only deal with the original PGconn.

This would by definition result in non-threadsafe code (afaict). So I refrained from doing this.
The blocking version doesn't expose a PGconn at all, but the non-blocking one now returns a new PGconn.

There's two more changes that I at least want to do before considering this patch mergable:
1. Go over all the functions that can be called with a PGconn, but should not be
called with a cancellation PGconn and error out or exit early.
2. Copy over the SockAddr from the original connection and always connect to
the same socket. I believe with the current code the cancellation could end up
at the wrong server if there are multiple hosts listed in the connection string.

And there's a third item that I would like to do as a bonus:
3. Actually use the non-blocking API for the postgres_fdw code to implement a
timeout. Which would allow this comment can be removed:
/*
* Issue cancel request. Unfortunately, there's no good way to limit the
* amount of time that we might block inside PQgetCancel().
*/

So a next version of this patch can be expected somewhere later this week.
But any feedback on the current version would be appreciated. Because
these 3 changes won't change the overall design much.

Jelte

Attachments:

0001-Add-documentation-for-libpq_pipeline-tests.patchapplication/octet-stream; name=0001-Add-documentation-for-libpq_pipeline-tests.patchDownload+20-1
0002-Add-non-blocking-version-of-PQcancel.patchapplication/octet-stream; name=0002-Add-non-blocking-version-of-PQcancel.patchDownload+627-127
#13Justin Pryzby
pryzby@telsasoft.com
In reply to: Jelte Fennema (#12)
Re: Add non-blocking version of PQcancel

Note that the patch is still variously failing in cirrus.
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3511

You may already know that it's possible to trigger the cirrus ci tasks using a
github branch. See src/tools/ci/README.

#14Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Justin Pryzby (#13)
Re: Add non-blocking version of PQcancel

Attached is the latest version of this patch, which I think is now in a state
in which it could be merged. The changes are:

1. Don't do host and address discovery for cancel connections. It now
reuses raddr and whichhost from the original connection. This makes
sure the cancel always goes to the right server, even when DNS records
change or another server would be chosen now in case of connnection
strings containing multiple hosts.
2. Fix the windows CI failure. This is done by both using the threadsafe code
in the the dblink cancellation code, and also by not erroring a cancellation
connection on windows in case of any errors. This last one is to work around
the issue described in this thread:
/messages/by-id/90b34057-4176-7bb0-0dbb-9822a5f6425b@greiz-reinsdorf.de

I also went over most of the functions that take a PGconn, to see if they needed
extra checks to guard against being executed on cancel. So far all seemed fine,
either they should be okay to execute against a cancellation connection, or
they failed already anyway because a cancellation connection never reaches
the CONNECTION_OK state. So I didn't add any checks specifically for cancel
connections. I'll do this again next week with a fresh head, to see if I haven't
missed any cases.

I'll try to find some time early next week to implement non-blocking cancellation
usage in postgres_fdw, i.e. the bonus task I mentioned in my previous email. But
I don't think it's necessary to have that implemented before merging.

Attachments:

0002-Add-non-blocking-version-of-PQcancel.patchapplication/octet-stream; name=0002-Add-non-blocking-version-of-PQcancel.patchDownload+743-151
#15Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Jelte Fennema (#14)
Re: Add non-blocking version of PQcancel

Hereby what I consider the final version of this patch. I don't have any
changes planned myself (except for ones that come up during review).
Things that changed since the previous iteration:
1. postgres_fdw now uses the non-blocking cancellation API (including test).
2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

Attachments:

0002-Add-non-blocking-version-of-PQcancel.patchapplication/octet-stream; name=0002-Add-non-blocking-version-of-PQcancel.patchDownload+849-154
#16Justin Pryzby
pryzby@telsasoft.com
In reply to: Jelte Fennema (#15)
Re: Add non-blocking version of PQcancel

Resending with a problematic email removed from CC...

On Mon, Apr 04, 2022 at 03:21:54PM +0000, Jelte Fennema wrote:

2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

Apparently there's still an occasional issue.
https://cirrus-ci.com/task/6613309985128448

result 232/352 (error): ERROR: duplicate key value violates unique constraint "ppln_uniqviol_pkey"
DETAIL: Key (id)=(116) already exists.

This shows that the issue is pretty rare:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/38/3511

--
Justin

#17Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Andres Freund (#6)
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

(resent because it was blocked from the mailing-list due to inclusion of a blocked email address in the To line)

From: Andres Freund <andres@anarazel.de>

On 2022-04-04 15:21:54 +0000, Jelte Fennema wrote:

2. Added some extra sleeps to the cancellation test, to remove random
failures on FreeBSD.

That's extremely extremely rarely the solution to address test reliability
issues. It'll fail when running test under valgrind etc.

Why do you need sleeps / can you find another way to make the test reliable?

The problem they are solving is racy behaviour between sending the query
and sending the cancellation. If the cancellation is handled before the query
is started, then the query doesn't get cancelled. To solve this problem I used
the sleeps to wait a bit before sending the cancelation request.

When I wrote this, I couldn't think of a better way to do it then with sleeps.
But I didn't like it either (and I still don't). These emails made me start to think
again, about other ways of solving the problem. I think I've found another
solution (see attached patch). The way I solve it now is by using another
connection to check the state of the first one.

Jelte

Attachments:

0001-Add-documentation-for-libpq_pipeline-tests.patchapplication/octet-stream; name=0001-Add-documentation-for-libpq_pipeline-tests.patchDownload+20-1
0002-Add-non-blocking-version-of-PQcancel.patchapplication/octet-stream; name=0002-Add-non-blocking-version-of-PQcancel.patchDownload+894-154
#18Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#16)
Re: Add non-blocking version of PQcancel

On Fri, Jun 24, 2022 at 07:36:16PM -0500, Justin Pryzby wrote:

Resending with a problematic email removed from CC...

On Mon, Apr 04, 2022 at 03:21:54PM +0000, Jelte Fennema wrote:

2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

Apparently there's still an occasional issue.
https://cirrus-ci.com/task/6613309985128448

I think that failure is actually not related to this patch.

There are probably others, but I noticed because it also affected one of my
patches, which changes nothing relevant.
https://cirrus-ci.com/task/5904044051922944

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Justin Pryzby (#18)
Re: Add non-blocking version of PQcancel

On 2022-Jun-27, Justin Pryzby wrote:

On Fri, Jun 24, 2022 at 07:36:16PM -0500, Justin Pryzby wrote:

Apparently there's still an occasional issue.
https://cirrus-ci.com/task/6613309985128448

I think that failure is actually not related to this patch.

Yeah, it's not -- Kyotaro diagnosed it as a problem in libpq's pipeline
mode. I hope to push his fix soon, but there are nearby problems that I
haven't been able to track down a good fix for. I'm looking into the
whole.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema (#17)
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

Jelte Fennema <Jelte.Fennema@microsoft.com> writes:

[ non-blocking PQcancel ]

I pushed the 0001 patch (libpq_pipeline documentation) with a bit
of further wordsmithing.

As for 0002, I'm not sure that's anywhere near ready. I doubt it's
a great idea to un-deprecate PQrequestCancel with a major change
in its behavior. If there is anybody out there still using it,
they're not likely to appreciate that. Let's leave that alone and
pick some other name.

I'm also finding the entire design of PQrequestCancelStart etc to
be horribly confusing --- it's not *bad* necessarily, but the chosen
function names are seriously misleading. PQrequestCancelStart doesn't
actually "start" anything, so the apparent parallel with PQconnectStart
is just wrong. It's also fairly unclear what the state of a cancel
PQconn is after the request cycle is completed, and whether you can
re-use it (especially after a failed request), and whether you have
to dispose of it separately.

On the whole it feels like a mistake to have two separate kinds of
PGconn with fundamentally different behaviors and yet no distinction
in the API. I think I'd recommend having a separate struct type
(which might internally contain little more than a pointer to a
cloned PGconn), and provide only a limited set of operations on it.
Seems like create, start/continue cancel request, destroy, and
fetch error message ought to be enough. I don't see a reason why we
need to support all of libpq's inquiry operations on such objects ---
for instance, if you want to know which host is involved, you could
perfectly well query the parent PGconn. Nor do I want to run around
and add code to every single libpq entry point to make it reject cancel
PGconns if it can't support them, but we'd have to do so if there's
just one struct type.

I'm not seeing the use-case for PQconnectComplete. If you want
a non-blocking cancel request, why would you then use a blocking
operation to complete the request? Seems like it'd be better
to have just a monolithic cancel function for those who don't
need non-blocking.

This change:

--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -59,12 +59,15 @@ typedef enum
 {
 	CONNECTION_OK,
 	CONNECTION_BAD,
+	CONNECTION_CANCEL_FINISHED,
 	/* Non-blocking mode only below here */

is an absolute non-starter: it breaks ABI for every libpq client,
even ones that aren't using this facility. Why do we need a new
ConnStatusType value anyway? Seems like PostgresPollingStatusType
covers what we need: once you reach PGRES_POLLING_OK, the cancel
request is done.

The test case is still not very bulletproof on slow machines,
as it seems to be assuming that 30 seconds == forever. It
would be all right to use $PostgreSQL::Test::Utils::timeout_default,
but I'm not sure that that's easily retrievable by C code.
Maybe make the TAP test pass it in with another optional switch
to libpq_pipeline? Alternatively, we could teach libpq_pipeline
to do getenv("PG_TEST_TIMEOUT_DEFAULT") with a fallback to 180,
but that feels like it might be overly familiar with the innards
of Utils.pm.

regards, tom lane

#21Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Tom Lane (#20)
#22Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jelte Fennema (#21)
#23Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Jacob Champion (#22)
#24Daniel Gustafsson
daniel@yesql.se
In reply to: Jelte Fennema (#23)
#25Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Daniel Gustafsson (#24)
#26Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema (#25)
#27Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#26)
#28Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#27)
#29Bruce Momjian
bruce@momjian.us
In reply to: Jelte Fennema-Nio (#28)
#30Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: Bruce Momjian (#29)
#31Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Gregory Stark (as CFM) (#30)
#32Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: Jelte Fennema-Nio (#31)
#33Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Gregory Stark (as CFM) (#32)
#34Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#33)
#35Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: Jelte Fennema-Nio (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (as CFM) (#35)
#37Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#36)
#38Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Bruce Momjian (#37)
#39Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#38)
#40Denis Laxalde
denis.laxalde@dalibo.com
In reply to: Jelte Fennema-Nio (#39)
#41Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Denis Laxalde (#40)
#42Denis Laxalde
denis.laxalde@dalibo.com
In reply to: Jelte Fennema-Nio (#41)
#43Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Denis Laxalde (#42)
#44Denis Laxalde
denis.laxalde@dalibo.com
In reply to: Jelte Fennema-Nio (#43)
#45Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Denis Laxalde (#44)
#46Denis Laxalde
denis.laxalde@dalibo.com
In reply to: Jelte Fennema-Nio (#45)
#47Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Denis Laxalde (#46)
#48Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#47)
#49Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#48)
#50Thomas Munro
thomas.munro@gmail.com
In reply to: Jelte Fennema-Nio (#49)
#51Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Thomas Munro (#50)
#52Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#51)
#53vignesh C
vignesh21@gmail.com
In reply to: Jelte Fennema-Nio (#52)
#54Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: vignesh C (#53)
#55Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#54)
#56Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#55)
#57Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#56)
#58Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#57)
#59vignesh C
vignesh21@gmail.com
In reply to: Jelte Fennema-Nio (#56)
#60Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: vignesh C (#59)
#61Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#60)
#62Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#61)
#63Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#62)
#64Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#63)
#65Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#64)
#66Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#64)
#67Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#66)
#68Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#67)
#69Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#68)
#70Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#69)
#71Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#70)
#72Denis Laxalde
denis.laxalde@dalibo.com
In reply to: Jelte Fennema-Nio (#71)
#73Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Denis Laxalde (#72)
#74Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Denis Laxalde (#72)
#75Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#74)
#76Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#74)
#77Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#76)
#78Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#77)
#79Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#78)
#80Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#77)
#81Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#80)
#82Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#80)
#83Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#81)
#84Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#83)
#85Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#84)
#86Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#85)
#87Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#86)
#88Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#87)
#89Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Alvaro Herrera (#88)
#90Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jacob Champion (#89)
#91Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#90)
#92Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#91)
#93Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#92)
#94Noah Misch
noah@leadboat.com
In reply to: Alvaro Herrera (#93)
#95Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Noah Misch (#94)
#96Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#95)
#97Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#96)
#98Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#97)
#99Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#96)
#100Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#98)
#101Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#100)
#102Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#101)
#103Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#102)
#104Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#103)
#105Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jelte Fennema-Nio (#104)
#106Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#105)
#107Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#106)
#108Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#107)
#109Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#108)
#110Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#109)
#111Noah Misch
noah@leadboat.com
In reply to: Jelte Fennema-Nio (#110)
#112Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#100)
#113Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#112)
#114Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alvaro Herrera (#113)
#115Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#101)
#116Alexander Lakhin
exclusion@gmail.com
In reply to: Noah Misch (#111)
#117Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alexander Lakhin (#116)
#118Alexander Lakhin
exclusion@gmail.com
In reply to: Jelte Fennema-Nio (#117)
#119Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alexander Lakhin (#118)
#120Noah Misch
noah@leadboat.com
In reply to: Alvaro Herrera (#82)
#121Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Noah Misch (#120)
#122Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#121)
#123Alexander Lakhin
exclusion@gmail.com
In reply to: Jelte Fennema-Nio (#119)
#124Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Lakhin (#123)
#125Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#124)
#126Thomas Munro
thomas.munro@gmail.com
In reply to: Alvaro Herrera (#124)
#127Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#126)
#128Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#127)
#129Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#125)
#130Alexander Lakhin
exclusion@gmail.com
In reply to: Alvaro Herrera (#129)
#131Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#130)
#132Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#131)
#133Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#132)
#134Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#133)
#135Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#133)
#136Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#134)
#137Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alexander Lakhin (#136)
#138Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#136)
#139Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#138)
#140Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#139)
#141Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#140)
#142Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#141)
#143Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#142)
#144Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#143)
#145Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#144)
#146Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#145)