Add non-blocking version of PQcancel

Started by Jelte Fennemaover 4 years ago146 messageshackers

Jelte.Fennema@microsoft.com

over 4 years ago

The existing PQcancel API is using blocking IO. This makes PQcancel
impossible to use in an event loop based codebase, without blocking the
event loop until the call returns.

This patch adds a new cancellation API to libpq which is called
PQcancelConnectionStart. This API can be used to send cancellations in a
non-blocking fashion. To do this it internally uses regular PGconn
connection establishment. This has as a downside that
PQcancelConnectionStart cannot be safely called from a signal handler.

Luckily, this should be fine for most usages of this API. Since most
code that's using an event loop handles signals in that event loop as
well (as opposed to calling functions from the signal handler directly).

There are also a few advantages of this approach:
1. No need to add and maintain a second non-blocking connection
establishment codepath.
2. Cancel connections benefit automatically from any improvements made
to the normal connection establishment codepath. Examples of things
that it currently gets for free currently are TLS support and
keepalive settings.

This patch also includes a test for this new API (and also the already
existing cancellation APIs). The test can be easily run like this:

cd src/test/modules/libpq_pipeline
make && ./libpq_pipeline cancel

NOTE: I have not tested this with GSS for the moment. My expectation is
that using this new API with a GSS connection will result in a
CONNECTION_BAD status when calling PQcancelStatus. The reason for this
is that GSS reads will also need to communicate back that an EOF was
found, just like I've done for TLS reads and unencrypted reads. Since in
case of a cancel connection an EOF is actually expected, and should not
be treated as an error.

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Jelte Fennema (#1)

Re: Add non-blocking version of PQcancel

Hi,

On 2022-01-12 15:22:18 +0000, Jelte Fennema wrote:

This patch also includes a test for this new API (and also the already
existing cancellation APIs). The test can be easily run like this:

cd src/test/modules/libpq_pipeline
make && ./libpq_pipeline cancel

Right now tests fails to build on windows with:

[15:45:10.518] src/interfaces/libpq/libpqdll.def : fatal error LNK1121: duplicate ordinal number '189' [c:\cirrus\libpq.vcxproj]
on fails tests on other platforms. See
https://cirrus-ci.com/build/4791821363576832

NOTE: I have not tested this with GSS for the moment. My expectation is
that using this new API with a GSS connection will result in a
CONNECTION_BAD status when calling PQcancelStatus. The reason for this
is that GSS reads will also need to communicate back that an EOF was
found, just like I've done for TLS reads and unencrypted reads. Since in
case of a cancel connection an EOF is actually expected, and should not
be treated as an error.

The failures do not seem related to this.

Greetings,

Andres Freund

Jelte Fennema

Jelte.Fennema@microsoft.com

over 4 years ago

In reply to: Andres Freund (#2)

Re: Add non-blocking version of PQcancel

Attached is an updated patch which I believe fixes windows and the other test failures.
At least on my machine make check-world passes now when compiled with --enable-tap-tests

I also included a second patch which adds some basic documentation for the libpq tests.

Jacob Champion

jacob.champion@enterprisedb.com

about 4 years ago

In reply to: Jelte Fennema (#3)

Re: Add non-blocking version of PQcancel

On Thu, 2022-01-13 at 14:51 +0000, Jelte Fennema wrote:

Attached is an updated patch which I believe fixes windows and the other test failures.
At least on my machine make check-world passes now when compiled with --enable-tap-tests

I also included a second patch which adds some basic documentation for the libpq tests.

This is not a full review by any means, but here are my thoughts so
far:

NOTE: I have not tested this with GSS for the moment. My expectation is
that using this new API with a GSS connection will result in a
CONNECTION_BAD status when calling PQcancelStatus. The reason for this
is that GSS reads will also need to communicate back that an EOF was
found, just like I've done for TLS reads and unencrypted reads.

For what it's worth, I did a smoke test with a Kerberos environment via

./libpq_pipeline cancel '... gssencmode=require'

and the tests claim to pass.

2. Cancel connections benefit automatically from any improvements made
to the normal connection establishment codepath. Examples of things
that it currently gets for free currently are TLS support and
keepalive settings.

This seems like a big change compared to PQcancel(); one that's not
really hinted at elsewhere. Having the async version of an API open up
a completely different code path with new features is pretty surprising
to me.

And does the backend actually handle cancel requests via TLS (or GSS)?
It didn't look that way from a quick scan, but I may have missed
something.

@@ -1555,6 +1665,7 @@ print_test_list(void)
printf("singlerow\n");
printf("transaction\n");
printf("uniqviol\n");
+ printf("cancel\n");
}

This should probably go near the top; it looks like the existing list
is alphabetized.

The new cancel tests don't print any feedback. It'd be nice to get the
same sort of output as the other tests.

/* issue a cancel request */
extern int PQcancel(PGcancel *cancel, char *errbuf, int errbufsize);
+extern PGcancelConn * PQcancelConnectStart(PGconn *conn);
+extern PGcancelConn * PQcancelConnect(PGconn *conn);
+extern PostgresPollingStatusType PQcancelConnectPoll(PGcancelConn * cancelConn);
+extern ConnStatusType PQcancelStatus(const PGcancelConn * cancelConn);
+extern int PQcancelSocket(const PGcancelConn * cancelConn);
+extern char *PQcancelErrorMessage(const PGcancelConn * cancelConn);
+extern void PQcancelFinish(PGcancelConn * cancelConn);

That's a lot of new entry points, most of which don't do anything
except call their twin after a pointer cast. How painful would it be to
just use the existing APIs as-is, and error out when calling
unsupported functions if conn->cancelRequest is true?

--Jacob

Tom Lane

tgl@sss.pgh.pa.us

about 4 years ago

In reply to: Jacob Champion (#4)

Re: Add non-blocking version of PQcancel

Jacob Champion <pchampion@vmware.com> writes:

On Thu, 2022-01-13 at 14:51 +0000, Jelte Fennema wrote:

2. Cancel connections benefit automatically from any improvements made
to the normal connection establishment codepath. Examples of things
that it currently gets for free currently are TLS support and
keepalive settings.

This seems like a big change compared to PQcancel(); one that's not
really hinted at elsewhere. Having the async version of an API open up
a completely different code path with new features is pretty surprising
to me.

Well, the patch lacks any user-facing doco at all, so a-fortiori this
point is not covered. I trust the plan was to write docs later.

I kind of feel that this patch is going in the wrong direction.
I do see the need for a version of PQcancel that can encrypt the
transmitted cancel request (and yes, that should work on the backend
side; see recursion in ProcessStartupPacket). I have not seen
requests for a non-blocking version, and this doesn't surprise me.
I feel that the whole non-blocking aspect of libpq probably belongs
to another era when people didn't trust threads.

So what I'd do is make a version that just takes a PGconn, sends the
cancel request, and returns success or failure; never mind the
non-blocking aspect. One possible long-run advantage of this is that
it might be possible to "sync" the cancel request so that we know,
or at least can find out afterwards, exactly which query got
cancelled; something that's fundamentally impossible if the cancel
function works from a clone data structure that is disconnected
from the current connection state.

(Note that it probably makes sense to make a clone PGconn to pass
to fe-connect.c, internally to this function. I just don't want
to expose that to the app.)

regards, tom lane

Andres Freund

andres@anarazel.de

about 4 years ago

In reply to: Tom Lane (#5)

Re: Add non-blocking version of PQcancel

Hi,

On 2022-03-24 17:41:53 -0400, Tom Lane wrote:

I kind of feel that this patch is going in the wrong direction.
I do see the need for a version of PQcancel that can encrypt the
transmitted cancel request (and yes, that should work on the backend
side; see recursion in ProcessStartupPacket). I have not seen
requests for a non-blocking version, and this doesn't surprise me.
I feel that the whole non-blocking aspect of libpq probably belongs
to another era when people didn't trust threads.

That's not a whole lot of fun if you think of cases like postgres_fdw (or
citus as in Jelte's case), which run inside the backend. Even with just a
single postgres_fdw, we don't really want to end up in an uninterruptible
PQcancel() that doesn't even react to pg_terminate_backend().

Even if using threads weren't an issue, I don't really buy the premise - most
networking code has moved *away* from using dedicated threads for each
connection. It just doesn't scale.

Leaving PQcancel aside, we use the non-blocking libpq stuff widely
ourselves. I think walreceiver, isolationtester, pgbench etc would be *much*
harder to get working equally well if there was just blocking calls. If
anything, we're getting to the point where purely blocking functionality
shouldn't be added anymore.

Greetings,

Andres Freund

Robert Haas

robertmhaas@gmail.com

about 4 years ago

In reply to: Andres Freund (#6)

Re: Add non-blocking version of PQcancel

On Thu, Mar 24, 2022 at 6:49 PM Andres Freund <andres@anarazel.de> wrote:

That's not a whole lot of fun if you think of cases like postgres_fdw (or
citus as in Jelte's case), which run inside the backend. Even with just a
single postgres_fdw, we don't really want to end up in an uninterruptible
PQcancel() that doesn't even react to pg_terminate_backend().

Even if using threads weren't an issue, I don't really buy the premise - most
networking code has moved *away* from using dedicated threads for each
connection. It just doesn't scale.

Leaving PQcancel aside, we use the non-blocking libpq stuff widely
ourselves. I think walreceiver, isolationtester, pgbench etc would be *much*
harder to get working equally well if there was just blocking calls. If
anything, we're getting to the point where purely blocking functionality
shouldn't be added anymore.

+1. I think having a non-blocking version of PQcancel() available is a
great idea, and I've wanted it myself. See commit
ae9bfc5d65123aaa0d1cca9988037489760bdeae.

That said, I don't think that this particular patch is going in the
right direction. I think Jacob's comment upthread is right on point:
"This seems like a big change compared to PQcancel(); one that's not
really hinted at elsewhere. Having the async version of an API open up
a completely different code path with new features is pretty
surprising to me." It seems to me that we want to end up with similar
code paths for PQcancel() and the non-blocking version of cancel. We
could get there in two ways. One way would be to implement the
non-blocking functionality in a manner that matches exactly what
PQcancel() does now. I imagine that the existing code from PQcancel()
would move, with some amount of change, into a new set of non-blocking
APIs. Perhaps PQcancel() would then be rewritten to use those new APIs
instead of hand-rolling the same logic. The other possible approach
would be to first change the blocking version of PQcancel() to use the
regular connection code instead of its own idiosyncratic logic, and
then as a second step, extend it with non-blocking interfaces that use
the regular non-blocking connection code. With either of these
approaches, we end up with the functionality working similarly in the
blocking and non-blocking code paths.

Leaving the question of approach aside, I think it's fairly clear that
this patch cannot be seriously considered for v15. One problem is the
lack of user-facing documentation, but there's a other stuff that just
doesn't look sufficiently well-considered. For example, it updates the
comment for pqsecure_read() to say "Returns -1 in case of failures,
except in the case of clean connection closure then it returns -2."
But that function calls any of three different implementation
functions depending on the situation and the patch only updates one of
them. And it updates that function to return -2 when the is
ECONNRESET, which seems to fly in the face of the comment's idea that
this is the "clean connection closure" case. I think it's probably a
bad sign that this function is tinkering with logic in this sort of
low-level function anyway. pqReadData() is a really general function
that manages to work with non-blocking I/O already, so why does
non-blocking query cancellation need to change its return values, or
whether or not it drops data in certain cases?

I'm also skeptical about the fact that we end up with a whole bunch of
new functions that are just wrappers around existing functions. That's
not a scalable approach. Every function that we have for a PGconn will
eventually need a variant that deals with a PGcancelConn. That seems
kind of pointless, especially considering that a PGcancelConn is
*exactly* a PGconn in disguise. If we decide to pursue the approach of
using the existing infrastructure for PGconn objects to handle query
cancellation, we ought to manipulate them using the same functions we
currently do, with some kind of mode or flag or switch or something
that you can use to turn a regular PGconn into something that cancels
a query. Maybe you create the PGconn and call
PQsprinkleMagicCancelDust() on it, and then you just proceed using the
existing functions, or something like that. Then, not only do the
existing functions not need query-cancel analogues, but any new
functions we add in the future don't either.

I'll set the target version for this patch to 16. I hope work continues.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

about 4 years ago

In reply to: Robert Haas (#7)

Re: Add non-blocking version of PQcancel

Robert Haas <robertmhaas@gmail.com> writes:

That said, I don't think that this particular patch is going in the
right direction. I think Jacob's comment upthread is right on point:
"This seems like a big change compared to PQcancel(); one that's not
really hinted at elsewhere. Having the async version of an API open up
a completely different code path with new features is pretty
surprising to me." It seems to me that we want to end up with similar
code paths for PQcancel() and the non-blocking version of cancel. We
could get there in two ways. One way would be to implement the
non-blocking functionality in a manner that matches exactly what
PQcancel() does now. I imagine that the existing code from PQcancel()
would move, with some amount of change, into a new set of non-blocking
APIs. Perhaps PQcancel() would then be rewritten to use those new APIs
instead of hand-rolling the same logic. The other possible approach
would be to first change the blocking version of PQcancel() to use the
regular connection code instead of its own idiosyncratic logic, and
then as a second step, extend it with non-blocking interfaces that use
the regular non-blocking connection code. With either of these
approaches, we end up with the functionality working similarly in the
blocking and non-blocking code paths.

I think you misunderstand where the real pain point is. The reason
that PQcancel's functionality is so limited has little to do with
blocking vs non-blocking, and everything to do with the fact that
it's designed to be safe to call from a SIGINT handler. That makes
it quite impractical to invoke OpenSSL, and probably our GSS code
as well. If we want support for all connection-time options then
we have to make a new function that does not promise signal safety.

I'm prepared to yield on the question of whether we should provide
a non-blocking version, though I still say that (a) an easier-to-call,
one-step blocking alternative would be good too, and (b) it should
not be designed around the assumption that there's a completely
independent state object being used to perform the cancel. Even in
the non-blocking case, callers should only deal with the original
PGconn.

Leaving the question of approach aside, I think it's fairly clear that
this patch cannot be seriously considered for v15.

Yeah, I don't think it's anywhere near fully baked yet. On the other
hand, we do have a couple of weeks left.

regards, tom lane

Robert Haas

robertmhaas@gmail.com

about 4 years ago

In reply to: Tom Lane (#8)

Re: Add non-blocking version of PQcancel

On Fri, Mar 25, 2022 at 2:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you misunderstand where the real pain point is. The reason
that PQcancel's functionality is so limited has little to do with
blocking vs non-blocking, and everything to do with the fact that
it's designed to be safe to call from a SIGINT handler. That makes
it quite impractical to invoke OpenSSL, and probably our GSS code
as well. If we want support for all connection-time options then
we have to make a new function that does not promise signal safety.

Well, that's a fair point, but it's somewhat orthogonal to the one I'm
making, which is that a non-blocking version of function X might be
expected to share code or at least functionality with X itself. Having
something that is named in a way that implies asynchrony without other
differences but which is actually different in other important ways is
no good.

I'm prepared to yield on the question of whether we should provide
a non-blocking version, though I still say that (a) an easier-to-call,
one-step blocking alternative would be good too, and (b) it should
not be designed around the assumption that there's a completely
independent state object being used to perform the cancel. Even in
the non-blocking case, callers should only deal with the original
PGconn.

Well, this sounds like you're arguing for the first of the two
approaches I thought would be acceptable, rather than the second.

Leaving the question of approach aside, I think it's fairly clear that
this patch cannot be seriously considered for v15.

Yeah, I don't think it's anywhere near fully baked yet. On the other
hand, we do have a couple of weeks left.

We do?

--
Robert Haas
EDB: http://www.enterprisedb.com

#10

Tom Lane

tgl@sss.pgh.pa.us

about 4 years ago

In reply to: Robert Haas (#9)

Re: Add non-blocking version of PQcancel

Robert Haas <robertmhaas@gmail.com> writes:

Well, that's a fair point, but it's somewhat orthogonal to the one I'm
making, which is that a non-blocking version of function X might be
expected to share code or at least functionality with X itself. Having
something that is named in a way that implies asynchrony without other
differences but which is actually different in other important ways is
no good.

Yeah. We need to choose a name for these new function(s) that is
sufficiently different from "PQcancel" that people won't expect them
to behave exactly the same as that does. I lack any good ideas about
that, how about you?

Yeah, I don't think it's anywhere near fully baked yet. On the other
hand, we do have a couple of weeks left.

We do?

Um, you did read the psql-release discussion about setting the feature
freeze deadline, no?

regards, tom lane

#11

Jelte Fennema

Jelte.Fennema@microsoft.com

about 4 years ago

In reply to: Tom Lane (#10)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

Thanks for all the feedback everyone. I'll try to send a new patch
later this week that includes user facing docs and a simplified API.
For now a few responses:

Yeah. We need to choose a name for these new function(s) that is
sufficiently different from "PQcancel" that people won't expect them
to behave exactly the same as that does. I lack any good ideas about
that, how about you?

So I guess the names I proposed were not great, since everyone seems to be falling over them.
But I'd like to make my intention clear with the current naming. After this patch there would be
four different APIs for starting a cancelation:
1. PQrequestCancel: deprecated+old, not signal-safe function for requesting query cancellation, only uses a specific set of connection options
2. PQcancel: Cancel queries in a signal safe way, to be signal-safe it only uses a limited set of connection options
3. PQcancelConnect: Cancel queries in a non-signal safe way that uses all connection options
4. PQcancelConnectStart: Cancel queries in a non-signal safe and non-blocking way that uses all connection options

So the idea was that you should not look at PQcancelConnectStart as the non-blocking
version of PQcancel, but as the non-blocking version of PQcancelConnect. I'll try to
think of some different names too, but IMHO these names could be acceptable
when their differences are addressed sufficiently in the documentation.

One other approach to naming that comes to mind now is repurposing PQrequestCancel:
1. PQrequestCancel: Cancel queries in a non-signal safe way that uses all connection options
2. PQrequestCancelStart: Cancel queries in a non-signal safe and non-blocking way that uses all connection options
3. PQcancel: Cancel queries in a signal safe way, to be signal-safe it only uses a limited set of connection options

I think it's probably a
bad sign that this function is tinkering with logic in this sort of
low-level function anyway. pqReadData() is a really general function
that manages to work with non-blocking I/O already, so why does
non-blocking query cancellation need to change its return values, or
whether or not it drops data in certain cases?

The reason for this low level change is that the cancellation part of the
Postgres protocol is following a different, much more simplistic design
than all the other parts. The client does not expect a response message back
from the server after sending the cancellation request. The expectation
is that the server signals completion by closing the connection, i.e. sending EOF.
For all other parts of the protocol, connection termination should be initiated
client side by sending a Terminate message. So the server closing (sending
EOF) is always unexpected and is thus currently considered an error by pqReadData.

But since this is not the case for the cancellation protocol, the result is
changed to -2 in case of EOF to make it possible to distinguish between
an EOF and an actual error.

And it updates that function to return -2 when the is
ECONNRESET, which seems to fly in the face of the comment's idea that
this is the "clean connection closure" case.

The diff sadly does not include the very relevant comment right above these
lines. Pasting the whole case statement here to clear up this confusion:

case SSL_ERROR_ZERO_RETURN:

/*
* Per OpenSSL documentation, this error code is only returned for
* a clean connection closure, so we should not report it as a
* server crash.
*/
appendPQExpBufferStr(&conn->errorMessage,
libpq_gettext("SSL connection has been closed unexpectedly\n"));
result_errno = ECONNRESET;
n = -2;
break;

For example, it updates the
comment for pqsecure_read() to say "Returns -1 in case of failures,
except in the case of clean connection closure then it returns -2."
But that function calls any of three different implementation
functions depending on the situation and the patch only updates one of
them.

That comment is indeed not describing what is happening correctly and I'll
try to make it clearer. The main reason for it being incorrect is coming from
the fact that receiving EOFs is handled in different places based on the
encryption method:

1. Unencrypted TCP: EOF is not returned as an error by pqsecure_read, but detected by pqReadData (see comments related to definitelyEOF)
2. OpenSSL: EOF is returned as an error by pqsecure_read (see copied case statement above)
3. GSS: When writing the patch I was not sure how EOF handling worked here, but given that the tests passed for Jacob on GSS, I'm guessing it works the same as unencrypted TCP.

#12

Jelte Fennema

Jelte.Fennema@microsoft.com

about 4 years ago

In reply to: Jelte Fennema (#11)

Re: Add non-blocking version of PQcancel

I attached a new version of this patch. Which does three main things:
1. Change the PQrequestCancel implementation to use the regular
connection establishement code, to support all connection options
including encryption.
2. Add PQrequestCancelStart which is a thread-safe and non-blocking
version of this new PQrequestCancel implementation.
3. Add PQconnectComplete, which completes a connection started by
PQrequestCancelStart. This is useful if you want a thread-safe, but
blocking cancel (without having a need for signal safety).

This change un-deprecates PQrequestCancel, since now there's actually an
advantage to using it over PQcancel. It also includes user facing documentation
for all these functions.

As a API design change from the previous version, PQrequestCancelStart now
returns a regular PGconn for the cancel connection.

@Tom Lane regarding this:

Even in the non-blocking case, callers should only deal with the original PGconn.

This would by definition result in non-threadsafe code (afaict). So I refrained from doing this.
The blocking version doesn't expose a PGconn at all, but the non-blocking one now returns a new PGconn.

There's two more changes that I at least want to do before considering this patch mergable:
1. Go over all the functions that can be called with a PGconn, but should not be
called with a cancellation PGconn and error out or exit early.
2. Copy over the SockAddr from the original connection and always connect to
the same socket. I believe with the current code the cancellation could end up
at the wrong server if there are multiple hosts listed in the connection string.

And there's a third item that I would like to do as a bonus:
3. Actually use the non-blocking API for the postgres_fdw code to implement a
timeout. Which would allow this comment can be removed:
/*
* Issue cancel request. Unfortunately, there's no good way to limit the
* amount of time that we might block inside PQgetCancel().
*/

So a next version of this patch can be expected somewhere later this week.
But any feedback on the current version would be appreciated. Because
these 3 changes won't change the overall design much.

Jelte

#13

Justin Pryzby

pryzby@telsasoft.com

about 4 years ago

In reply to: Jelte Fennema (#12)

Re: Add non-blocking version of PQcancel

Note that the patch is still variously failing in cirrus.
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3511

You may already know that it's possible to trigger the cirrus ci tasks using a
github branch. See src/tools/ci/README.

#14

Jelte Fennema

Jelte.Fennema@microsoft.com

about 4 years ago

In reply to: Justin Pryzby (#13)

Re: Add non-blocking version of PQcancel

Attached is the latest version of this patch, which I think is now in a state
in which it could be merged. The changes are:

1. Don't do host and address discovery for cancel connections. It now
reuses raddr and whichhost from the original connection. This makes
sure the cancel always goes to the right server, even when DNS records
change or another server would be chosen now in case of connnection
strings containing multiple hosts.
2. Fix the windows CI failure. This is done by both using the threadsafe code
in the the dblink cancellation code, and also by not erroring a cancellation
connection on windows in case of any errors. This last one is to work around
the issue described in this thread:
/messages/by-id/90b34057-4176-7bb0-0dbb-9822a5f6425b@greiz-reinsdorf.de

I also went over most of the functions that take a PGconn, to see if they needed
extra checks to guard against being executed on cancel. So far all seemed fine,
either they should be okay to execute against a cancellation connection, or
they failed already anyway because a cancellation connection never reaches
the CONNECTION_OK state. So I didn't add any checks specifically for cancel
connections. I'll do this again next week with a fresh head, to see if I haven't
missed any cases.

I'll try to find some time early next week to implement non-blocking cancellation
usage in postgres_fdw, i.e. the bonus task I mentioned in my previous email. But
I don't think it's necessary to have that implemented before merging.

#15

Jelte Fennema

Jelte.Fennema@microsoft.com

about 4 years ago

In reply to: Jelte Fennema (#14)

Re: Add non-blocking version of PQcancel

Hereby what I consider the final version of this patch. I don't have any
changes planned myself (except for ones that come up during review).
Things that changed since the previous iteration:
1. postgres_fdw now uses the non-blocking cancellation API (including test).
2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

#16

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Jelte Fennema (#15)

Re: Add non-blocking version of PQcancel

Resending with a problematic email removed from CC...

On Mon, Apr 04, 2022 at 03:21:54PM +0000, Jelte Fennema wrote:

2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

Apparently there's still an occasional issue.
https://cirrus-ci.com/task/6613309985128448

result 232/352 (error): ERROR: duplicate key value violates unique constraint "ppln_uniqviol_pkey"
DETAIL: Key (id)=(116) already exists.

This shows that the issue is pretty rare:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/38/3511

--
Justin

#17

Jelte Fennema

Jelte.Fennema@microsoft.com

almost 4 years ago

In reply to: Andres Freund (#6)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

(resent because it was blocked from the mailing-list due to inclusion of a blocked email address in the To line)

From: Andres Freund <andres@anarazel.de>

On 2022-04-04 15:21:54 +0000, Jelte Fennema wrote:

2. Added some extra sleeps to the cancellation test, to remove random
failures on FreeBSD.

That's extremely extremely rarely the solution to address test reliability
issues. It'll fail when running test under valgrind etc.

Why do you need sleeps / can you find another way to make the test reliable?

The problem they are solving is racy behaviour between sending the query
and sending the cancellation. If the cancellation is handled before the query
is started, then the query doesn't get cancelled. To solve this problem I used
the sleeps to wait a bit before sending the cancelation request.

When I wrote this, I couldn't think of a better way to do it then with sleeps.
But I didn't like it either (and I still don't). These emails made me start to think
again, about other ways of solving the problem. I think I've found another
solution (see attached patch). The way I solve it now is by using another
connection to check the state of the first one.

Jelte

Import Notes

Reply to msg id not found: 20220625013617.e4qvy6qlaukgvfpo@alap3.anarazel.de

#18

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Justin Pryzby (#16)

Re: Add non-blocking version of PQcancel

On Fri, Jun 24, 2022 at 07:36:16PM -0500, Justin Pryzby wrote:

Resending with a problematic email removed from CC...

On Mon, Apr 04, 2022 at 03:21:54PM +0000, Jelte Fennema wrote:

2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

Apparently there's still an occasional issue.
https://cirrus-ci.com/task/6613309985128448

I think that failure is actually not related to this patch.

There are probably others, but I noticed because it also affected one of my
patches, which changes nothing relevant.
https://cirrus-ci.com/task/5904044051922944

#19

Alvaro Herrera

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Justin Pryzby (#18)

Re: Add non-blocking version of PQcancel

On 2022-Jun-27, Justin Pryzby wrote:

On Fri, Jun 24, 2022 at 07:36:16PM -0500, Justin Pryzby wrote:

Apparently there's still an occasional issue.
https://cirrus-ci.com/task/6613309985128448

I think that failure is actually not related to this patch.

Yeah, it's not -- Kyotaro diagnosed it as a problem in libpq's pipeline
mode. I hope to push his fix soon, but there are nearby problems that I
haven't been able to track down a good fix for. I'm looking into the
whole.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#20

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Jelte Fennema (#17)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

Jelte Fennema <Jelte.Fennema@microsoft.com> writes:

[ non-blocking PQcancel ]

I pushed the 0001 patch (libpq_pipeline documentation) with a bit
of further wordsmithing.

As for 0002, I'm not sure that's anywhere near ready. I doubt it's
a great idea to un-deprecate PQrequestCancel with a major change
in its behavior. If there is anybody out there still using it,
they're not likely to appreciate that. Let's leave that alone and
pick some other name.

I'm also finding the entire design of PQrequestCancelStart etc to
be horribly confusing --- it's not *bad* necessarily, but the chosen
function names are seriously misleading. PQrequestCancelStart doesn't
actually "start" anything, so the apparent parallel with PQconnectStart
is just wrong. It's also fairly unclear what the state of a cancel
PQconn is after the request cycle is completed, and whether you can
re-use it (especially after a failed request), and whether you have
to dispose of it separately.

On the whole it feels like a mistake to have two separate kinds of
PGconn with fundamentally different behaviors and yet no distinction
in the API. I think I'd recommend having a separate struct type
(which might internally contain little more than a pointer to a
cloned PGconn), and provide only a limited set of operations on it.
Seems like create, start/continue cancel request, destroy, and
fetch error message ought to be enough. I don't see a reason why we
need to support all of libpq's inquiry operations on such objects ---
for instance, if you want to know which host is involved, you could
perfectly well query the parent PGconn. Nor do I want to run around
and add code to every single libpq entry point to make it reject cancel
PGconns if it can't support them, but we'd have to do so if there's
just one struct type.

I'm not seeing the use-case for PQconnectComplete. If you want
a non-blocking cancel request, why would you then use a blocking
operation to complete the request? Seems like it'd be better
to have just a monolithic cancel function for those who don't
need non-blocking.

This change:

--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -59,12 +59,15 @@ typedef enum
 {
 	CONNECTION_OK,
 	CONNECTION_BAD,
+	CONNECTION_CANCEL_FINISHED,
 	/* Non-blocking mode only below here */

is an absolute non-starter: it breaks ABI for every libpq client,
even ones that aren't using this facility. Why do we need a new
ConnStatusType value anyway? Seems like PostgresPollingStatusType
covers what we need: once you reach PGRES_POLLING_OK, the cancel
request is done.

The test case is still not very bulletproof on slow machines,
as it seems to be assuming that 30 seconds == forever. It
would be all right to use $PostgreSQL::Test::Utils::timeout_default,
but I'm not sure that that's easily retrievable by C code.
Maybe make the TAP test pass it in with another optional switch
to libpq_pipeline? Alternatively, we could teach libpq_pipeline
to do getenv("PG_TEST_TIMEOUT_DEFAULT") with a fallback to 180,
but that feels like it might be overly familiar with the innards
of Utils.pm.

regards, tom lane

#21

Jelte Fennema

Jelte.Fennema@microsoft.com

over 3 years ago

In reply to: Tom Lane (#20)

#22

Jacob Champion

jacob.champion@enterprisedb.com

over 3 years ago

In reply to: Jelte Fennema (#21)

#23