pgsql: psql: Add test for query canceling

Started by Peter Eisentrautalmost 5 years ago20 messagescomitters
Jump to latest
#1Peter Eisentraut
peter_e@gmx.net

psql: Add test for query canceling

Query canceling in psql was accidentally broken by
3a5130672296ed4e682403a77a9a3ad3d21cef75 (since reverted), so having
some test coverage for that seems useful.

Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: /messages/by-id/18c78a01-4a34-9dd4-f78b-6860f1420c8e@enterprisedb.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/5b3f471ff23a2773e6c1ee1704377581c987107d

Modified Files
--------------
src/bin/psql/t/020_cancel.pl | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)

#2Daniel Gustafsson
daniel@yesql.se
In reply to: Peter Eisentraut (#1)
Re: pgsql: psql: Add test for query canceling

On 20 Aug 2021, at 11:44, Peter Eisentraut <peter@eisentraut.org> wrote:

psql: Add test for query canceling

Seems sidewinder didn’t quite like this:

==~_~===-=-===~_~== pgsql.build/src/bin/psql/tmp_check/log/020_cancel_main.log ==~_~===-=-===~_~==
2021-08-20 12:52:29.992 CEST [9820:1] LOG: starting PostgreSQL 15devel on x86_64-unknown-netbsd7.0, compiled by gcc (nb2 20150115) 4.8.4, 64-bit
2021-08-20 12:52:29.992 CEST [9820:2] LOG: listening on Unix socket "/tmp/V5NA0EWHUl/.s.PGSQL.62222"
2021-08-20 12:52:29.994 CEST [9023:1] LOG: database system was shut down at 2021-08-20 12:52:29 CEST
2021-08-20 12:52:29.995 CEST [9820:3] LOG: database system is ready to accept connections
2021-08-20 12:52:30.093 CEST [656:1] [unknown] LOG: connection received: host=[local]
2021-08-20 12:52:30.093 CEST [656:2] [unknown] LOG: connection authorized: user=pgbf database=postgres application_name=020_cancel.pl
2021-08-20 12:52:30.098 CEST [656:3] 020_cancel.pl LOG: statement: select pg_sleep(3);
2021-08-20 12:52:31.121 CEST [9820:4] LOG: received immediate shutdown request
2021-08-20 12:52:31.121 CEST [656:4] 020_cancel.pl LOG: could not send data to client: Broken pipe
2021-08-20 12:52:31.121 CEST [656:5] 020_cancel.pl STATEMENT: select pg_sleep(3);
2021-08-20 12:52:31.125 CEST [9820:5] LOG: database system is shut down

Nothing obvious sticks out, could it be timing related?

https://brekka.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&amp;dt=2021-08-20%2010%3A45%3A24

--
Daniel Gustafsson https://vmware.com/

#3Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Daniel Gustafsson (#2)
Re: pgsql: psql: Add test for query canceling

psql: Add test for query canceling

Seems sidewinder didn’t quite like this:

Nothing obvious sticks out, could it be timing related?

Hmmm. Possibly, as the test is time dependent, hence the "SKIP:" section,
which should mean that the test should be ignore when it fails. It seems
that this part did not work. Maybe it should not state that it is
expecting 2 results?

use Test::More tests => 2;

Replace by:

use Test::More;

Or something is amiss with the "SKIP" section trick?

--
Fabien.

#4Daniel Gustafsson
daniel@yesql.se
In reply to: Fabien COELHO (#3)
Re: pgsql: psql: Add test for query canceling

On 20 Aug 2021, at 17:40, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

psql: Add test for query canceling

Seems sidewinder didn’t quite like this:

Nothing obvious sticks out, could it be timing related?

Hmmm. Possibly, as the test is time dependent, hence the "SKIP:" section, which should mean that the test should be ignore when it fails. It seems that this part did not work. Maybe it should not state that it is expecting 2 results?

use Test::More tests => 2;

Replace by:

use Test::More;

Or something is amiss with the "SKIP" section trick?

SKIP blocks only skip tests if the condition is met, in the test as it stands
now it skips it on Windows since it requires a Unix shell:

+SKIP: {
+   skip "cancel test requires a Unix shell", 2 if $windows_os;

On all other platforms the test will execute and count towards the overall test
result of the suite.

If we want the test to run but not fail the entire test suite if it fails then
it should use a TODO block instead, but that’s intended for tests known to fail
and this doesn’t seem to fall in that category.

--
Daniel Gustafsson https://vmware.com/

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#4)
Re: pgsql: psql: Add test for query canceling

Daniel Gustafsson <daniel@yesql.se> writes:

If we want the test to run but not fail the entire test suite if it fails then
it should use a TODO block instead, but that’s intended for tests known to fail
and this doesn’t seem to fall in that category.

That seems pretty useless. If we did break things in this area,
such a test would not help us notice.

The problem with the test seems blindingly obvious from here: it
is assuming first that psql will start fast enough to print its
PID within one second, and next that we'll be able to issue
the cancel (and have the backend react) in less than 2 seconds
more. This seems about guaranteed to fail on cache-clobber
animals, for example, but animals that are merely slow or overloaded
would have issues too.

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination. That gets rid of the first problem (though
you still want the loop to timeout eventually, it could wait up
to say 180 seconds, as we do elsewhere). Then the second problem
is easy to solve by making the pg_sleep delay twice as much.

regards, tom lane

#6Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#5)
Re: pgsql: psql: Add test for query canceling

On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

If we want the test to run but not fail the entire test suite if it fails then
it should use a TODO block instead, but that’s intended for tests known to fail
and this doesn’t seem to fall in that category.

That seems pretty useless. If we did break things in this area,
such a test would not help us notice.

For sure. I wasn’t advocating it, merely indicating that the SKIP block isn’t
working the way attributed to upthread.

The problem with the test seems blindingly obvious from here: it
is assuming first that psql will start fast enough to print its
PID within one second, and next that we'll be able to issue
the cancel (and have the backend react) in less than 2 seconds
more. This seems about guaranteed to fail on cache-clobber
animals, for example, but animals that are merely slow or overloaded
would have issues too.

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination. That gets rid of the first problem (though
you still want the loop to timeout eventually, it could wait up
to say 180 seconds, as we do elsewhere). Then the second problem
is easy to solve by making the pg_sleep delay twice as much.

This could perhaps be done with a PostgresNode::interactive_psql session? I
used that in a similar, but far from the same, test setup in the online
checksums patchset.

--
Daniel Gustafsson https://vmware.com/

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#6)
Re: pgsql: psql: Add test for query canceling

Daniel Gustafsson <daniel@yesql.se> writes:

On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination.

This could perhaps be done with a PostgresNode::interactive_psql
session?

Yeah, we do have that infrastructure available in the
010_tab_completion.pl test.

BTW, after looking more closely at the test, I think there's an
independent problem with the SIGALRM business. If I read the perl
script correctly, it's arming a SIGALRM timer in the Perl script's
process that is supposed to expire after the psql child process
has been launched. However, per the NetBSD man pages:

* exec() causes signal handlers to be reset to default actions in the
child process. The default action for SIGALRM is to terminate the
process.

* Active timer events are *inherited* by the child process. This seems
not very wise, but execve(2) is pretty definite about it [1]https://man.netbsd.org/NetBSD-9.2-STABLE/execve.2.

This means that on NetBSD and similar systems, what actually happens
when the SIGALRM timer times out is that the psql process is killed,
likely before the Perl script is able to signal it, but in any case
not too darn much later. So I do not think that this test is testing
what it means to test at all: there is no reason to believe that any
query cancel request ever gets sent, even without the timing issue.
Since the Perl script is only testing that psql returns nonzero exit
status, and not bothering to inquire into just which status it was,
it cannot tell "query was canceled" apart from "psql died horribly".
And the latter is what's actually happening.

regards, tom lane

[1]: https://man.netbsd.org/NetBSD-9.2-STABLE/execve.2

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: pgsql: psql: Add test for query canceling

I see that c818c25f4 worked around sidewinder's issue with
this test, but wrasse is still unhappy. I was able to
duplicate the problem on wrasse's host, so I ran it under
truss to see if I could see what's happening.

What the truss log shows is that the Perl script successfully
reads psql.pid and issues kill(SIGINT) to the correct process
... but very shortly thereafter it issues kill(SIGTERM) to the
same process, so of course psql goes belly-up and doesn't
complete the test.

I suppose this is some undocumented interaction between IPC::Run
and the SIGALRM signal handler. I stand by the recommendation
that you need to not use a signal handler here.

regards, tom lane

#9Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#7)
Re: pgsql: psql: Add test for query canceling

On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination.

This could perhaps be done with a PostgresNode::interactive_psql
session?

Yeah, we do have that infrastructure available in the
010_tab_completion.pl test.

In order to satisfy my own curiosity I decided to take a stab at this which
turned into the attached. I wasn’t able to reliably know that the pg_sleep
query had started except letting the timer expire, but that may very well be
the rust on my Perl skills showing.

Not sure if this moves the needle on this in either direction, but I figured I
should report back progress either way.

--
Daniel Gustafsson https://vmware.com/

Attachments:

psql_query_cancel_test.patchapplication/octet-stream; name=psql_query_cancel_test.patch; x-unix-mode=0644Download+36-31
#10Peter Eisentraut
peter_e@gmx.net
In reply to: Daniel Gustafsson (#9)
Re: pgsql: psql: Add test for query canceling

On 23.08.21 13:12, Daniel Gustafsson wrote:

On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination.

This could perhaps be done with a PostgresNode::interactive_psql
session?

Yeah, we do have that infrastructure available in the
010_tab_completion.pl test.

In order to satisfy my own curiosity I decided to take a stab at this which
turned into the attached. I wasn’t able to reliably know that the pg_sleep
query had started except letting the timer expire, but that may very well be
the rust on my Perl skills showing.

Thanks for looking into this. With your patch applied, tests 2, 3, and
4 fail. Do they pass for you?

#11Daniel Gustafsson
daniel@yesql.se
In reply to: Peter Eisentraut (#10)
Re: pgsql: psql: Add test for query canceling

On 24 Aug 2021, at 15:40, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

On 23.08.21 13:12, Daniel Gustafsson wrote:

On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination.

This could perhaps be done with a PostgresNode::interactive_psql
session?

Yeah, we do have that infrastructure available in the
010_tab_completion.pl test.

In order to satisfy my own curiosity I decided to take a stab at this which
turned into the attached. I wasn’t able to reliably know that the pg_sleep
query had started except letting the timer expire, but that may very well be
the rust on my Perl skills showing.

Thanks for looking into this. With your patch applied, tests 2, 3, and 4 fail. Do they pass for you?

Thats odd, all tests pass for me. Did the logs give an indication as to what
was failing?

--
Daniel Gustafsson https://vmware.com/

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#11)
Re: pgsql: psql: Add test for query canceling

Daniel Gustafsson <daniel@yesql.se> writes:

On 24 Aug 2021, at 15:40, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
Thanks for looking into this. With your patch applied, tests 2, 3, and 4 fail. Do they pass for you?

Thats odd, all tests pass for me. Did the logs give an indication as to what
was failing?

I also tried the patch. It passes for me on my Linux box, although
it takes noticeably longer than before which doesn't seem great.
On wrasse's host, it fails. The Perl log shows:

ok 1 - PPID timer expired
ok 2 - psql was signaled
not ok 3 - query was canceled

# Failed test 'query was canceled'
# at t/020_cancel.pl line 50.
# 'postgres=# SELECT pg_sleep(15);
# Cancel request sent
# '
# doesn't match '(?-xism:canceling statement due to user request)'
ok 4 - Sleep query timer expired

The postmaster log shows that the cancel was received and acted on:

2021-08-24 17:39:39.021 CEST [2178] LOG: database system is ready to accept connections
2021-08-24 17:39:39.270 CEST [2187] 020_cancel.pl LOG: statement: SELECT pg_sleep(15);
2021-08-24 17:39:42.407 CEST [2187] 020_cancel.pl ERROR: canceling statement due to user request
2021-08-24 17:39:42.407 CEST [2187] 020_cancel.pl STATEMENT: SELECT pg_sleep(15);
2021-08-24 17:39:42.426 CEST [2178] LOG: received fast shutdown request

so I'm not sure why the error message didn't make its way back to the
client. Lack of pumping maybe? It's notable that there's not a lot
of time elapsed between the cancel and the shutdown, according to the
postmaster log, which says that we didn't wait for a timeout there.

Anyway, I'm not terribly thrilled with the patch as-presented, because
what I was hoping for was a test free of timing assumptions. This seems
to be worse not better by that light.

regards, tom lane

#13Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#5)
Re: pgsql: psql: Add test for query canceling

On 20.08.21 20:47, Tom Lane wrote:

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination. That gets rid of the first problem (though
you still want the loop to timeout eventually, it could wait up
to say 180 seconds, as we do elsewhere). Then the second problem
is easy to solve by making the pg_sleep delay twice as much.

Here is a proposal. It waits separately for the pid file to appear and
also checks for the sleep query to be registered by the backend, so it
doesn't have any more dependencies on things happening "fast enough".
And it's also faster in the normal case now. Thoughts?

Attachments:

0001-Make-psql-cancel-test-more-timing-robust.patchtext/plain; charset=UTF-8; name=0001-Make-psql-cancel-test-more-timing-robust.patch; x-mac-creator=0; x-mac-type=0Download+23-9
#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#13)
Re: pgsql: psql: Add test for query canceling

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

Here is a proposal. It waits separately for the pid file to appear and
also checks for the sleep query to be registered by the backend, so it
doesn't have any more dependencies on things happening "fast enough".
And it's also faster in the normal case now. Thoughts?

That looks pretty solid to me, and I can confirm that it passes
on wrasse's host. The only nit I can find to pick is that this:

+ usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear');

basically assumes that psql.pid will be written atomically.
It'd be marginally safer to wait till psql.pid can be seen to
contain a newline. I don't think that would be too hard to do,
if you put the slurp_file call inside the wait loop and inspect
its result.

regards, tom lane

#15Daniel Gustafsson
daniel@yesql.se
In reply to: Peter Eisentraut (#13)
Re: pgsql: psql: Add test for query canceling

On 24 Aug 2021, at 18:43, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

On 20.08.21 20:47, Tom Lane wrote:

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination. That gets rid of the first problem (though
you still want the loop to timeout eventually, it could wait up
to say 180 seconds, as we do elsewhere). Then the second problem
is easy to solve by making the pg_sleep delay twice as much.

Here is a proposal. It waits separately for the pid file to appear and also checks for the sleep query to be registered by the backend, so it doesn't have any more dependencies on things happening "fast enough". And it's also faster in the normal case now. Thoughts?

From reading the patch, that looks like a really good combination of ideas for
this test. +1.

--
Daniel Gustafsson https://vmware.com/

#16Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#14)
Re: pgsql: psql: Add test for query canceling

That looks pretty solid to me, and I can confirm that it passes
on wrasse's host. The only nit I can find to pick is that this:

+ usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear');

basically assumes that psql.pid will be written atomically.

It'd be marginally safer to wait till psql.pid can be seen to
contain a newline. I don't think that would be too hard to do,
if you put the slurp_file call inside the wait loop and inspect
its result.

I finally came around to have a look at the patch.

For the issue raised above, I can see that the file could be created but
not yet written, but as the patch waits for the file to be non zero, ISTM
that the probability of a torn write of a single integer by echo is so
remote that we could let it as that?

Attached a version with the slurping in the loop anyway.

--
Fabien.

Attachments:

0001-Make-psql-cancel-test-more-timing-robust-2.patchtext/x-diff; name=0001-Make-psql-cancel-test-more-timing-robust-2.patchDownload+26-9
#17Daniel Gustafsson
daniel@yesql.se
In reply to: Fabien COELHO (#16)
Re: pgsql: psql: Add test for query canceling

On 25 Aug 2021, at 07:37, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

That looks pretty solid to me, and I can confirm that it passes
on wrasse's host. The only nit I can find to pick is that this:

+ usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear');

basically assumes that psql.pid will be written atomically.

It'd be marginally safer to wait till psql.pid can be seen to
contain a newline. I don't think that would be too hard to do,
if you put the slurp_file call inside the wait loop and inspect
its result.

I finally came around to have a look at the patch.

For the issue raised above, I can see that the file could be created but not yet written, but as the patch waits for the file to be non zero, ISTM that the probability of a torn write of a single integer by echo is so remote that we could let it as that?

Attached a version with the slurping in the loop anyway.

It might be that the risk of a torn write is close to negligible, but that just
means that it’s bound to happen, and pretty soon too =) Checking with slurping
in the loop seems like a good idea to me.

--
Daniel Gustafsson https://vmware.com/

#18Peter Eisentraut
peter_e@gmx.net
In reply to: Fabien COELHO (#16)
Re: pgsql: psql: Add test for query canceling

On 25.08.21 07:37, Fabien COELHO wrote:

I finally came around to have a look at the patch.

For the issue raised above, I can see that the file could be created but
not yet written, but as the patch waits for the file to be non zero,
ISTM that the probability of a torn write of a single integer by echo is
so remote that we could let it as that?

Attached a version with the slurping in the loop anyway.

It looks like this fixed it. Thanks all.

#19Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#18)
Re: pgsql: psql: Add test for query canceling

On Thu, Aug 26, 2021 at 08:39:31AM +0200, Peter Eisentraut wrote:

It looks like this fixed it. Thanks all.

conchuela has complained today on this test, as of HEAD:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&amp;dt=2021-08-30%2001%3A00%3A12
--
Michael

#20Daniel Gustafsson
daniel@yesql.se
In reply to: Michael Paquier (#19)
Re: pgsql: psql: Add test for query canceling

On 30 Aug 2021, at 06:27, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Aug 26, 2021 at 08:39:31AM +0200, Peter Eisentraut wrote:

It looks like this fixed it. Thanks all.

conchuela has complained today on this test, as of HEAD:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&amp;dt=2021-08-30%2001%3A00%3A12

# ''
# doesn't match '(?^:canceling statement due to user request)'

This smells like a timing issue again, which is unexpected given how it's coded.

--
Daniel Gustafsson https://vmware.com/