interrupted tap tests leave postgres instances around
Hi,
When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
postgres instances etc. That doesn't strike me as a good thing.
In contrast, the postgres instances started by pg_regress do terminate. I
assume this is because pg_regress starts postgres directly, whereas tap tests
largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
without a controlling terminal. Thus a ctrl-c won't be delivered to it.
ISTM we should at least install a SIGINT/TERM handler in Cluster.pm that does
the stuff we already do in END.
That still leaves us with some other potential processes around that won't
immediately exec, but it'd be much better already.
Greetings,
Andres Freund
On Thu, Sep 29, 2022 at 09:07:34PM -0700, Andres Freund wrote:
ISTM we should at least install a SIGINT/TERM handler in Cluster.pm that does
the stuff we already do in END.
Hmm, indeed. And here I thought that END was actually taking care of
that on an interrupt..
--
Michael
On 2022-Sep-30, Michael Paquier wrote:
On Thu, Sep 29, 2022 at 09:07:34PM -0700, Andres Freund wrote:
ISTM we should at least install a SIGINT/TERM handler in Cluster.pm that does
the stuff we already do in END.Hmm, indeed. And here I thought that END was actually taking care of
that on an interrupt..
Me too. But the perlmod manpage says
An "END" code block is executed as late as possible, that is, after perl has
finished running the program and just before the interpreter is being exited,
even if it is exiting as a result of a die() function. (But not if it's
morphing into another program via "exec", or being blown out of the water by a
signal--you have to trap that yourself (if you can).)
So clearly we need to fix it. I thought it should be as simple as the
attached, since exit() calls END. (Would it be better to die() instead
of exit()?)
But on testing, some nodes linger after being sent a shutdown signal.
I'm not clear why this is -- I think it's due to the fact that we send
the signal just as the node is starting up, which means the signal
doesn't reach the process. (I added the 0002 patch --not for commit--
to see which Clusters were being shut down and in the trace file I can
clearly see that the nodes that linger were definitely subject to
->teardown_node).
Another funny thing: C-C'ing one run, I got this lingering process:
alvherre 800868 98.2 0.0 12144 5052 pts/9 R 11:03 0:26 /pgsql/install/master/bin/psql -X -c BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32); -c SELECT pg_backup_stop() -d port=54380 host=/tmp/O_2PPNj9Fg dbname='postgres' replication=database
This is probably a bug in psql. Backtrace is:
#0 PQclear (res=<optimized out>) at /pgsql/source/master/src/interfaces/libpq/fe-exec.c:748
#1 PQclear (res=res@entry=0x55ad308c6190) at /pgsql/source/master/src/interfaces/libpq/fe-exec.c:718
#2 0x000055ad2f303323 in ClearOrSaveResult (result=0x55ad308c6190) at /pgsql/source/master/src/bin/psql/common.c:472
#3 ClearOrSaveAllResults () at /pgsql/source/master/src/bin/psql/common.c:488
#4 ExecQueryAndProcessResults (query=query@entry=0x55ad308bc7a0 "BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);",
elapsed_msec=elapsed_msec@entry=0x7fff9c9941d8, svpt_gone_p=svpt_gone_p@entry=0x7fff9c9941d7, is_watch=is_watch@entry=false,
opt=opt@entry=0x0, printQueryFout=printQueryFout@entry=0x0) at /pgsql/source/master/src/bin/psql/common.c:1608
#5 0x000055ad2f301b9d in SendQuery (query=0x55ad308bc7a0 "BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);")
at /pgsql/source/master/src/bin/psql/common.c:1172
#6 0x000055ad2f2f7bd9 in main (argc=<optimized out>, argv=<optimized out>) at /pgsql/source/master/src/bin/psql/startup.c:384
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep." (Robert Davidson)
http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
Attachments:
0001-tear-down-nodes-on-signal.patchtext/x-diff; charset=us-asciiDownload
From 5b01504da83e0a593a4adc24f77b22e7bfda9a0b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 30 Sep 2022 11:00:57 +0200
Subject: [PATCH 1/2] tear down nodes on signal
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 4fef9c12e6..4a64cb749b 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1539,7 +1539,6 @@ sub can_bind
# were created in) when the test script exits.
END
{
-
# take care not to change the script's exit value
my $exit_code = $?;
@@ -2922,6 +2921,14 @@ sub corrupt_page_checksum
return;
}
+#
+# Signal handlers
+#
+$SIG{TERM} = $SIG{INT} = sub
+{
+ exit 1;
+};
+
=pod
=back
--
2.30.2
0002-trace-cluster-termination.patchtext/x-diff; charset=us-asciiDownload
From 38d4ef4e81a5def4307edca2e2dc207a7559f53d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 30 Sep 2022 11:06:45 +0200
Subject: [PATCH 2/2] trace cluster termination
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 4a64cb749b..ae80d06b7d 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1251,6 +1251,7 @@ sub new
my $testname = basename($0);
$testname =~ s/\.[^.]+$//;
my $node = {
+ _testname => $testname,
_port => $port,
_host => $host,
_basedir =>
@@ -1542,6 +1543,15 @@ END
# take care not to change the script's exit value
my $exit_code = $?;
+ {
+ open my $TRACE, ">>", "/tmp/perl.trace";
+ foreach my $node (@all_nodes)
+ {
+ print $TRACE "shutting down $node->{_testname} / $node->{_name}\n";
+ }
+ close $TRACE;
+ }
+
foreach my $node (@all_nodes)
{
$node->teardown_node;
--
2.30.2
Hi,
On 2022-09-30 11:17:00 +0200, Alvaro Herrera wrote:
But on testing, some nodes linger after being sent a shutdown signal.
I'm not clear why this is -- I think it's due to the fact that we send
the signal just as the node is starting up, which means the signal
doesn't reach the process.
I suspect it's when a test gets interrupt while pg_ctl is starting the
backend. The start() routine only does _update_pid() after pg_ctl finished,
and terminate()->stop() returns before doing anything if pid isn't defined.
Perhaps the END{} routine should call $node->_update_pid(-1); if $exit_code !=
0 and _pid is undefined?
That does seem to reduce the incidence of "leftover" postgres
instances. 001_start_stop.pl leaves some behind, but that makes sense, because
it's bypassing the whole node management. But I still occasionally see some
remaining processes if I crank up test concurrency.
Ah! At least part of the problem is that sub stop() does BAIL_OUT, and of
course it can fail as part of the shutdown.
But there's still some that survive, where your perl.trace doesn't contain the
node getting shut down...
Greetings,
Andres Freund
On 30.09.22 06:07, Andres Freund wrote:
When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
postgres instances etc. That doesn't strike me as a good thing.In contrast, the postgres instances started by pg_regress do terminate. I
assume this is because pg_regress starts postgres directly, whereas tap tests
largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
without a controlling terminal. Thus a ctrl-c won't be delivered to it.
I ran into the problem recently that pg_upgrade starts the servers with
pg_ctl, and thus without terminal, and so you can't get any password
prompts for SSL keys, for example. Taking out the setsid() call in
pg_ctl.c fixed that. I suspect this is ultimately the same problem.
We could make TAP tests and pg_upgrade not use pg_ctl and start
postmaster directly. I'm not sure how much work that would be, but
seeing that pg_regress does it, it doesn't seem unreasonable.
Alternatively, perhaps we could make a mode for pg_ctl that it doesn't
call setsid(). This could be activated by an environment variable.
That might address all these problems, too.
Hi,
On 2022-10-04 10:24:19 +0200, Peter Eisentraut wrote:
On 30.09.22 06:07, Andres Freund wrote:
When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
postgres instances etc. That doesn't strike me as a good thing.In contrast, the postgres instances started by pg_regress do terminate. I
assume this is because pg_regress starts postgres directly, whereas tap tests
largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
without a controlling terminal. Thus a ctrl-c won't be delivered to it.I ran into the problem recently that pg_upgrade starts the servers with
pg_ctl, and thus without terminal, and so you can't get any password prompts
for SSL keys, for example.
For this specific case I wonder if pg_upgrade should disable ssl... That would
require fixing pg_upgrade to use a unix socket on windows, but that'd be a
good idea anyway.
Taking out the setsid() call in pg_ctl.c fixed that. I suspect this is
ultimately the same problem.
We could make TAP tests and pg_upgrade not use pg_ctl and start postmaster
directly. I'm not sure how much work that would be, but seeing that
pg_regress does it, it doesn't seem unreasonable.
It's not trivial, particularly from perl. Check all the stuff pg_regress and
pg_ctl do around windows accounts and tokens.
Alternatively, perhaps we could make a mode for pg_ctl that it doesn't call
setsid(). This could be activated by an environment variable. That might
address all these problems, too.
It looks like that won't help. Because pg_ctl exits after forking postgres,
postgres parent isn't the shell anymore...
Greetings,
Andres Freund
On 2022-Oct-01, Andres Freund wrote:
Perhaps the END{} routine should call $node->_update_pid(-1); if $exit_code !=
0 and _pid is undefined?
Yeah, that sounds reasonable.
That does seem to reduce the incidence of "leftover" postgres
instances. 001_start_stop.pl leaves some behind, but that makes sense, because
it's bypassing the whole node management. But I still occasionally see some
remaining processes if I crank up test concurrency.Ah! At least part of the problem is that sub stop() does BAIL_OUT, and of
course it can fail as part of the shutdown.
I made teardown_node pass down fail_ok=>1 to avoid this problem, so we
no longer BAIL_OUT in that case.
But there's still some that survive, where your perl.trace doesn't contain the
node getting shut down...
Yeah, something's still unexplained. I'll get this pushed soon, which
already reduces the number of leftover instances a good amount.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Attachments:
v2-0001-Better-handle-interrupting-TAP-tests.patchtext/x-diff; charset=us-asciiDownload
From b43b3071abfd7bb83f7ec5942095aa031c3ed038 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue, 18 Oct 2022 12:11:46 +0200
Subject: [PATCH v2] Better handle interrupting TAP tests
Set up a signal handler for INT/TERM so that we run our END block if we
get them. In END, if the exit status indicates a problem, call
_update_pid(-1) to improve chances of the stop working in case start()
hasn't returned yet.
Also, change END's teardown_node() so that it passes fail_ok=>1, so that
if a node fails to stop, we still stop the other nodes in the same test.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index d7c13318b0..17856d69c8 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1545,7 +1545,13 @@ END
foreach my $node (@all_nodes)
{
- $node->teardown_node;
+ # During unclean termination (which could be a signal or some
+ # other failure), we're not sure that the status of our nodes
+ # has been correctly set up already, so try and detect their
+ # status so that we can shut them down properly.
+ $node->_update_pid(-1) if $exit_code != 0;
+
+ $node->teardown_node(fail_ok => 1);
# skip clean if we are requested to retain the basedir
next if defined $ENV{'PG_TEST_NOCLEAN'};
@@ -1564,13 +1570,15 @@ END
Do an immediate stop of the node
+Any optional extra parameter is passed to ->stop.
+
=cut
sub teardown_node
{
- my $self = shift;
+ my ($self, %params) = @_;
- $self->stop('immediate');
+ $self->stop('immediate', %params);
return;
}
@@ -2922,6 +2930,13 @@ sub corrupt_page_checksum
return;
}
+#
+# Signal handlers
+#
+$SIG{TERM} = $SIG{INT} = sub {
+ die "death by signal";
+};
+
=pod
=back
--
2.30.2
Pushed this. It should improve things significantly.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"