The real reason why TAP testing isn't ready for prime time

Started by Tom Laneover 10 years ago22 messages
#1Tom Lane
tgl@sss.pgh.pa.us

Buildfarm member hamster has failed a pretty significant fraction of
its recent runs in the BinInstallCheck step:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD

Since other critters aren't equally distressed, it seems likely that
this is just an out-of-disk-space type of problem. But maybe it's
trying to tell us that there's a genuine platform-specific bug there.
In any case, I challenge anyone to figure out what's happening from
the information available from the buildfarm logs.

I don't know whether this is just that the buildfarm script isn't
collecting data that it should be. But my experiences with the
TAP test scripts haven't been very positive. When they fail, it
takes a lot of digging to find out why. Basically, that entire
mechanism sucks as far as debuggability is concerned.

I think there is a good argument for turning this off in the buildfarm
until there is a better way of identifying and solving problems. It is
not helping us that hamster is red half the time for undiscoverable
reasons. That just conditions people to ignore it, and it may well be
masking real problems that the machine could be finding if it weren't
failing at this step.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Michael Paquier
michael.paquier@gmail.com
In reply to: Tom Lane (#1)
Re: The real reason why TAP testing isn't ready for prime time

On Mon, Jun 15, 2015 at 3:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Buildfarm member hamster has failed a pretty significant fraction of
its recent runs in the BinInstallCheck step:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&amp;br=HEAD

Since other critters aren't equally distressed, it seems likely that
this is just an out-of-disk-space type of problem. But maybe it's
trying to tell us that there's a genuine platform-specific bug there.
In any case, I challenge anyone to figure out what's happening from
the information available from the buildfarm logs.

I don't know whether this is just that the buildfarm script isn't
collecting data that it should be. But my experiences with the
TAP test scripts haven't been very positive. When they fail, it
takes a lot of digging to find out why. Basically, that entire
mechanism sucks as far as debuggability is concerned.

Indeed. I think that one step in the good direction would be to
replace all the calls to system and system_or_bail with a wrapper
routine that calls IPC::Run able to catch the logs and store those
logs in each test's base path. The same applies to pg_rewind tests.

I think there is a good argument for turning this off in the buildfarm
until there is a better way of identifying and solving problems. It is
not helping us that hamster is red half the time for undiscoverable
reasons. That just conditions people to ignore it, and it may well be
masking real problems that the machine could be finding if it weren't
failing at this step.

hamster is legendary slow and has a slow disc, hence it improves
chances of catching race conditions, and it is the only slow buildfarm
machine enabling the TAP tests (by comparison dangomushi has never
failed with the TAP tests) hence I would prefer thinking that the
problem is not specific to ArchLinux ARM. In this case the failure
seems to be related to the timing test servers stop and start even if
-w switch is used with pg_ctl, particularly that PGPORT is set to the
same value for all servers... Still, for the time being I don't mind
disabling them and just did so now. I will try to investigate further
on the machine itself.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#2)
1 attachment(s)
Re: The real reason why TAP testing isn't ready for prime time

On Mon, Jun 15, 2015 at 8:26 AM, Michael Paquier wrote:

hamster is legendary slow and has a slow disc, hence it improves
chances of catching race conditions, and it is the only slow buildfarm
machine enabling the TAP tests (by comparison dangomushi has never
failed with the TAP tests) hence I would prefer thinking that the
problem is not specific to ArchLinux ARM. In this case the failure
seems to be related to the timing test servers stop and start even if
-w switch is used with pg_ctl, particularly that PGPORT is set to the
same value for all servers... Still, for the time being I don't mind
disabling them and just did so now. I will try to investigate further
on the machine itself.

So, I have been doing some progress on this thing.. And wrote the
patch attached that improves the way TAP tests log their output using
IPC::Run::run. This simply creates a log file as regress_log/test_name
for each test run, capturing all the command outputs. For example on
current HEAD, some commands of pg_ctl are simply silenced or have
their output redirected to /dev/null. With this patch, all their
output is redirected to the log file. This is as well compatible with
Windows, and logging of pg_rewind tests is unified with the new
structure of TestLib.pm. I am going to apply that to hamster and see
what is causing the error reported.

I think that it would be useful as well to improve the buildfarm
output. Thoughts?
Regards
--
Michael

Attachments:

20160618_tap_improve_logs.patchapplication/x-patch; name=20160618_tap_improve_logs.patchDownload
From b74108a2f3680930df2f9749743cb797d8dbb024 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 18 Jun 2015 15:44:42 +0900
Subject: [PATCH] Improve log capture of TAP tests

All tests have their logs stored as regress_log/$TEST_NAME, with content
captured from the many commands run during the tests.
---
 src/Makefile.global.in                         |  1 +
 src/bin/pg_basebackup/.gitignore               |  1 +
 src/bin/pg_basebackup/Makefile                 |  2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl   |  4 +-
 src/bin/pg_controldata/.gitignore              |  1 +
 src/bin/pg_controldata/Makefile                |  2 +-
 src/bin/pg_controldata/t/001_pg_controldata.pl |  2 +-
 src/bin/pg_ctl/.gitignore                      |  1 +
 src/bin/pg_ctl/Makefile                        |  2 +-
 src/bin/pg_ctl/t/001_start_stop.pl             |  2 +-
 src/bin/pg_ctl/t/002_status.pl                 |  6 +--
 src/bin/pg_rewind/RewindTest.pm                | 58 ++++++++++++--------------
 src/bin/scripts/.gitignore                     |  1 +
 src/bin/scripts/Makefile                       |  2 +-
 src/test/perl/TestLib.pm                       | 46 ++++++++++++++------
 src/test/ssl/.gitignore                        |  1 +
 src/test/ssl/Makefile                          |  3 ++
 src/test/ssl/ServerSetup.pm                    |  8 ++--
 src/test/ssl/t/001_ssltests.pl                 |  2 +-
 19 files changed, 85 insertions(+), 60 deletions(-)

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index c583b44..a21aa19 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -336,6 +336,7 @@ cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPOR
 endef
 
 define prove_check
+rm -r $(srcdir)/regress_log
 cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
 endef
 
diff --git a/src/bin/pg_basebackup/.gitignore b/src/bin/pg_basebackup/.gitignore
index 36a2f12..746908b 100644
--- a/src/bin/pg_basebackup/.gitignore
+++ b/src/bin/pg_basebackup/.gitignore
@@ -2,4 +2,5 @@
 /pg_receivexlog
 /pg_recvlogical
 
+/regress_log/
 /tmp_check/
diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 0d8421a..66771dc 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -48,7 +48,7 @@ clean distclean maintainer-clean:
 	rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_recvlogical$(X) \
 		pg_basebackup.o pg_receivexlog.o pg_recvlogical.o \
 		$(OBJS)
-	rm -rf tmp_check
+	rm -rf tmp_check regress_log
 
 check:
 	$(prove_check)
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3476ea6..ecff372 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -22,7 +22,7 @@ print HBA "local replication all trust\n";
 print HBA "host replication all 127.0.0.1/32 trust\n";
 print HBA "host replication all ::1/128 trust\n";
 close HBA;
-system_or_bail 'pg_ctl', '-s', '-D', "$tempdir/pgdata", 'reload';
+run_or_bail(['pg_ctl', '-s', '-D', "$tempdir/pgdata", 'reload']);
 
 command_fails(
 	[ 'pg_basebackup', '-D', "$tempdir/backup" ],
@@ -51,7 +51,7 @@ ok(-f "$tempdir/tarbackup/base.tar", 'backup tar was created');
 
 my $superlongname = "superlongname_" . ("x" x 100);
 
-system_or_bail 'touch', "$tempdir/pgdata/$superlongname";
+run_or_bail(['touch', "$tempdir/pgdata/$superlongname"]);
 command_fails([ 'pg_basebackup', '-D', "$tempdir/tarbackup_l1", '-Ft' ],
 	'pg_basebackup tar with long name fails');
 unlink "$tempdir/pgdata/$superlongname";
diff --git a/src/bin/pg_controldata/.gitignore b/src/bin/pg_controldata/.gitignore
index 051d71d..04d0c68 100644
--- a/src/bin/pg_controldata/.gitignore
+++ b/src/bin/pg_controldata/.gitignore
@@ -1,2 +1,3 @@
 /pg_controldata
 /tmp_check/
+/regress_log/
diff --git a/src/bin/pg_controldata/Makefile b/src/bin/pg_controldata/Makefile
index fd7399b..3853dd1 100644
--- a/src/bin/pg_controldata/Makefile
+++ b/src/bin/pg_controldata/Makefile
@@ -33,7 +33,7 @@ uninstall:
 
 clean distclean maintainer-clean:
 	rm -f pg_controldata$(X) $(OBJS)
-	rm -rf tmp_check
+	rm -rf tmp_check regress_log
 
 check:
 	$(prove_check)
diff --git a/src/bin/pg_controldata/t/001_pg_controldata.pl b/src/bin/pg_controldata/t/001_pg_controldata.pl
index a4180e7..0d1407e 100644
--- a/src/bin/pg_controldata/t/001_pg_controldata.pl
+++ b/src/bin/pg_controldata/t/001_pg_controldata.pl
@@ -11,6 +11,6 @@ program_options_handling_ok('pg_controldata');
 command_fails(['pg_controldata'], 'pg_controldata without arguments fails');
 command_fails([ 'pg_controldata', 'nonexistent' ],
 	'pg_controldata with nonexistent directory fails');
-system_or_bail "initdb -D '$tempdir'/data -A trust >/dev/null";
+run_or_bail(['initdb', '-D', "$tempdir/data", '-A', 'trust']);
 command_like([ 'pg_controldata', "$tempdir/data" ],
 	qr/checkpoint/, 'pg_controldata produces output');
diff --git a/src/bin/pg_ctl/.gitignore b/src/bin/pg_ctl/.gitignore
index 73ab4ed..3efa15d 100644
--- a/src/bin/pg_ctl/.gitignore
+++ b/src/bin/pg_ctl/.gitignore
@@ -1,2 +1,3 @@
 /pg_ctl
 /tmp_check/
+/regress_log/
diff --git a/src/bin/pg_ctl/Makefile b/src/bin/pg_ctl/Makefile
index 37eb482..5bce54f 100644
--- a/src/bin/pg_ctl/Makefile
+++ b/src/bin/pg_ctl/Makefile
@@ -36,7 +36,7 @@ uninstall:
 
 clean distclean maintainer-clean:
 	rm -f pg_ctl$(X) $(OBJS)
-	rm -rf tmp_check
+	rm -rf tmp_check regress_log
 
 check:
 	$(prove_check)
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index 6c9ec5c..511ec48 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -36,4 +36,4 @@ command_ok([ 'pg_ctl', 'restart', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 command_ok([ 'pg_ctl', 'restart', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 	'pg_ctl restart with server running');
 
-system_or_bail 'pg_ctl', '-s', 'stop', '-D', "$tempdir/data", '-m', 'fast';
+run_or_bail(['pg_ctl', '-s', 'stop', '-D', "$tempdir/data", '-m', 'fast']);
diff --git a/src/bin/pg_ctl/t/002_status.pl b/src/bin/pg_ctl/t/002_status.pl
index 0558854..bfc2147 100644
--- a/src/bin/pg_ctl/t/002_status.pl
+++ b/src/bin/pg_ctl/t/002_status.pl
@@ -18,9 +18,9 @@ close CONF;
 command_exit_is([ 'pg_ctl', 'status', '-D', "$tempdir/data" ],
 	3, 'pg_ctl status with server not running');
 
-system_or_bail 'pg_ctl', '-s', '-l', "$tempdir/logfile", '-D',
-  "$tempdir/data", '-w', 'start';
+run_or_bail(['pg_ctl', '-s', '-l', "$tempdir/logfile", '-D',
+  "$tempdir/data", '-w', 'start']);
 command_exit_is([ 'pg_ctl', 'status', '-D', "$tempdir/data" ],
 	0, 'pg_ctl status with server running');
 
-system_or_bail 'pg_ctl', '-s', 'stop', '-D', "$tempdir/data", '-m', 'fast';
+run_or_bail(['pg_ctl', '-s', 'stop', '-D', "$tempdir/data", '-m', 'fast']);
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 5219ec9..c8dff91 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -79,7 +79,6 @@ mkdir "regress_log";
 my $port_master  = $ENV{PGPORT};
 my $port_standby = $port_master + 1;
 
-my $log_path;
 my $tempdir_short;
 
 my $connstr_master  = "port=$port_master";
@@ -91,14 +90,16 @@ sub master_psql
 {
 	my $cmd = shift;
 
-	system_or_bail("psql -q --no-psqlrc -d $connstr_master -c \"$cmd\"");
+	run_or_bail(['psql', '-q', '--no-psqlrc', '-d', $connstr_master,
+				 '-c', "$cmd"]);
 }
 
 sub standby_psql
 {
 	my $cmd = shift;
 
-	system_or_bail("psql -q --no-psqlrc -d $connstr_standby -c \"$cmd\"");
+	run_or_bail(['psql', '-q', '--no-psqlrc', '-d', $connstr_standby,
+				 '-c', "$cmd"]);
 }
 
 # Run a query against the master, and check that the output matches what's
@@ -175,10 +176,6 @@ sub init_rewind_test
 {
 	my $testname  = shift;
 	my $test_mode = shift;
-
-	$log_path = "regress_log/pg_rewind_log_${testname}_${test_mode}";
-
-	remove_tree $log_path;
 }
 
 sub setup_cluster
@@ -209,8 +206,10 @@ max_connections = 10
 local replication all trust
 ));
 
-	system_or_bail(
-"pg_ctl -w -D $test_master_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_master\" start >>$log_path 2>&1"
+	run_or_bail(
+		['pg_ctl' , '-w', '-D' , $test_master_datadir,
+		 "-o", "-k $tempdir_short --listen-addresses='' -p $port_master",
+		 'start']
 	);
 
 	#### Now run the test-specific parts to initialize the master before setting
@@ -225,8 +224,9 @@ sub create_standby
 	remove_tree $test_standby_datadir;
 
 	# Base backup is taken with xlog files included
-	system_or_bail(
-"pg_basebackup -D $test_standby_datadir -p $port_master -x >>$log_path 2>&1");
+	run_or_bail(
+		['pg_basebackup', '-D', $test_standby_datadir, '-p',
+		 $port_master, '-x']);
 	append_to_file(
 		"$test_standby_datadir/recovery.conf", qq(
 primary_conninfo='$connstr_master application_name=rewind_standby'
@@ -235,8 +235,9 @@ recovery_target_timeline='latest'
 ));
 
 	# Start standby
-	system_or_bail(
-"pg_ctl -w -D $test_standby_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_standby\" start >>$log_path 2>&1"
+	run_or_bail(
+['pg_ctl', '-w', '-D', $test_standby_datadir , '-o',
+"-k $tempdir_short --listen-addresses='' -p $port_standby", 'start']
 	);
 
 	# Wait until the standby has caught up with the primary, by polling
@@ -255,8 +256,8 @@ sub promote_standby
 	# Now promote slave and insert some new data on master, this will put
 	# the master out-of-sync with the standby. Wait until the standby is
 	# out of recovery mode, and is ready to accept read-write connections.
-	system_or_bail(
-		"pg_ctl -w -D $test_standby_datadir promote >>$log_path 2>&1");
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_standby_datadir, 'promote']);
 	poll_query_until("SELECT NOT pg_is_in_recovery()", $connstr_standby)
 	  or die "Timed out while waiting for promotion of standby";
 
@@ -274,8 +275,8 @@ sub run_pg_rewind
 	my $test_mode = shift;
 
 	# Stop the master and be ready to perform the rewind
-	system_or_bail(
-		"pg_ctl -w -D $test_master_datadir stop -m fast >>$log_path 2>&1");
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_master_datadir, 'stop', '-m', 'fast']);
 
 	# At this point, the rewind processing is ready to run.
 	# We now have a very simple scenario with a few diverged WAL record.
@@ -294,31 +295,25 @@ sub run_pg_rewind
 
 		# Do rewind using a local pgdata as source
 		# Stop the master and be ready to perform the rewind
-		system_or_bail(
-			"pg_ctl -w -D $test_standby_datadir stop -m fast >>$log_path 2>&1"
+		run_or_bail(
+			['pg_ctl', '-w', '-D', $test_standby_datadir, 'stop', '-m', 'fast']
 		);
-		my $result = run(
+		my $result = run_log(
 			[   'pg_rewind',
 				"--debug",
 				"--source-pgdata=$test_standby_datadir",
-				"--target-pgdata=$test_master_datadir" ],
-			'>>',
-			$log_path,
-			'2>&1');
+				"--target-pgdata=$test_master_datadir" ]);
 		ok($result, 'pg_rewind local');
 	}
 	elsif ($test_mode eq "remote")
 	{
 
 		# Do rewind using a remote connection as source
-		my $result = run(
+		my $result = run_log(
 			[   'pg_rewind',
 				"--source-server",
 				"port=$port_standby dbname=postgres",
-				"--target-pgdata=$test_master_datadir" ],
-			'>>',
-			$log_path,
-			'2>&1');
+				"--target-pgdata=$test_master_datadir" ]);
 		ok($result, 'pg_rewind remote');
 	}
 	else
@@ -342,8 +337,9 @@ recovery_target_timeline='latest'
 ));
 
 	# Restart the master to check that rewind went correctly
-	system_or_bail(
-"pg_ctl -w -D $test_master_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_master\" start >>$log_path 2>&1"
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_master_datadir, '-o',
+		 "-k $tempdir_short --listen-addresses='' -p $port_master", 'start']
 	);
 
 	#### Now run the test-specific parts to check the result
diff --git a/src/bin/scripts/.gitignore b/src/bin/scripts/.gitignore
index 1056b28..b1647a7 100644
--- a/src/bin/scripts/.gitignore
+++ b/src/bin/scripts/.gitignore
@@ -16,3 +16,4 @@
 /print.c
 
 /tmp_check/
+/regress_log/
\ No newline at end of file
diff --git a/src/bin/scripts/Makefile b/src/bin/scripts/Makefile
index c831716..c699470 100644
--- a/src/bin/scripts/Makefile
+++ b/src/bin/scripts/Makefile
@@ -67,7 +67,7 @@ clean distclean maintainer-clean:
 	rm -f $(addsuffix $(X), $(PROGRAMS)) $(addsuffix .o, $(PROGRAMS))
 	rm -f common.o dumputils.o kwlookup.o keywords.o print.o mbprint.o $(WIN32RES)
 	rm -f dumputils.c print.c mbprint.c kwlookup.c keywords.c
-	rm -rf tmp_check
+	rm -rf tmp_check regress_log
 
 check:
 	$(prove_check)
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index ef42366..c00e4df 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -12,6 +12,8 @@ our @EXPORT = qw(
   restart_test_server
   psql
   system_or_bail
+  run_or_bail
+  run_log
 
   command_ok
   command_fails
@@ -24,11 +26,16 @@ our @EXPORT = qw(
 );
 
 use Cwd;
+use File::Basename;
 use File::Spec;
 use File::Temp ();
 use IPC::Run qw(run start);
 use Test::More;
 
+mkdir 'regress_log';
+my $test_logfile = basename($0);
+$test_logfile =~ s/\.[^.]+$//;
+$test_logfile = 'regress_log/regress_log_' . $test_logfile;
 
 # Set to untranslated messages, to be able to compare program output
 # with expected strings.
@@ -77,9 +84,9 @@ sub tempdir_short
 sub standard_initdb
 {
 	my $pgdata = shift;
-	system_or_bail("initdb -D '$pgdata' -A trust -N >/dev/null");
-	system_or_bail("$ENV{top_builddir}/src/test/regress/pg_regress",
-		'--config-auth', $pgdata);
+	run_or_bail(['initdb', '-D', "$pgdata", '-A' , 'trust', '-N']);
+	run_or_bail(["$ENV{top_builddir}/src/test/regress/pg_regress",
+		'--config-auth', $pgdata]);
 }
 
 my ($test_server_datadir, $test_server_logfile);
@@ -92,14 +99,14 @@ sub start_test_server
 	my $tempdir_short = tempdir_short;
 
 	standard_initdb "$tempdir/pgdata";
-	$ret = system 'pg_ctl', '-D', "$tempdir/pgdata", '-s', '-w', '-l',
+	$ret = run_log(['pg_ctl', '-D', "$tempdir/pgdata", '-w', '-l',
 	  "$tempdir/logfile", '-o',
-"--fsync=off -k $tempdir_short --listen-addresses='' --log-statement=all",
-	  'start';
+"--fsync=off -k \"$tempdir_short\" --listen-addresses='' --log-statement=all",
+					'start']);
 
-	if ($ret != 0)
+	if (!$ret)
 	{
-		system('cat', "$tempdir/logfile");
+		run_log(['cat', "$tempdir/logfile"]);
 		BAIL_OUT("pg_ctl failed");
 	}
 
@@ -110,16 +117,16 @@ sub start_test_server
 
 sub restart_test_server
 {
-	system 'pg_ctl', '-s', '-D', $test_server_datadir, '-w', '-l',
-	  $test_server_logfile, 'restart';
+	run_log(['pg_ctl', '-D', $test_server_datadir, '-w', '-l',
+	  $test_server_logfile, 'restart']);
 }
 
 END
 {
 	if ($test_server_datadir)
 	{
-		system 'pg_ctl', '-D', $test_server_datadir, '-s', '-w', '-m',
-		  'immediate', 'stop';
+		run_log(['pg_ctl', '-D', $test_server_datadir, '-w', '-m',
+		  'immediate', 'stop']);
 	}
 }
 
@@ -134,6 +141,21 @@ sub system_or_bail
 	system(@_) == 0 or BAIL_OUT("system @_ failed: $?");
 }
 
+sub run_log
+{
+	my $result = run (@_, '>>', $test_logfile, '2>&1');
+	return $result;
+}
+
+sub run_or_bail
+{
+	my $result = run (@_, '>>', $test_logfile, '2>&1');
+	if (!$result)
+	{
+		BAIL_OUT("run $_[0] failed: $?");
+	}
+}
+
 
 #
 # Test functions
diff --git a/src/test/ssl/.gitignore b/src/test/ssl/.gitignore
index a325c2f..d7b2b23 100644
--- a/src/test/ssl/.gitignore
+++ b/src/test/ssl/.gitignore
@@ -1,3 +1,4 @@
 # Generated by regression tests
 /client-log
 /tmp_check/
+/regress_log/
diff --git a/src/test/ssl/Makefile b/src/test/ssl/Makefile
index d9fd29a..785500e 100644
--- a/src/test/ssl/Makefile
+++ b/src/test/ssl/Makefile
@@ -122,5 +122,8 @@ ssl/root+client.crl: ssl/root.crl ssl/client.crl
 sslfiles-clean:
 	rm -f $(SSLFILES) ssl/client_ca.srl ssl/server_ca.srl ssl/client_ca-certindex* ssl/server_ca-certindex* ssl/root_ca-certindex* ssl/root_ca.srl ssl/temp_ca.crt ssl/temp_ca_signed.crt
 
+clean:
+	rm -rf regress_log
+
 check:
 	$(prove_check)
diff --git a/src/test/ssl/ServerSetup.pm b/src/test/ssl/ServerSetup.pm
index 4ce4a69..16a6ad9 100644
--- a/src/test/ssl/ServerSetup.pm
+++ b/src/test/ssl/ServerSetup.pm
@@ -68,7 +68,7 @@ sub configure_test_server_for_ssl
 # Copy all server certificates and keys, and client root cert, to the data dir
 	copy_files("ssl/server-*.crt", "$tempdir/pgdata");
 	copy_files("ssl/server-*.key", "$tempdir/pgdata");
-	system_or_bail "chmod 0600 '$tempdir'/pgdata/server-*.key";
+	system_or_bail("chmod 0600 $tempdir/pgdata/server-*.key");
 	copy_files("ssl/root+client_ca.crt", "$tempdir/pgdata");
 	copy_files("ssl/root+client.crl",    "$tempdir/pgdata");
 
@@ -111,8 +111,6 @@ sub switch_server_cert
    # restart_test_server() because that overrides listen_addresses to only all
    # Unix domain socket connections.
 
-	system_or_bail 'pg_ctl', 'stop',  '-s', '-D', "$tempdir/pgdata", '-w';
-	system_or_bail 'pg_ctl', 'start', '-s', '-D', "$tempdir/pgdata", '-w',
-	  '-l',
-	  "$tempdir/logfile";
+	run_or_bail(['pg_ctl', 'stop',  '-s', '-D', "$tempdir/pgdata", '-w']);
+	run_or_bail(['pg_ctl', 'start', '-s', '-D', "$tempdir/pgdata", '-w']);
 }
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index 926b529..6e84a27 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -78,7 +78,7 @@ sub test_connect_fails
 # The client's private key must not be world-readable. Git doesn't track
 # permissions (except for the executable bit), so they might be wrong after
 # a checkout.
-system_or_bail "chmod 0600 ssl/client.key";
+system_or_bail("chmod 0600 ssl/client.key");
 
 #### Part 0. Set up the server.
 
-- 
2.4.3

#4Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#3)
3 attachment(s)
Re: The real reason why TAP testing isn't ready for prime time

On Thu, Jun 18, 2015 at 3:52 PM, Michael Paquier wrote:

I think that it would be useful as well to improve the buildfarm
output. Thoughts?

And after running the tests more or less 6~7 times in a row on a PI, I
have been able to trigger the problem and I think that I have found
its origin. First, the error has been triggered by the tests of
pg_rewind:
t/002_databases.pl ...
1..4
Bailout called. Further testing stopped: run pg_ctl failed: 256
Bail out! run pg_ctl failed: 256
FAILED--Further testing stopped: run pg_ctl failed: 256
Makefile:51: recipe for target 'check' failed
make[1]: *** [check] Error 255

And by looking at the logs obtained thanks to the previous patch I
could see the following (log attached for tests 1 and 2):
$ tail -n5 regress_log/regress_log_002_databases
waiting for server to start........ stopped waiting
pg_ctl: could not start server
Examine the log output.
LOG: received immediate shutdown request
LOG: received immediate shutdown request

pg_ctl should be able to start the server and should not fail here.
This is confirmed by the fact that first test has not stopped the
servers. On a clean run, the immediate shutdown request is received
and done:
waiting for server to shut down....LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
done

But in the case of the failure this does not happen:
LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
LOG: received immediate shutdown request
See the "done" is not here.

Now if we look at RewindTest.pm, there is the following code:
if ($test_master_datadir)
{
system
"pg_ctl -D $test_master_datadir -s -m immediate stop
2> /dev/null";
}
if ($test_standby_datadir)
{
system
"pg_ctl -D $test_standby_datadir -s -m immediate
stop 2> /dev/null";
}
And I think that the problem is triggered because we are missing a -w
switch here, meaning that we do not wait until the confirmation that
the server has stopped, and visibly if stop is slow enough the next
server to use cannot start because the port is already taken by the
server currently stopping.

Note as well that the last command of pg_ctl stop in
pg_ctl/t/002_status.pl does not use -w, so we have the same problem
there.

Attached is a patch fixing those problems and improving the log
facility as it really helped me out with those issues. The simplest
fix would be to include the -w switch missing in the tests of
pg_rewind and pg_ctl though.

It would be good to get that fixed, then I would be able to re-enable
the TAP tests of hamster. I have run the tests a dozen of times again
with this patch, and I could not trigger the failure anymore.
Regards,
--
Michael

Attachments:

regress_log_001_basicapplication/octet-stream; name=regress_log_001_basicDownload
regress_log_002_databasesapplication/octet-stream; name=regress_log_002_databasesDownload
20150619_tap_improve_logs_v2.patchapplication/x-patch; name=20150619_tap_improve_logs_v2.patchDownload
From 50ea935f3873df7c206e6ff3890a3d2219698098 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 19 Jun 2015 14:12:09 +0900
Subject: [PATCH] Improve log capture of TAP tests and fix race conditions

All tests have their logs stored as regress_log/$TEST_NAME, with content
captured from the many commands run during the tests.

Some commands of pg_ctl was lacking a -w switch, meaning that in rather
slow environments this could lead to a server not able to start if the
last server stopped did not yet acknowledge the stop request.
---
 src/Makefile.global.in                         |  1 +
 src/bin/pg_basebackup/.gitignore               |  1 +
 src/bin/pg_basebackup/Makefile                 |  2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl   |  4 +-
 src/bin/pg_controldata/.gitignore              |  1 +
 src/bin/pg_controldata/Makefile                |  2 +-
 src/bin/pg_controldata/t/001_pg_controldata.pl |  2 +-
 src/bin/pg_ctl/.gitignore                      |  1 +
 src/bin/pg_ctl/Makefile                        |  2 +-
 src/bin/pg_ctl/t/001_start_stop.pl             |  2 +-
 src/bin/pg_ctl/t/002_status.pl                 |  6 +--
 src/bin/pg_rewind/RewindTest.pm                | 68 +++++++++++++-------------
 src/bin/scripts/.gitignore                     |  1 +
 src/bin/scripts/Makefile                       |  2 +-
 src/test/perl/TestLib.pm                       | 48 +++++++++++++-----
 src/test/ssl/.gitignore                        |  1 +
 src/test/ssl/Makefile                          |  3 ++
 src/test/ssl/ServerSetup.pm                    |  6 +--
 18 files changed, 90 insertions(+), 63 deletions(-)

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index c583b44..4b1de8c 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -336,6 +336,7 @@ cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPOR
 endef
 
 define prove_check
+rm -rf $(srcdir)/regress_log
 cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
 endef
 
diff --git a/src/bin/pg_basebackup/.gitignore b/src/bin/pg_basebackup/.gitignore
index 36a2f12..746908b 100644
--- a/src/bin/pg_basebackup/.gitignore
+++ b/src/bin/pg_basebackup/.gitignore
@@ -2,4 +2,5 @@
 /pg_receivexlog
 /pg_recvlogical
 
+/regress_log/
 /tmp_check/
diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 0d8421a..66771dc 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -48,7 +48,7 @@ clean distclean maintainer-clean:
 	rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_recvlogical$(X) \
 		pg_basebackup.o pg_receivexlog.o pg_recvlogical.o \
 		$(OBJS)
-	rm -rf tmp_check
+	rm -rf tmp_check regress_log
 
 check:
 	$(prove_check)
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3476ea6..ecff372 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -22,7 +22,7 @@ print HBA "local replication all trust\n";
 print HBA "host replication all 127.0.0.1/32 trust\n";
 print HBA "host replication all ::1/128 trust\n";
 close HBA;
-system_or_bail 'pg_ctl', '-s', '-D', "$tempdir/pgdata", 'reload';
+run_or_bail(['pg_ctl', '-s', '-D', "$tempdir/pgdata", 'reload']);
 
 command_fails(
 	[ 'pg_basebackup', '-D', "$tempdir/backup" ],
@@ -51,7 +51,7 @@ ok(-f "$tempdir/tarbackup/base.tar", 'backup tar was created');
 
 my $superlongname = "superlongname_" . ("x" x 100);
 
-system_or_bail 'touch', "$tempdir/pgdata/$superlongname";
+run_or_bail(['touch', "$tempdir/pgdata/$superlongname"]);
 command_fails([ 'pg_basebackup', '-D', "$tempdir/tarbackup_l1", '-Ft' ],
 	'pg_basebackup tar with long name fails');
 unlink "$tempdir/pgdata/$superlongname";
diff --git a/src/bin/pg_controldata/.gitignore b/src/bin/pg_controldata/.gitignore
index 051d71d..04d0c68 100644
--- a/src/bin/pg_controldata/.gitignore
+++ b/src/bin/pg_controldata/.gitignore
@@ -1,2 +1,3 @@
 /pg_controldata
 /tmp_check/
+/regress_log/
diff --git a/src/bin/pg_controldata/Makefile b/src/bin/pg_controldata/Makefile
index fd7399b..3853dd1 100644
--- a/src/bin/pg_controldata/Makefile
+++ b/src/bin/pg_controldata/Makefile
@@ -33,7 +33,7 @@ uninstall:
 
 clean distclean maintainer-clean:
 	rm -f pg_controldata$(X) $(OBJS)
-	rm -rf tmp_check
+	rm -rf tmp_check regress_log
 
 check:
 	$(prove_check)
diff --git a/src/bin/pg_controldata/t/001_pg_controldata.pl b/src/bin/pg_controldata/t/001_pg_controldata.pl
index a4180e7..0d1407e 100644
--- a/src/bin/pg_controldata/t/001_pg_controldata.pl
+++ b/src/bin/pg_controldata/t/001_pg_controldata.pl
@@ -11,6 +11,6 @@ program_options_handling_ok('pg_controldata');
 command_fails(['pg_controldata'], 'pg_controldata without arguments fails');
 command_fails([ 'pg_controldata', 'nonexistent' ],
 	'pg_controldata with nonexistent directory fails');
-system_or_bail "initdb -D '$tempdir'/data -A trust >/dev/null";
+run_or_bail(['initdb', '-D', "$tempdir/data", '-A', 'trust']);
 command_like([ 'pg_controldata', "$tempdir/data" ],
 	qr/checkpoint/, 'pg_controldata produces output');
diff --git a/src/bin/pg_ctl/.gitignore b/src/bin/pg_ctl/.gitignore
index 73ab4ed..3efa15d 100644
--- a/src/bin/pg_ctl/.gitignore
+++ b/src/bin/pg_ctl/.gitignore
@@ -1,2 +1,3 @@
 /pg_ctl
 /tmp_check/
+/regress_log/
diff --git a/src/bin/pg_ctl/Makefile b/src/bin/pg_ctl/Makefile
index 37eb482..5bce54f 100644
--- a/src/bin/pg_ctl/Makefile
+++ b/src/bin/pg_ctl/Makefile
@@ -36,7 +36,7 @@ uninstall:
 
 clean distclean maintainer-clean:
 	rm -f pg_ctl$(X) $(OBJS)
-	rm -rf tmp_check
+	rm -rf tmp_check regress_log
 
 check:
 	$(prove_check)
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index 6c9ec5c..132e1a4 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -36,4 +36,4 @@ command_ok([ 'pg_ctl', 'restart', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 command_ok([ 'pg_ctl', 'restart', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 	'pg_ctl restart with server running');
 
-system_or_bail 'pg_ctl', '-s', 'stop', '-D', "$tempdir/data", '-m', 'fast';
+run_or_bail(['pg_ctl', '-w', 'stop', '-D', "$tempdir/data", '-m', 'fast']);
diff --git a/src/bin/pg_ctl/t/002_status.pl b/src/bin/pg_ctl/t/002_status.pl
index 0558854..e72dba9 100644
--- a/src/bin/pg_ctl/t/002_status.pl
+++ b/src/bin/pg_ctl/t/002_status.pl
@@ -18,9 +18,9 @@ close CONF;
 command_exit_is([ 'pg_ctl', 'status', '-D', "$tempdir/data" ],
 	3, 'pg_ctl status with server not running');
 
-system_or_bail 'pg_ctl', '-s', '-l', "$tempdir/logfile", '-D',
-  "$tempdir/data", '-w', 'start';
+run_or_bail(['pg_ctl', '-l', "$tempdir/logfile", '-D',
+  "$tempdir/data", '-w', 'start']);
 command_exit_is([ 'pg_ctl', 'status', '-D', "$tempdir/data" ],
 	0, 'pg_ctl status with server running');
 
-system_or_bail 'pg_ctl', '-s', 'stop', '-D', "$tempdir/data", '-m', 'fast';
+run_or_bail(['pg_ctl', '-w', 'stop', '-D', "$tempdir/data", '-m', 'fast']);
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 5219ec9..e8bafc2 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -79,7 +79,6 @@ mkdir "regress_log";
 my $port_master  = $ENV{PGPORT};
 my $port_standby = $port_master + 1;
 
-my $log_path;
 my $tempdir_short;
 
 my $connstr_master  = "port=$port_master";
@@ -91,14 +90,16 @@ sub master_psql
 {
 	my $cmd = shift;
 
-	system_or_bail("psql -q --no-psqlrc -d $connstr_master -c \"$cmd\"");
+	run_or_bail(['psql', '-q', '--no-psqlrc', '-d', $connstr_master,
+				 '-c', "$cmd"]);
 }
 
 sub standby_psql
 {
 	my $cmd = shift;
 
-	system_or_bail("psql -q --no-psqlrc -d $connstr_standby -c \"$cmd\"");
+	run_or_bail(['psql', '-q', '--no-psqlrc', '-d', $connstr_standby,
+				 '-c', "$cmd"]);
 }
 
 # Run a query against the master, and check that the output matches what's
@@ -175,10 +176,6 @@ sub init_rewind_test
 {
 	my $testname  = shift;
 	my $test_mode = shift;
-
-	$log_path = "regress_log/pg_rewind_log_${testname}_${test_mode}";
-
-	remove_tree $log_path;
 }
 
 sub setup_cluster
@@ -209,8 +206,10 @@ max_connections = 10
 local replication all trust
 ));
 
-	system_or_bail(
-"pg_ctl -w -D $test_master_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_master\" start >>$log_path 2>&1"
+	run_or_bail(
+		['pg_ctl' , '-w', '-D' , $test_master_datadir,
+		 "-o", "-k $tempdir_short --listen-addresses='' -p $port_master",
+		 'start']
 	);
 
 	#### Now run the test-specific parts to initialize the master before setting
@@ -225,8 +224,9 @@ sub create_standby
 	remove_tree $test_standby_datadir;
 
 	# Base backup is taken with xlog files included
-	system_or_bail(
-"pg_basebackup -D $test_standby_datadir -p $port_master -x >>$log_path 2>&1");
+	run_or_bail(
+		['pg_basebackup', '-D', $test_standby_datadir, '-p',
+		 $port_master, '-x']);
 	append_to_file(
 		"$test_standby_datadir/recovery.conf", qq(
 primary_conninfo='$connstr_master application_name=rewind_standby'
@@ -235,8 +235,9 @@ recovery_target_timeline='latest'
 ));
 
 	# Start standby
-	system_or_bail(
-"pg_ctl -w -D $test_standby_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_standby\" start >>$log_path 2>&1"
+	run_or_bail(
+['pg_ctl', '-w', '-D', $test_standby_datadir , '-o',
+"-k $tempdir_short --listen-addresses='' -p $port_standby", 'start']
 	);
 
 	# Wait until the standby has caught up with the primary, by polling
@@ -255,8 +256,8 @@ sub promote_standby
 	# Now promote slave and insert some new data on master, this will put
 	# the master out-of-sync with the standby. Wait until the standby is
 	# out of recovery mode, and is ready to accept read-write connections.
-	system_or_bail(
-		"pg_ctl -w -D $test_standby_datadir promote >>$log_path 2>&1");
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_standby_datadir, 'promote']);
 	poll_query_until("SELECT NOT pg_is_in_recovery()", $connstr_standby)
 	  or die "Timed out while waiting for promotion of standby";
 
@@ -274,8 +275,8 @@ sub run_pg_rewind
 	my $test_mode = shift;
 
 	# Stop the master and be ready to perform the rewind
-	system_or_bail(
-		"pg_ctl -w -D $test_master_datadir stop -m fast >>$log_path 2>&1");
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_master_datadir, 'stop', '-m', 'fast']);
 
 	# At this point, the rewind processing is ready to run.
 	# We now have a very simple scenario with a few diverged WAL record.
@@ -294,31 +295,25 @@ sub run_pg_rewind
 
 		# Do rewind using a local pgdata as source
 		# Stop the master and be ready to perform the rewind
-		system_or_bail(
-			"pg_ctl -w -D $test_standby_datadir stop -m fast >>$log_path 2>&1"
+		run_or_bail(
+			['pg_ctl', '-w', '-D', $test_standby_datadir, 'stop', '-m', 'fast']
 		);
-		my $result = run(
+		my $result = run_log(
 			[   'pg_rewind',
 				"--debug",
 				"--source-pgdata=$test_standby_datadir",
-				"--target-pgdata=$test_master_datadir" ],
-			'>>',
-			$log_path,
-			'2>&1');
+				"--target-pgdata=$test_master_datadir" ]);
 		ok($result, 'pg_rewind local');
 	}
 	elsif ($test_mode eq "remote")
 	{
 
 		# Do rewind using a remote connection as source
-		my $result = run(
+		my $result = run_log(
 			[   'pg_rewind',
 				"--source-server",
 				"port=$port_standby dbname=postgres",
-				"--target-pgdata=$test_master_datadir" ],
-			'>>',
-			$log_path,
-			'2>&1');
+				"--target-pgdata=$test_master_datadir" ]);
 		ok($result, 'pg_rewind remote');
 	}
 	else
@@ -342,8 +337,9 @@ recovery_target_timeline='latest'
 ));
 
 	# Restart the master to check that rewind went correctly
-	system_or_bail(
-"pg_ctl -w -D $test_master_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_master\" start >>$log_path 2>&1"
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_master_datadir, '-o',
+		 "-k $tempdir_short --listen-addresses='' -p $port_master", 'start']
 	);
 
 	#### Now run the test-specific parts to check the result
@@ -354,13 +350,15 @@ sub clean_rewind_test
 {
 	if ($test_master_datadir)
 	{
-		system
-		  "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
+		run_log(
+		  ['pg_ctl', '-w', '-D', $test_master_datadir, '-m', 'immediate',
+		   'stop']);
 	}
 	if ($test_standby_datadir)
 	{
-		system
-		  "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
+		run_log(
+		  ['pg_ctl', '-w', '-D', $test_standby_datadir, '-m', 'immediate',
+		   'stop']);
 	}
 }
 
diff --git a/src/bin/scripts/.gitignore b/src/bin/scripts/.gitignore
index 1056b28..b1647a7 100644
--- a/src/bin/scripts/.gitignore
+++ b/src/bin/scripts/.gitignore
@@ -16,3 +16,4 @@
 /print.c
 
 /tmp_check/
+/regress_log/
\ No newline at end of file
diff --git a/src/bin/scripts/Makefile b/src/bin/scripts/Makefile
index c831716..c699470 100644
--- a/src/bin/scripts/Makefile
+++ b/src/bin/scripts/Makefile
@@ -67,7 +67,7 @@ clean distclean maintainer-clean:
 	rm -f $(addsuffix $(X), $(PROGRAMS)) $(addsuffix .o, $(PROGRAMS))
 	rm -f common.o dumputils.o kwlookup.o keywords.o print.o mbprint.o $(WIN32RES)
 	rm -f dumputils.c print.c mbprint.c kwlookup.c keywords.c
-	rm -rf tmp_check
+	rm -rf tmp_check regress_log
 
 check:
 	$(prove_check)
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index ef42366..a9350f9 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -12,6 +12,8 @@ our @EXPORT = qw(
   restart_test_server
   psql
   system_or_bail
+  run_or_bail
+  run_log
 
   command_ok
   command_fails
@@ -24,11 +26,16 @@ our @EXPORT = qw(
 );
 
 use Cwd;
+use File::Basename;
 use File::Spec;
 use File::Temp ();
 use IPC::Run qw(run start);
 use Test::More;
 
+mkdir 'regress_log';
+my $test_logfile = basename($0);
+$test_logfile =~ s/\.[^.]+$//;
+$test_logfile = 'regress_log/regress_log_' . $test_logfile;
 
 # Set to untranslated messages, to be able to compare program output
 # with expected strings.
@@ -77,9 +84,9 @@ sub tempdir_short
 sub standard_initdb
 {
 	my $pgdata = shift;
-	system_or_bail("initdb -D '$pgdata' -A trust -N >/dev/null");
-	system_or_bail("$ENV{top_builddir}/src/test/regress/pg_regress",
-		'--config-auth', $pgdata);
+	run_or_bail(['initdb', '-D', "$pgdata", '-A' , 'trust', '-N']);
+	run_or_bail(["$ENV{top_builddir}/src/test/regress/pg_regress",
+		'--config-auth', $pgdata]);
 }
 
 my ($test_server_datadir, $test_server_logfile);
@@ -92,14 +99,14 @@ sub start_test_server
 	my $tempdir_short = tempdir_short;
 
 	standard_initdb "$tempdir/pgdata";
-	$ret = system 'pg_ctl', '-D', "$tempdir/pgdata", '-s', '-w', '-l',
+	$ret = run_log(['pg_ctl', '-D', "$tempdir/pgdata", '-w', '-l',
 	  "$tempdir/logfile", '-o',
-"--fsync=off -k $tempdir_short --listen-addresses='' --log-statement=all",
-	  'start';
+"--fsync=off -k \"$tempdir_short\" --listen-addresses='' --log-statement=all",
+					'start']);
 
-	if ($ret != 0)
+	if (!$ret)
 	{
-		system('cat', "$tempdir/logfile");
+		run_log(['cat', "$tempdir/logfile"]);
 		BAIL_OUT("pg_ctl failed");
 	}
 
@@ -110,16 +117,16 @@ sub start_test_server
 
 sub restart_test_server
 {
-	system 'pg_ctl', '-s', '-D', $test_server_datadir, '-w', '-l',
-	  $test_server_logfile, 'restart';
+	run_log(['pg_ctl', '-D', $test_server_datadir, '-w', '-l',
+	  $test_server_logfile, 'restart']);
 }
 
 END
 {
 	if ($test_server_datadir)
 	{
-		system 'pg_ctl', '-D', $test_server_datadir, '-s', '-w', '-m',
-		  'immediate', 'stop';
+		run_log(['pg_ctl', '-D', $test_server_datadir, '-w', '-m',
+		  'immediate', 'stop']);
 	}
 }
 
@@ -131,7 +138,22 @@ sub psql
 
 sub system_or_bail
 {
-	system(@_) == 0 or BAIL_OUT("system @_ failed: $?");
+	system(@_) == 0 or BAIL_OUT("system $_[0] failed: $?");
+}
+
+sub run_log
+{
+	my $result = run (@_, '>>', $test_logfile, '2>&1');
+	return $result;
+}
+
+sub run_or_bail
+{
+	my $result = run (@_, '>>', $test_logfile, '2>&1');
+	if (!$result)
+	{
+		BAIL_OUT("run $_[0] failed: $?");
+	}
 }
 
 
diff --git a/src/test/ssl/.gitignore b/src/test/ssl/.gitignore
index a325c2f..d7b2b23 100644
--- a/src/test/ssl/.gitignore
+++ b/src/test/ssl/.gitignore
@@ -1,3 +1,4 @@
 # Generated by regression tests
 /client-log
 /tmp_check/
+/regress_log/
diff --git a/src/test/ssl/Makefile b/src/test/ssl/Makefile
index d9fd29a..785500e 100644
--- a/src/test/ssl/Makefile
+++ b/src/test/ssl/Makefile
@@ -122,5 +122,8 @@ ssl/root+client.crl: ssl/root.crl ssl/client.crl
 sslfiles-clean:
 	rm -f $(SSLFILES) ssl/client_ca.srl ssl/server_ca.srl ssl/client_ca-certindex* ssl/server_ca-certindex* ssl/root_ca-certindex* ssl/root_ca.srl ssl/temp_ca.crt ssl/temp_ca_signed.crt
 
+clean:
+	rm -rf regress_log
+
 check:
 	$(prove_check)
diff --git a/src/test/ssl/ServerSetup.pm b/src/test/ssl/ServerSetup.pm
index 4ce4a69..6b186a4 100644
--- a/src/test/ssl/ServerSetup.pm
+++ b/src/test/ssl/ServerSetup.pm
@@ -111,8 +111,6 @@ sub switch_server_cert
    # restart_test_server() because that overrides listen_addresses to only all
    # Unix domain socket connections.
 
-	system_or_bail 'pg_ctl', 'stop',  '-s', '-D', "$tempdir/pgdata", '-w';
-	system_or_bail 'pg_ctl', 'start', '-s', '-D', "$tempdir/pgdata", '-w',
-	  '-l',
-	  "$tempdir/logfile";
+	run_or_bail(['pg_ctl', 'stop',  '-s', '-D', "$tempdir/pgdata", '-w']);
+	run_or_bail(['pg_ctl', 'start', '-s', '-D', "$tempdir/pgdata", '-w']);
 }
-- 
2.4.4

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#4)
Re: The real reason why TAP testing isn't ready for prime time

Michael Paquier <michael.paquier@gmail.com> writes:

Attached is a patch fixing those problems and improving the log
facility as it really helped me out with those issues. The simplest
fix would be to include the -w switch missing in the tests of
pg_rewind and pg_ctl though.

I agree with adding the -w switch and will go do that. As far as the
rest of this patch goes, it seems like it could be made less invasive
if the logs got dumped into a subdirectory of tmp_check rather than
adding another top-level directory that has to be cleaned?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#4)
Re: The real reason why TAP testing isn't ready for prime time

Michael Paquier <michael.paquier@gmail.com> writes:

Now if we look at RewindTest.pm, there is the following code:
if ($test_master_datadir)
{
system
"pg_ctl -D $test_master_datadir -s -m immediate stop
2> /dev/null";
}
if ($test_standby_datadir)
{
system
"pg_ctl -D $test_standby_datadir -s -m immediate
stop 2> /dev/null";
}
And I think that the problem is triggered because we are missing a -w
switch here, meaning that we do not wait until the confirmation that
the server has stopped, and visibly if stop is slow enough the next
server to use cannot start because the port is already taken by the
server currently stopping.

After I woke up a bit more, I remembered that -w is already the default
for "pg_ctl stop", so your diagnosis here is incorrect.

I suspect that the real problem is the arbitrary decision to use -m
immediate. The postmaster would ordinarily wait for its children to
die, but on a slow machine we could perhaps reach the end of that
5-second timeout, whereupon the postmaster would SIGKILL its children
*and exit immediately*. I'm not sure how instantaneous SIGKILL is,
but it seems possible that we could end up trying to start the new
postmaster before all the children of the old one are dead. If the
shmem interlock is working properly that ought to fail.

I wonder whether it's such a good idea for the postmaster to give
up waiting before all children are gone (postmaster.c:1722 in HEAD).

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: The real reason why TAP testing isn't ready for prime time

On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it's such a good idea for the postmaster to give
up waiting before all children are gone (postmaster.c:1722 in HEAD).

I doubt it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: The real reason why TAP testing isn't ready for prime time

On 2015-06-19 11:16:18 -0400, Robert Haas wrote:

On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it's such a good idea for the postmaster to give
up waiting before all children are gone (postmaster.c:1722 in HEAD).

I doubt it.

Seconded. It's pretty bad to possibly not be able to start again after
stopping it. I don't see the benefit in not waiting - sure, the
poastmaster exits faster, but postgres hasn't shut down at that point...

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: The real reason why TAP testing isn't ready for prime time

Andres Freund <andres@anarazel.de> writes:

On 2015-06-19 11:16:18 -0400, Robert Haas wrote:

On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it's such a good idea for the postmaster to give
up waiting before all children are gone (postmaster.c:1722 in HEAD).

I doubt it.

Seconded. It's pretty bad to possibly not be able to start again after
stopping it. I don't see the benefit in not waiting - sure, the
poastmaster exits faster, but postgres hasn't shut down at that point...

Yeah. I'll see about fixing this. Hard to be sure if it will fix
hamster's issue though.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#9)
Re: The real reason why TAP testing isn't ready for prime time

Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2015-06-19 11:16:18 -0400, Robert Haas wrote:

On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it's such a good idea for the postmaster to give
up waiting before all children are gone (postmaster.c:1722 in HEAD).

I doubt it.

Seconded. It's pretty bad to possibly not be able to start again after
stopping it. I don't see the benefit in not waiting - sure, the
poastmaster exits faster, but postgres hasn't shut down at that point...

Yeah. I'll see about fixing this. Hard to be sure if it will fix
hamster's issue though.

We discussed this when that patch got in (82233ce7ea42d6b). The reason
for not waiting, it was argued, is that the most likely reason for those
processes not to have already gone away by the time we send SIGKILL was
that they are stuck somewhere in the kernel, and so we might not be able
to actually get them to go away with the SIGKILL. As I recall, that was
the actual problem that MauMau was trying to get fixed.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#10)
Re: The real reason why TAP testing isn't ready for prime time

On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:

We discussed this when that patch got in (82233ce7ea42d6b). The reason
for not waiting, it was argued, is that the most likely reason for those
processes not to have already gone away by the time we send SIGKILL was
that they are stuck somewhere in the kernel, and so we might not be able
to actually get them to go away with the SIGKILL. As I recall, that was
the actual problem that MauMau was trying to get fixed.

How does exiting before they're dead help? They're still going to be
attached to shared memeory and thus prevent a restart. I don't think
hiding the problem by exiting the postmaster helps at all.

I don't think it's our job to protect against processes stuck in the
kernel. If that happens something seriously has gone wrong, and we
shouldn't just continue without making that visible.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#10)
Re: The real reason why TAP testing isn't ready for prime time

Alvaro Herrera wrote:

We discussed this when that patch got in (82233ce7ea42d6b). The reason
for not waiting, it was argued, is that the most likely reason for those
processes not to have already gone away by the time we send SIGKILL was
that they are stuck somewhere in the kernel, and so we might not be able
to actually get them to go away with the SIGKILL. As I recall, that was
the actual problem that MauMau was trying to get fixed.

Hm, I reviewed the discussion and actually I'm wrong: I came up with
the unkillable process argument on my own. MauMau's problem was a
deadlocked process because of trying to malloc() on its way out.
I still think that unkillable processes are an issue, but perhaps it's
sane to have postmaster wait for some time instead of just going away
immediately.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: The real reason why TAP testing isn't ready for prime time

Andres Freund <andres@anarazel.de> writes:

On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:

We discussed this when that patch got in (82233ce7ea42d6b). The reason
for not waiting, it was argued, is that the most likely reason for those
processes not to have already gone away by the time we send SIGKILL was
that they are stuck somewhere in the kernel, and so we might not be able
to actually get them to go away with the SIGKILL. As I recall, that was
the actual problem that MauMau was trying to get fixed.

How does exiting before they're dead help? They're still going to be
attached to shared memeory and thus prevent a restart. I don't think
hiding the problem by exiting the postmaster helps at all.

My thought exactly.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#13)
Re: The real reason why TAP testing isn't ready for prime time

Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:

We discussed this when that patch got in (82233ce7ea42d6b). The reason
for not waiting, it was argued, is that the most likely reason for those
processes not to have already gone away by the time we send SIGKILL was
that they are stuck somewhere in the kernel, and so we might not be able
to actually get them to go away with the SIGKILL. As I recall, that was
the actual problem that MauMau was trying to get fixed.

How does exiting before they're dead help? They're still going to be
attached to shared memeory and thus prevent a restart. I don't think
hiding the problem by exiting the postmaster helps at all.

My thought exactly.

I guess you have a point there.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Michael Paquier
michael.paquier@gmail.com
In reply to: Tom Lane (#6)
Re: The real reason why TAP testing isn't ready for prime time

On Sat, Jun 20, 2015 at 12:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

Now if we look at RewindTest.pm, there is the following code:
if ($test_master_datadir)
{
system
"pg_ctl -D $test_master_datadir -s -m immediate stop
2> /dev/null";
}
if ($test_standby_datadir)
{
system
"pg_ctl -D $test_standby_datadir -s -m immediate
stop 2> /dev/null";
}
And I think that the problem is triggered because we are missing a -w
switch here, meaning that we do not wait until the confirmation that
the server has stopped, and visibly if stop is slow enough the next
server to use cannot start because the port is already taken by the
server currently stopping.

After I woke up a bit more, I remembered that -w is already the default
for "pg_ctl stop", so your diagnosis here is incorrect.

Ah right. I forgot that. Perhaps I got just lucky in my runs.

I suspect that the real problem is the arbitrary decision to use -m
immediate. The postmaster would ordinarily wait for its children to
die, but on a slow machine we could perhaps reach the end of that
5-second timeout, whereupon the postmaster would SIGKILL its children
*and exit immediately*. I'm not sure how instantaneous SIGKILL is,
but it seems possible that we could end up trying to start the new
postmaster before all the children of the old one are dead. If the
shmem interlock is working properly that ought to fail.

I wonder whether it's such a good idea for the postmaster to give
up waiting before all children are gone (postmaster.c:1722 in HEAD).

I don't think so as well.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Michael Paquier
michael.paquier@gmail.com
In reply to: Tom Lane (#9)
Re: The real reason why TAP testing isn't ready for prime time

On Sat, Jun 20, 2015 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

On 2015-06-19 11:16:18 -0400, Robert Haas wrote:

On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it's such a good idea for the postmaster to give
up waiting before all children are gone (postmaster.c:1722 in HEAD).

I doubt it.

Seconded. It's pretty bad to possibly not be able to start again after
stopping it. I don't see the benefit in not waiting - sure, the
poastmaster exits faster, but postgres hasn't shut down at that point...

Yeah. I'll see about fixing this. Hard to be sure if it will fix
hamster's issue though.

I have re-enabled the TAP tests in hamster. What happens in the next
couple of days will be interesting to see.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Michael Paquier
michael.paquier@gmail.com
In reply to: Tom Lane (#5)
Re: The real reason why TAP testing isn't ready for prime time

On Fri, Jun 19, 2015 at 11:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

Attached is a patch fixing those problems and improving the log
facility as it really helped me out with those issues. The simplest
fix would be to include the -w switch missing in the tests of
pg_rewind and pg_ctl though.

I agree with adding the -w switch and will go do that. As far as the
rest of this patch goes, it seems like it could be made less invasive
if the logs got dumped into a subdirectory of tmp_check rather than
adding another top-level directory that has to be cleaned?

tmp_check remains unused since dcae5fac (missing something perhaps?),
but I guess we could use it again to save the logs in it, and have
pg_rewind do the same.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#17)
1 attachment(s)
Re: The real reason why TAP testing isn't ready for prime time

On Sat, Jun 20, 2015 at 7:06 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

As far as the
rest of this patch goes, it seems like it could be made less invasive
if the logs got dumped into a subdirectory of tmp_check rather than
adding another top-level directory that has to be cleaned?

tmp_check remains unused since dcae5fac (missing something perhaps?),
but I guess we could use it again to save the logs in it, and have
pg_rewind do the same.

OK, so here is an updated patch. All the logs are saved in
tmp_check/log, and each log file is named with the name of the file
that launched the test to facilitate its identification. I noticed at
the same time that some commands were called in silent mode. I removed
that and now what they output is written as well in the log files.
Regards,
--
Michael

Attachments:

20150620_tap_improve_logs_v3.patchapplication/x-patch; name=20150620_tap_improve_logs_v3.patchDownload
From 59fa3c2cd99bfbff970c6054ea279bb26a0a3413 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Sat, 20 Jun 2015 17:44:39 +0900
Subject: [PATCH] Improve log capture of TAP tests and fix race conditions

All tests have their logs stored as regress_log/$TEST_NAME, with content
captured from the many commands run during the tests. Commands that were
previously called in silent mode are made verbose and have their output
written in the newly-created log files.
---
 src/Makefile.global.in                         |  1 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl   |  4 +-
 src/bin/pg_controldata/t/001_pg_controldata.pl |  2 +-
 src/bin/pg_ctl/t/001_start_stop.pl             |  2 +-
 src/bin/pg_ctl/t/002_status.pl                 |  6 +--
 src/bin/pg_rewind/.gitignore                   |  1 -
 src/bin/pg_rewind/Makefile                     |  2 +-
 src/bin/pg_rewind/RewindTest.pm                | 74 ++++++++++++--------------
 src/bin/pg_rewind/t/001_basic.pl               |  1 -
 src/bin/pg_rewind/t/002_databases.pl           |  1 -
 src/bin/pg_rewind/t/003_extrafiles.pl          |  1 -
 src/test/perl/TestLib.pm                       | 52 +++++++++++++-----
 src/test/ssl/ServerSetup.pm                    |  6 +--
 13 files changed, 83 insertions(+), 70 deletions(-)

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index c583b44..171d1e4 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -336,6 +336,7 @@ cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPOR
 endef
 
 define prove_check
+rm -rf $(srcdir)/tmp_check/log
 cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
 endef
 
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3476ea6..b87ef29 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -22,7 +22,7 @@ print HBA "local replication all trust\n";
 print HBA "host replication all 127.0.0.1/32 trust\n";
 print HBA "host replication all ::1/128 trust\n";
 close HBA;
-system_or_bail 'pg_ctl', '-s', '-D', "$tempdir/pgdata", 'reload';
+run_or_bail(['pg_ctl', '-D', "$tempdir/pgdata", 'reload']);
 
 command_fails(
 	[ 'pg_basebackup', '-D', "$tempdir/backup" ],
@@ -51,7 +51,7 @@ ok(-f "$tempdir/tarbackup/base.tar", 'backup tar was created');
 
 my $superlongname = "superlongname_" . ("x" x 100);
 
-system_or_bail 'touch', "$tempdir/pgdata/$superlongname";
+run_or_bail(['touch', "$tempdir/pgdata/$superlongname"]);
 command_fails([ 'pg_basebackup', '-D', "$tempdir/tarbackup_l1", '-Ft' ],
 	'pg_basebackup tar with long name fails');
 unlink "$tempdir/pgdata/$superlongname";
diff --git a/src/bin/pg_controldata/t/001_pg_controldata.pl b/src/bin/pg_controldata/t/001_pg_controldata.pl
index a4180e7..0d1407e 100644
--- a/src/bin/pg_controldata/t/001_pg_controldata.pl
+++ b/src/bin/pg_controldata/t/001_pg_controldata.pl
@@ -11,6 +11,6 @@ program_options_handling_ok('pg_controldata');
 command_fails(['pg_controldata'], 'pg_controldata without arguments fails');
 command_fails([ 'pg_controldata', 'nonexistent' ],
 	'pg_controldata with nonexistent directory fails');
-system_or_bail "initdb -D '$tempdir'/data -A trust >/dev/null";
+run_or_bail(['initdb', '-D', "$tempdir/data", '-A', 'trust']);
 command_like([ 'pg_controldata', "$tempdir/data" ],
 	qr/checkpoint/, 'pg_controldata produces output');
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index 6c9ec5c..57563d7 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -36,4 +36,4 @@ command_ok([ 'pg_ctl', 'restart', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 command_ok([ 'pg_ctl', 'restart', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 	'pg_ctl restart with server running');
 
-system_or_bail 'pg_ctl', '-s', 'stop', '-D', "$tempdir/data", '-m', 'fast';
+run_or_bail(['pg_ctl', 'stop', '-D', "$tempdir/data", '-m', 'fast']);
diff --git a/src/bin/pg_ctl/t/002_status.pl b/src/bin/pg_ctl/t/002_status.pl
index 0558854..d1192f6 100644
--- a/src/bin/pg_ctl/t/002_status.pl
+++ b/src/bin/pg_ctl/t/002_status.pl
@@ -18,9 +18,9 @@ close CONF;
 command_exit_is([ 'pg_ctl', 'status', '-D', "$tempdir/data" ],
 	3, 'pg_ctl status with server not running');
 
-system_or_bail 'pg_ctl', '-s', '-l', "$tempdir/logfile", '-D',
-  "$tempdir/data", '-w', 'start';
+run_or_bail(['pg_ctl', '-l', "$tempdir/logfile", '-D',
+  "$tempdir/data", '-w', 'start']);
 command_exit_is([ 'pg_ctl', 'status', '-D', "$tempdir/data" ],
 	0, 'pg_ctl status with server running');
 
-system_or_bail 'pg_ctl', '-s', 'stop', '-D', "$tempdir/data", '-m', 'fast';
+run_or_bail(['pg_ctl', 'stop', '-D', "$tempdir/data", '-m', 'fast']);
diff --git a/src/bin/pg_rewind/.gitignore b/src/bin/pg_rewind/.gitignore
index 9ade7ef..79ddca3 100644
--- a/src/bin/pg_rewind/.gitignore
+++ b/src/bin/pg_rewind/.gitignore
@@ -4,4 +4,3 @@
 
 # Generated by test suite
 /tmp_check/
-/regress_log/
diff --git a/src/bin/pg_rewind/Makefile b/src/bin/pg_rewind/Makefile
index 7d60715..92b5d20 100644
--- a/src/bin/pg_rewind/Makefile
+++ b/src/bin/pg_rewind/Makefile
@@ -45,7 +45,7 @@ uninstall:
 
 clean distclean maintainer-clean:
 	rm -f pg_rewind$(X) $(OBJS) xlogreader.c
-	rm -rf tmp_check regress_log
+	rm -rf tmp_check
 
 check:
 	$(prove_check)
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 5219ec9..814bffa 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -79,7 +79,6 @@ mkdir "regress_log";
 my $port_master  = $ENV{PGPORT};
 my $port_standby = $port_master + 1;
 
-my $log_path;
 my $tempdir_short;
 
 my $connstr_master  = "port=$port_master";
@@ -91,14 +90,16 @@ sub master_psql
 {
 	my $cmd = shift;
 
-	system_or_bail("psql -q --no-psqlrc -d $connstr_master -c \"$cmd\"");
+	run_or_bail(['psql', '-q', '--no-psqlrc', '-d', $connstr_master,
+				 '-c', "$cmd"]);
 }
 
 sub standby_psql
 {
 	my $cmd = shift;
 
-	system_or_bail("psql -q --no-psqlrc -d $connstr_standby -c \"$cmd\"");
+	run_or_bail(['psql', '-q', '--no-psqlrc', '-d', $connstr_standby,
+				 '-c', "$cmd"]);
 }
 
 # Run a query against the master, and check that the output matches what's
@@ -171,16 +172,6 @@ sub append_to_file
 	close $fh;
 }
 
-sub init_rewind_test
-{
-	my $testname  = shift;
-	my $test_mode = shift;
-
-	$log_path = "regress_log/pg_rewind_log_${testname}_${test_mode}";
-
-	remove_tree $log_path;
-}
-
 sub setup_cluster
 {
 	$tempdir_short = tempdir_short;
@@ -209,8 +200,10 @@ max_connections = 10
 local replication all trust
 ));
 
-	system_or_bail(
-"pg_ctl -w -D $test_master_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_master\" start >>$log_path 2>&1"
+	run_or_bail(
+		['pg_ctl' , '-w', '-D' , $test_master_datadir,
+		 "-o", "-k $tempdir_short --listen-addresses='' -p $port_master",
+		 'start']
 	);
 
 	#### Now run the test-specific parts to initialize the master before setting
@@ -225,8 +218,9 @@ sub create_standby
 	remove_tree $test_standby_datadir;
 
 	# Base backup is taken with xlog files included
-	system_or_bail(
-"pg_basebackup -D $test_standby_datadir -p $port_master -x >>$log_path 2>&1");
+	run_or_bail(
+		['pg_basebackup', '-D', $test_standby_datadir, '-p',
+		 $port_master, '-x']);
 	append_to_file(
 		"$test_standby_datadir/recovery.conf", qq(
 primary_conninfo='$connstr_master application_name=rewind_standby'
@@ -235,8 +229,9 @@ recovery_target_timeline='latest'
 ));
 
 	# Start standby
-	system_or_bail(
-"pg_ctl -w -D $test_standby_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_standby\" start >>$log_path 2>&1"
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_standby_datadir , '-o',
+		 "-k $tempdir_short --listen-addresses='' -p $port_standby", 'start']
 	);
 
 	# Wait until the standby has caught up with the primary, by polling
@@ -255,8 +250,8 @@ sub promote_standby
 	# Now promote slave and insert some new data on master, this will put
 	# the master out-of-sync with the standby. Wait until the standby is
 	# out of recovery mode, and is ready to accept read-write connections.
-	system_or_bail(
-		"pg_ctl -w -D $test_standby_datadir promote >>$log_path 2>&1");
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_standby_datadir, 'promote']);
 	poll_query_until("SELECT NOT pg_is_in_recovery()", $connstr_standby)
 	  or die "Timed out while waiting for promotion of standby";
 
@@ -274,8 +269,8 @@ sub run_pg_rewind
 	my $test_mode = shift;
 
 	# Stop the master and be ready to perform the rewind
-	system_or_bail(
-		"pg_ctl -w -D $test_master_datadir stop -m fast >>$log_path 2>&1");
+	run_or_bail(
+		['pg_ctl', '-D', $test_master_datadir, 'stop', '-m', 'fast']);
 
 	# At this point, the rewind processing is ready to run.
 	# We now have a very simple scenario with a few diverged WAL record.
@@ -294,31 +289,25 @@ sub run_pg_rewind
 
 		# Do rewind using a local pgdata as source
 		# Stop the master and be ready to perform the rewind
-		system_or_bail(
-			"pg_ctl -w -D $test_standby_datadir stop -m fast >>$log_path 2>&1"
+		run_or_bail(
+			['pg_ctl', '-D', $test_standby_datadir, 'stop', '-m', 'fast']
 		);
-		my $result = run(
+		my $result = run_log(
 			[   'pg_rewind',
 				"--debug",
 				"--source-pgdata=$test_standby_datadir",
-				"--target-pgdata=$test_master_datadir" ],
-			'>>',
-			$log_path,
-			'2>&1');
+				"--target-pgdata=$test_master_datadir" ]);
 		ok($result, 'pg_rewind local');
 	}
 	elsif ($test_mode eq "remote")
 	{
 
 		# Do rewind using a remote connection as source
-		my $result = run(
+		my $result = run_log(
 			[   'pg_rewind',
 				"--source-server",
 				"port=$port_standby dbname=postgres",
-				"--target-pgdata=$test_master_datadir" ],
-			'>>',
-			$log_path,
-			'2>&1');
+				"--target-pgdata=$test_master_datadir" ]);
 		ok($result, 'pg_rewind remote');
 	}
 	else
@@ -342,8 +331,9 @@ recovery_target_timeline='latest'
 ));
 
 	# Restart the master to check that rewind went correctly
-	system_or_bail(
-"pg_ctl -w -D $test_master_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_master\" start >>$log_path 2>&1"
+	run_or_bail(
+		['pg_ctl', '-w', '-D', $test_master_datadir, '-o',
+		 "-k $tempdir_short --listen-addresses='' -p $port_master", 'start']
 	);
 
 	#### Now run the test-specific parts to check the result
@@ -354,13 +344,15 @@ sub clean_rewind_test
 {
 	if ($test_master_datadir)
 	{
-		system
-		  "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
+		run_log(
+		  ['pg_ctl', '-D', $test_master_datadir, '-m', 'immediate',
+		   'stop']);
 	}
 	if ($test_standby_datadir)
 	{
-		system
-		  "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
+		run_log(
+		  ['pg_ctl', '-D', $test_standby_datadir, '-m', 'immediate',
+		   'stop']);
 	}
 }
 
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index f60368b..be7d887 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -9,7 +9,6 @@ sub run_test
 {
 	my $test_mode = shift;
 
-	RewindTest::init_rewind_test('basic', $test_mode);
 	RewindTest::setup_cluster();
 
 	# Create a test table and insert a row in master.
diff --git a/src/bin/pg_rewind/t/002_databases.pl b/src/bin/pg_rewind/t/002_databases.pl
index 7564fa9..b0b007a 100644
--- a/src/bin/pg_rewind/t/002_databases.pl
+++ b/src/bin/pg_rewind/t/002_databases.pl
@@ -9,7 +9,6 @@ sub run_test
 {
 	my $test_mode = shift;
 
-	RewindTest::init_rewind_test('databases', $test_mode);
 	RewindTest::setup_cluster();
 
 	# Create a database in master.
diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl
index 9a95268..0cd0ac4 100644
--- a/src/bin/pg_rewind/t/003_extrafiles.pl
+++ b/src/bin/pg_rewind/t/003_extrafiles.pl
@@ -14,7 +14,6 @@ sub run_test
 {
 	my $test_mode = shift;
 
-	RewindTest::init_rewind_test('extrafiles', $test_mode);
 	RewindTest::setup_cluster();
 
 	my $test_master_datadir = $RewindTest::test_master_datadir;
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index ef42366..98fefce 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -12,6 +12,8 @@ our @EXPORT = qw(
   restart_test_server
   psql
   system_or_bail
+  run_or_bail
+  run_log
 
   command_ok
   command_fails
@@ -24,11 +26,20 @@ our @EXPORT = qw(
 );
 
 use Cwd;
+use File::Basename;
 use File::Spec;
 use File::Temp ();
 use IPC::Run qw(run start);
 use Test::More;
 
+# Log file location. For each test, the log file name uses the name of the
+# file launching this module, without its file prefix for readability.
+my $log_path = 'tmp_check/log';
+mkdir 'tmp_check';
+mkdir $log_path;
+my $test_logfile = basename($0);
+$test_logfile =~ s/\.[^.]+$//;
+$test_logfile = "$log_path/regress_log_$test_logfile";
 
 # Set to untranslated messages, to be able to compare program output
 # with expected strings.
@@ -77,9 +88,9 @@ sub tempdir_short
 sub standard_initdb
 {
 	my $pgdata = shift;
-	system_or_bail("initdb -D '$pgdata' -A trust -N >/dev/null");
-	system_or_bail("$ENV{top_builddir}/src/test/regress/pg_regress",
-		'--config-auth', $pgdata);
+	run_or_bail(['initdb', '-D', "$pgdata", '-A' , 'trust', '-N']);
+	run_or_bail(["$ENV{top_builddir}/src/test/regress/pg_regress",
+		'--config-auth', $pgdata]);
 }
 
 my ($test_server_datadir, $test_server_logfile);
@@ -92,14 +103,14 @@ sub start_test_server
 	my $tempdir_short = tempdir_short;
 
 	standard_initdb "$tempdir/pgdata";
-	$ret = system 'pg_ctl', '-D', "$tempdir/pgdata", '-s', '-w', '-l',
+	$ret = run_log(['pg_ctl', '-D', "$tempdir/pgdata", '-w', '-l',
 	  "$tempdir/logfile", '-o',
-"--fsync=off -k $tempdir_short --listen-addresses='' --log-statement=all",
-	  'start';
+"--fsync=off -k \"$tempdir_short\" --listen-addresses='' --log-statement=all",
+					'start']);
 
-	if ($ret != 0)
+	if (!$ret)
 	{
-		system('cat', "$tempdir/logfile");
+		run_log(['cat', "$tempdir/logfile"]);
 		BAIL_OUT("pg_ctl failed");
 	}
 
@@ -110,16 +121,16 @@ sub start_test_server
 
 sub restart_test_server
 {
-	system 'pg_ctl', '-s', '-D', $test_server_datadir, '-w', '-l',
-	  $test_server_logfile, 'restart';
+	run_log(['pg_ctl', '-D', $test_server_datadir, '-w', '-l',
+	  $test_server_logfile, 'restart']);
 }
 
 END
 {
 	if ($test_server_datadir)
 	{
-		system 'pg_ctl', '-D', $test_server_datadir, '-s', '-w', '-m',
-		  'immediate', 'stop';
+		run_log(['pg_ctl', '-D', $test_server_datadir, '-m',
+		  'immediate', 'stop']);
 	}
 }
 
@@ -131,7 +142,22 @@ sub psql
 
 sub system_or_bail
 {
-	system(@_) == 0 or BAIL_OUT("system @_ failed: $?");
+	system(@_) == 0 or BAIL_OUT("system $_[0] failed: $?");
+}
+
+sub run_log
+{
+	my $result = run (@_, '>>', $test_logfile, '2>&1');
+	return $result;
+}
+
+sub run_or_bail
+{
+	my $result = run (@_, '>>', $test_logfile, '2>&1');
+	if (!$result)
+	{
+		BAIL_OUT("run $_[0] failed: $?");
+	}
 }
 
 
diff --git a/src/test/ssl/ServerSetup.pm b/src/test/ssl/ServerSetup.pm
index bbff99a..2293d1b 100644
--- a/src/test/ssl/ServerSetup.pm
+++ b/src/test/ssl/ServerSetup.pm
@@ -125,8 +125,6 @@ sub switch_server_cert
    # restart_test_server() because that overrides listen_addresses to only all
    # Unix domain socket connections.
 
-	system_or_bail 'pg_ctl', 'stop',  '-s', '-D', "$tempdir/pgdata", '-w';
-	system_or_bail 'pg_ctl', 'start', '-s', '-D', "$tempdir/pgdata", '-w',
-	  '-l',
-	  "$tempdir/logfile";
+	run_or_bail(['pg_ctl', 'stop', '-D', "$tempdir/pgdata"]);
+	run_or_bail(['pg_ctl', 'start', '-D', "$tempdir/pgdata", '-w']);
 }
-- 
2.4.4

#19Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#16)
Re: The real reason why TAP testing isn't ready for prime time

On Sat, Jun 20, 2015 at 6:53 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Jun 20, 2015 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

On 2015-06-19 11:16:18 -0400, Robert Haas wrote:

On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it's such a good idea for the postmaster to give
up waiting before all children are gone (postmaster.c:1722 in HEAD).

I doubt it.

Seconded. It's pretty bad to possibly not be able to start again after
stopping it. I don't see the benefit in not waiting - sure, the
poastmaster exits faster, but postgres hasn't shut down at that point...

Yeah. I'll see about fixing this. Hard to be sure if it will fix
hamster's issue though.

I have re-enabled the TAP tests in hamster. What happens in the next
couple of days will be interesting to see.he perl

And, we get a failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&amp;dt=2015-06-20%2017%3A59%3A01
I am not sure why buildfarm runs makes it more easily reproducible,
one of the reasons may be the perl scripts run underground.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#19)
Re: The real reason why TAP testing isn't ready for prime time

On Sun, Jun 21, 2015 at 7:06 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

And, we get a failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&amp;dt=2015-06-20%2017%3A59%3A01
I am not sure why buildfarm runs makes it more easily reproducible,
one of the reasons may be the perl scripts run underground.

hamster has not complained for a couple of weeks now, and the issue
was reproducible every 4~6 days:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&amp;br=HEAD
Hence let's consider the issue as resolved.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#20)
Re: The real reason why TAP testing isn't ready for prime time

Michael Paquier <michael.paquier@gmail.com> writes:

hamster has not complained for a couple of weeks now, and the issue
was reproducible every 4~6 days:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&amp;br=HEAD
Hence let's consider the issue as resolved.

Nah, I'm afraid not. We are definitely still seeing a lot of unexplained
pg_ctl failures; as a recent example,

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&amp;dt=2015-07-29%2018%3A00%3A19

Now maybe that's not the same thing as hamster's issue, but there's still
something funny going on. That's why I've been bitching about the lack of
debug information. I hope that we can make progress on understanding this
once Andrew pushes out the new buildfarm release (especially since some of
his machines are particularly prone to this type of failure, so I trust
he'll update those critters promptly).

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#21)
Re: The real reason why TAP testing isn't ready for prime time

On 07/29/2015 11:17 PM, Tom Lane wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

hamster has not complained for a couple of weeks now, and the issue
was reproducible every 4~6 days:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&amp;br=HEAD
Hence let's consider the issue as resolved.

Nah, I'm afraid not. We are definitely still seeing a lot of unexplained
pg_ctl failures; as a recent example,

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&amp;dt=2015-07-29%2018%3A00%3A19

Now maybe that's not the same thing as hamster's issue, but there's still
something funny going on. That's why I've been bitching about the lack of
debug information. I hope that we can make progress on understanding this
once Andrew pushes out the new buildfarm release (especially since some of
his machines are particularly prone to this type of failure, so I trust
he'll update those critters promptly).

I want to deal with running the tests on MSVC first. Meanwhile I'll drop
the latest run_build.pl on axolotl.

Anyone who wants to be on the bleeding edge can grab it from
<https://raw.githubusercontent.com/PGBuildFarm/client-code/master/run_build.pl&gt;

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers