Timeout control within tests

Started by Noah Mischalmost 4 years ago6 messages
#1Noah Misch
noah@leadboat.com
2 attachment(s)

On Fri, Feb 05, 2021 at 03:55:20PM -0500, Tom Lane wrote:

We have, almost invariably, regretted it when we tried to use short
timeouts in test cases.

More generally, sometimes people want to do things like run a test
under valgrind. So it's not just "underpowered machines" that may
need a generous timeout. Even if we did reduce the default, I'd
want a way (probably via an environment variable, cf PGCTLTIMEOUT)
to kick it back up.

I have a few use cases for that:

1. My buildfarm members have more and more competition for CPU and I/O.
Examples where I suspect animal slowness caused a 180s timeout:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-04-11%2003%3A11%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-26%2004%3A38%3A29

2. When I'm developing a change and I locally break a test in a way that leads
to a timeout, I like to be able to lower that timeout.

3. I want more tests to use the right timeout from the start. Low-timeout
tests appear at least a few times per year:
d03eeab Mon May 31 00:29:58 2021 -0700 Raise a timeout to 180s, in test 010_logical_decoding_timelines.pl.
388b959 Sat Feb 27 07:02:56 2021 -0800 Raise a timeout to 180s, in contrib/test_decoding.
08dde1b Tue Dec 22 11:10:12 2020 -0500 Increase timeout in 021_row_visibility.pl.
8961355 Sat Apr 25 18:45:27 2020 -0700 Raise a timeout to 180s, in test 003_recovery_targets.pl.
1db439a Mon Dec 10 20:15:42 2018 -0800 Raise some timeouts to 180s, in test code.

I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
timeout used in the places that currently hard-code 180s. TAP tests should
retrieve the value via $PostgreSQL::Test::Utils::timeout_default. pg_regress
tests should retrieve it via \getenv. I would like to back-patch the TAP
part, for cause (1). (The pg_regress part hasn't been a buildfarm problem,
and \getenv is new in v15.) Patches attached. I considered and excluded
other changes, for now:

a. I considered consolidating this with PGISOLATIONTIMEOUT (default 300). One
could remove the older variable entirely or make isolationtester use the
first-available of [PGISOLATIONTIMEOUT, 2 * PG_TEST_TIMEOUT_DEFAULT, 360].
Does anyone have an opinion on what, if anything, to do there?

b. I briefly made stats.sql accept PG_TEST_TIMEOUT_DEFAULT to override its
hard-coded 30s timeout. However, a higher timeout won't help when a UDP
buffer fills. If the test were structured to observe evidence of a vacant
UDP buffer before proceeding with the test stat messages, a higher timeout
could make more sense. I added a comment.

c. One could remove timeout-duration function arguments (e.g. from
pg_recvlogical_upto) and just have the function consult timeout_default.
This felt like highly-optional refactoring.

Attachments:

tap-timeout-default-var-v1.patchtext/plain; charset=us-asciiDownload
Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Introduce PG_TEST_TIMEOUT_DEFAULT for TAP suite can't-happen timeouts.
    
    Slow hosts may avoid load-induced, spurious failures by setting
    environment variable PG_TEST_TIMEOUT_DEFAULT to some number of seconds
    greater than 180.  Developers may see faster failures by setting that
    environment variable to some lesser number of seconds.  In tests, write
    $PostgreSQL::Test::Utils::timeout_default wherever the convention has
    been to write 180.  This change raises the default for some briefer
    timeouts.  Back-patch to v10 (all supported versions).
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/FIXME

diff --git a/contrib/amcheck/t/002_cic.pl b/contrib/amcheck/t/002_cic.pl
index d604def..b8e4ac7 100644
--- a/contrib/amcheck/t/002_cic.pl
+++ b/contrib/amcheck/t/002_cic.pl
@@ -18,7 +18,8 @@ my ($node, $result);
 #
 $node = PostgreSQL::Test::Cluster->new('CIC_test');
 $node->init;
-$node->append_conf('postgresql.conf', 'lock_timeout = 180000');
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
 $node->safe_psql('postgres', q(CREATE TABLE tbl(i int)));
diff --git a/contrib/amcheck/t/003_cic_2pc.pl b/contrib/amcheck/t/003_cic_2pc.pl
index f668ed3..e66ccd9 100644
--- a/contrib/amcheck/t/003_cic_2pc.pl
+++ b/contrib/amcheck/t/003_cic_2pc.pl
@@ -22,7 +22,8 @@ my ($node, $result);
 $node = PostgreSQL::Test::Cluster->new('CIC_2PC_test');
 $node->init;
 $node->append_conf('postgresql.conf', 'max_prepared_transactions = 10');
-$node->append_conf('postgresql.conf', 'lock_timeout = 180000');
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
 $node->safe_psql('postgres', q(CREATE TABLE tbl(i int)));
@@ -38,7 +39,7 @@ $node->safe_psql('postgres', q(CREATE TABLE tbl(i int)));
 
 my $main_in    = '';
 my $main_out   = '';
-my $main_timer = IPC::Run::timeout(180);
+my $main_timer = IPC::Run::timeout($PostgreSQL::Test::Utils::timeout_default);
 
 my $main_h =
   $node->background_psql('postgres', \$main_in, \$main_out,
@@ -52,7 +53,7 @@ pump $main_h until $main_out =~ /syncpoint1/ || $main_timer->is_expired;
 
 my $cic_in    = '';
 my $cic_out   = '';
-my $cic_timer = IPC::Run::timeout(180);
+my $cic_timer = IPC::Run::timeout($PostgreSQL::Test::Utils::timeout_default);
 my $cic_h =
   $node->background_psql('postgres', \$cic_in, \$cic_out,
 	$cic_timer, on_error_stop => 1);
@@ -113,9 +114,10 @@ PREPARE TRANSACTION 'persists_forever';
 ));
 $node->restart;
 
-my $reindex_in    = '';
-my $reindex_out   = '';
-my $reindex_timer = IPC::Run::timeout(180);
+my $reindex_in  = '';
+my $reindex_out = '';
+my $reindex_timer =
+  IPC::Run::timeout($PostgreSQL::Test::Utils::timeout_default);
 my $reindex_h =
   $node->background_psql('postgres', \$reindex_in, \$reindex_out,
 	$reindex_timer, on_error_stop => 1);
diff --git a/src/bin/pg_ctl/t/004_logrotate.pl b/src/bin/pg_ctl/t/004_logrotate.pl
index d290452..d73ce03 100644
--- a/src/bin/pg_ctl/t/004_logrotate.pl
+++ b/src/bin/pg_ctl/t/004_logrotate.pl
@@ -39,7 +39,7 @@ sub check_log_pattern
 	my $node     = shift;
 	my $lfname   = fetch_file_name($logfiles, $format);
 
-	my $max_attempts = 180 * 10;
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
 
 	my $logcontents;
 	for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
@@ -78,7 +78,7 @@ $node->start();
 $node->psql('postgres', 'SELECT 1/0');
 
 # might need to retry if logging collector process is slow...
-my $max_attempts = 180 * 10;
+my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
 
 my $current_logfiles;
 for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index dd065c7..6906d91 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -295,7 +295,8 @@ my %pgdump_runs = (
 			'--no-sync',
 			"--file=$tempdir/only_dump_test_table.sql",
 			'--table=dump_test.test_table',
-			'--lock-wait-timeout=1000000',
+			'--lock-wait-timeout='
+			  . (1000 * $PostgreSQL::Test::Utils::timeout_default),
 			'postgres',
 		],
 	},
diff --git a/src/bin/psql/t/010_tab_completion.pl b/src/bin/psql/t/010_tab_completion.pl
index 005961f..a549106 100644
--- a/src/bin/psql/t/010_tab_completion.pl
+++ b/src/bin/psql/t/010_tab_completion.pl
@@ -94,7 +94,7 @@ close $FH;
 my $in  = '';
 my $out = '';
 
-my $timer = timer(5);
+my $timer = timer($PostgreSQL::Test::Utils::timeout_default);
 
 my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
 
@@ -111,7 +111,7 @@ sub check_completion
 	# reset output collector
 	$out = "";
 	# restart per-command timer
-	$timer->start(5);
+	$timer->start($PostgreSQL::Test::Utils::timeout_default);
 	# send the data to be sent
 	$in .= $send;
 	# wait ...
@@ -442,7 +442,7 @@ check_completion("blarg \t\t", qr//, "check completion failure path");
 clear_query();
 
 # send psql an explicit \q to shut it down, else pty won't close properly
-$timer->start(5);
+$timer->start($PostgreSQL::Test::Utils::timeout_default);
 $in .= "\\q\n";
 finish $h or die "psql returned $?";
 $timer->reset;
diff --git a/src/bin/psql/t/020_cancel.pl b/src/bin/psql/t/020_cancel.pl
index 3224f8e..d57d342 100644
--- a/src/bin/psql/t/020_cancel.pl
+++ b/src/bin/psql/t/020_cancel.pl
@@ -46,12 +46,13 @@ SKIP: {
 	my $psql_pid;
 	until (-s "$tempdir/psql.pid" and ($psql_pid = PostgreSQL::Test::Utils::slurp_file("$tempdir/psql.pid")) =~ /^\d+\n/s)
 	{
-		($count++ < 180 * 100) or die "pid file did not appear";
+		($count++ < 100 * $PostgreSQL::Test::Utils::timeout_default)
+		  or die "pid file did not appear";
 		usleep(10_000)
 	}
 
 	# Send sleep command and wait until the server has registered it
-	$stdin = "select pg_sleep(180);\n";
+	$stdin = "select pg_sleep($PostgreSQL::Test::Utils::timeout_default);\n";
 	pump $h while length $stdin;
 	$node->poll_query_until('postgres', q{SELECT (SELECT count(*) FROM pg_stat_activity WHERE query ~ '^select pg_sleep') > 0;})
 	  or die "timed out";
diff --git a/src/bin/scripts/t/080_pg_isready.pl b/src/bin/scripts/t/080_pg_isready.pl
index e8436dc..c45ca66 100644
--- a/src/bin/scripts/t/080_pg_isready.pl
+++ b/src/bin/scripts/t/080_pg_isready.pl
@@ -18,8 +18,8 @@ my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
 $node->start;
 
-# use a long timeout for the benefit of very slow buildfarm machines
-$node->command_ok([qw(pg_isready --timeout=60)],
+$node->command_ok(
+	[ 'pg_isready', "--timeout=$PostgreSQL::Test::Utils::timeout_default" ],
 	'succeeds with server running');
 
 done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index ed70eff..c6fe4ca 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -36,7 +36,8 @@ PostgreSQL::Test::Cluster - class representing PostgreSQL server instance
   my ($stdout, $stderr, $timed_out);
   my $cmdret = $node->psql('postgres', 'SELECT pg_sleep(600)',
 	  stdout => \$stdout, stderr => \$stderr,
-	  timeout => 180, timed_out => \$timed_out,
+	  timeout => $PostgreSQL::Test::Utils::timeout_default,
+	  timed_out => \$timed_out,
 	  extra_params => ['--single-transaction'],
 	  on_error_die => 1)
   print "Sleep timed out" if $timed_out;
@@ -1725,7 +1726,8 @@ e.g.
 	my ($stdout, $stderr, $timed_out);
 	my $cmdret = $node->psql('postgres', 'SELECT pg_sleep(600)',
 		stdout => \$stdout, stderr => \$stderr,
-		timeout => 180, timed_out => \$timed_out,
+		timeout => $PostgreSQL::Test::Utils::timeout_default,
+		timed_out => \$timed_out,
 		extra_params => ['--single-transaction'])
 
 will set $cmdret to undef and $timed_out to a true value.
@@ -1905,7 +1907,8 @@ scalar reference.  This allows the caller to act on other parts of the system
 while idling this backend.
 
 The specified timer object is attached to the harness, as well.  It's caller's
-responsibility to select the timeout length, and to restart the timer after
+responsibility to set the timeout length (usually
+$PostgreSQL::Test::Utils::timeout_default), and to restart the timer after
 each command if the timeout is per-command.
 
 psql is invoked in tuples-only unaligned mode with reading of B<.psqlrc>
@@ -1993,9 +1996,10 @@ The process's stdin is sourced from the $stdin scalar reference,
 and its stdout and stderr go to the $stdout scalar reference.
 ptys are used so that psql thinks it's being called interactively.
 
-The specified timer object is attached to the harness, as well.
-It's caller's responsibility to select the timeout length, and to
-restart the timer after each command if the timeout is per-command.
+The specified timer object is attached to the harness, as well.  It's caller's
+responsibility to set the timeout length (usually
+$PostgreSQL::Test::Utils::timeout_default), and to restart the timer after
+each command if the timeout is per-command.
 
 psql is invoked in tuples-only unaligned mode with reading of B<.psqlrc>
 disabled.  That may be overridden by passing extra psql parameters.
@@ -2311,7 +2315,7 @@ sub connect_fails
 Run B<$query> repeatedly, until it returns the B<$expected> result
 ('t', or SQL boolean true, by default).
 Continues polling if B<psql> returns an error result.
-Times out after 180 seconds.
+Times out after $PostgreSQL::Test::Utils::timeout_default seconds.
 Returns 1 if successful, 0 if timed out.
 
 =cut
@@ -2329,7 +2333,7 @@ sub poll_query_until
 		'-d',                             $self->connstr($dbname)
 	];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -2353,8 +2357,8 @@ sub poll_query_until
 		$attempts++;
 	}
 
-	# The query result didn't change in 180 seconds. Give up. Print the
-	# output from the last attempt, hopefully that's useful for debugging.
+	# Give up. Print the output from the last attempt, hopefully that's useful
+	# for debugging.
 	diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
@@ -2667,7 +2671,7 @@ sub wait_for_slot_catchup
 
 Waits for the contents of the server log file, starting at the given offset, to
 match the supplied regular expression.  Checks the entire log if no offset is
-given.  Times out after 180 seconds.
+given.  Times out after $PostgreSQL::Test::Utils::timeout_default seconds.
 
 If successful, returns the length of the entire log file, in bytes.
 
@@ -2678,7 +2682,7 @@ sub wait_for_log
 	my ($self, $regexp, $offset) = @_;
 	$offset = 0 unless defined $offset;
 
-	my $max_attempts = 180 * 10;
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -2693,7 +2697,6 @@ sub wait_for_log
 		$attempts++;
 	}
 
-	# The logs didn't match within 180 seconds. Give up.
 	croak "timed out waiting for match: $regexp";
 }
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 57fcb24..e8bf2e0 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -91,8 +91,8 @@ our @EXPORT = qw(
   $use_unix_sockets
 );
 
-our ($windows_os, $is_msys2, $use_unix_sockets, $tmp_check, $log_path,
-	$test_logfile);
+our ($windows_os, $is_msys2, $use_unix_sockets, $timeout_default,
+	$tmp_check, $log_path, $test_logfile);
 
 BEGIN
 {
@@ -157,6 +157,10 @@ BEGIN
 	# supported, but it can be overridden if desired.
 	$use_unix_sockets =
 	  (!$windows_os || defined $ENV{PG_TEST_USE_UNIX_SOCKETS});
+
+	$timeout_default = $ENV{PG_TEST_TIMEOUT_DEFAULT};
+	$timeout_default = 180
+	  if not defined $timeout_default or $timeout_default eq '';
 }
 
 =pod
diff --git a/src/test/perl/README b/src/test/perl/README
index 0511c55..4b160cc 100644
--- a/src/test/perl/README
+++ b/src/test/perl/README
@@ -23,6 +23,12 @@ tmp_check/log/ to get more info.  Files named 'regress_log_XXX' are log
 output from the perl test scripts themselves, and should be examined first.
 Other files are postmaster logs, and may be helpful as additional data.
 
+The tests default to a timeout of 180 seconds for many individual operations.
+Slow hosts may avoid load-induced, spurious failures by setting environment
+variable PG_TEST_TIMEOUT_DEFAULT to some number of seconds greater than 180.
+Developers may see faster failures by setting that environment variable to
+some lesser number of seconds.
+
 Data directories will also be left behind for analysis when a test fails;
 they are named according to the test filename.  But if the environment
 variable PG_TEST_NOCLEAN is set, data directories will be retained
diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 25dd5ee..e8e1a42 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -172,8 +172,8 @@ run_log(
 		$node_standby->logfile, 'start'
 	]);
 
-# wait up to 180s for postgres to terminate
-foreach my $i (0 .. 1800)
+# wait for postgres to terminate
+foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
 {
 	last if !-f $node_standby->data_dir . '/postmaster.pid';
 	usleep(100_000);
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index fa6bd45..9cec279 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -107,7 +107,8 @@ $node_primary->safe_psql('postgres',
 );
 
 my $stdout_recv = $node_primary->pg_recvlogical_upto(
-	'postgres', 'test_slot', $endpos, 180,
+	'postgres', 'test_slot', $endpos,
+	$PostgreSQL::Test::Utils::timeout_default,
 	'include-xids'     => '0',
 	'skip-empty-xacts' => '1');
 chomp($stdout_recv);
@@ -119,7 +120,8 @@ $node_primary->poll_query_until('postgres',
 ) or die "slot never became inactive";
 
 $stdout_recv = $node_primary->pg_recvlogical_upto(
-	'postgres', 'test_slot', $endpos, 180,
+	'postgres', 'test_slot', $endpos,
+	$PostgreSQL::Test::Utils::timeout_default,
 	'include-xids'     => '0',
 	'skip-empty-xacts' => '1');
 chomp($stdout_recv);
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 6e8b0b1..01ff31e 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -157,7 +157,7 @@ like(
 ($ret, $stdout, $stderr) = $node_replica->psql(
 	'postgres',
 	"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
-	timeout => 180);
+	timeout => $PostgreSQL::Test::Utils::timeout_default);
 is($ret, 0, 'replay from slot before_basebackup succeeds');
 
 my $final_expected_output_bb = q(BEGIN
@@ -186,7 +186,7 @@ my $endpos = $node_replica->safe_psql('postgres',
 
 $stdout = $node_replica->pg_recvlogical_upto(
 	'postgres', 'before_basebackup',
-	$endpos,    180,
+	$endpos,    $PostgreSQL::Test::Utils::timeout_default,
 	'include-xids'     => '0',
 	'skip-empty-xacts' => '1');
 
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index 3b740eb..a0bce59 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -18,11 +18,7 @@ use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
 
-# To avoid hanging while expecting some specific input from a psql
-# instance being driven by us, add a timeout high enough that it
-# should never trigger even on very slow machines, unless something
-# is really wrong.
-my $psql_timeout = IPC::Run::timer(60);
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
 
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
diff --git a/src/test/recovery/t/017_shm.pl b/src/test/recovery/t/017_shm.pl
index 678a252..f636efc 100644
--- a/src/test/recovery/t/017_shm.pl
+++ b/src/test/recovery/t/017_shm.pl
@@ -132,7 +132,7 @@ my $slow_client = IPC::Run::start(
 	\$stdout,
 	'2>',
 	\$stderr,
-	IPC::Run::timeout(900));    # five times the poll_query_until timeout
+	IPC::Run::timeout(5 * $PostgreSQL::Test::Utils::timeout_default));
 ok( $gnat->poll_query_until(
 		'postgres',
 		"SELECT 1 FROM pg_stat_activity WHERE query = '$slow_query'", '1'),
@@ -143,10 +143,11 @@ $gnat->kill9;
 unlink($gnat->data_dir . '/postmaster.pid');
 $gnat->rotate_logfile;    # on Windows, can't open old log for writing
 log_ipcs();
-# Reject ordinary startup.  Retry for the same reasons poll_start() does.
+# Reject ordinary startup.  Retry for the same reasons poll_start() does,
+# every 0.1s for at least $PostgreSQL::Test::Utils::timeout_default seconds.
 my $pre_existing_msg = qr/pre-existing shared memory block/;
 {
-	my $max_attempts = 180 * 10;    # Retry every 0.1s for at least 180s.
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
 	my $attempts     = 0;
 	while ($attempts < $max_attempts)
 	{
@@ -193,7 +194,7 @@ sub poll_start
 {
 	my ($node) = @_;
 
-	my $max_attempts = 180 * 10;
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -209,8 +210,8 @@ sub poll_start
 		$attempts++;
 	}
 
-	# No success within 180 seconds.  Try one last time without fail_ok, which
-	# will BAIL_OUT unless it succeeds.
+	# Try one last time without fail_ok, which will BAIL_OUT unless it
+	# succeeds.
 	$node->start && return 1;
 	return 0;
 }
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 4257bd4..0da7b83 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -291,7 +291,7 @@ my @result =
 		 SELECT pg_switch_wal();
 		 CHECKPOINT;
 		 SELECT 'finished';",
-		timeout => '60'));
+		timeout => $PostgreSQL::Test::Utils::timeout_default));
 is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
 $node_primary2->stop;
@@ -343,7 +343,7 @@ $logstart = get_log_size($node_primary3);
 kill 'STOP', $senderpid, $receiverpid;
 advance_wal($node_primary3, 2);
 
-my $max_attempts = 180;
+my $max_attempts = $PostgreSQL::Test::Utils::timeout_default;
 while ($max_attempts-- >= 0)
 {
 	if (find_in_log(
@@ -366,7 +366,7 @@ $node_primary3->poll_query_until('postgres',
 	"lost")
   or die "timed out waiting for slot to be lost";
 
-$max_attempts = 180;
+$max_attempts = $PostgreSQL::Test::Utils::timeout_default;
 while ($max_attempts-- >= 0)
 {
 	if (find_in_log(
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
index e274351..8d97db5 100644
--- a/src/test/recovery/t/021_row_visibility.pl
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -32,11 +32,8 @@ $node_standby->init_from_backup($node_primary, $backup_name,
 $node_standby->append_conf('postgresql.conf', 'max_prepared_transactions=10');
 $node_standby->start;
 
-# To avoid hanging while expecting some specific input from a psql
-# instance being driven by us, add a timeout high enough that it
-# should never trigger even on very slow machines, unless something
-# is really wrong.
-my $psql_timeout = IPC::Run::timer(300);
+my $psql_timeout =
+  IPC::Run::timer(2 * $PostgreSQL::Test::Utils::timeout_default);
 
 # One psql to primary and standby each, for all queries. That allows
 # to check uncommitted changes being replicated and such.
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index 6ab3092..d295061 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -15,11 +15,7 @@ if ($Config{osname} eq 'MSWin32')
 	exit;
 }
 
-# To avoid hanging while expecting some specific input from a psql
-# instance being driven by us, add a timeout high enough that it
-# should never trigger even on very slow machines, unless something
-# is really wrong.
-my $psql_timeout = IPC::Run::timer(60);
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
 
 my $node = PostgreSQL::Test::Cluster->new('node_crash');
 $node->init();
diff --git a/src/test/recovery/t/024_archive_recovery.pl b/src/test/recovery/t/024_archive_recovery.pl
index c10bb5b..ce347e0 100644
--- a/src/test/recovery/t/024_archive_recovery.pl
+++ b/src/test/recovery/t/024_archive_recovery.pl
@@ -81,8 +81,8 @@ sub test_recovery_wal_level_minimal
 			$recovery_node->logfile,  'start'
 		]);
 
-	# Wait up to 180s for postgres to terminate
-	foreach my $i (0 .. 1800)
+	# wait for postgres to terminate
+	foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
 	{
 		last if !-f $recovery_node->data_dir . '/postmaster.pid';
 		usleep(100_000);
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 9f221fc..6561b18 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -58,7 +58,7 @@ is($result, qq(2|2|2), 'check initial data was copied to subscriber');
 my $in  = '';
 my $out = '';
 
-my $timer = IPC::Run::timeout(180);
+my $timer = IPC::Run::timeout($PostgreSQL::Test::Utils::timeout_default);
 
 my $h = $node_publisher->background_psql('postgres', \$in, \$out, $timer,
 	on_error_stop => 0);
regress-timeout-default-var-v1.patchtext/plain; charset=us-asciiDownload
Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Use PG_TEST_TIMEOUT_DEFAULT for pg_regress suite can't-happen timeouts.
    
    Currently, only contrib/test_decoding has this property.  Use \getenv to
    load the timeout value.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/FIXME

diff --git a/contrib/test_decoding/expected/twophase.out b/contrib/test_decoding/expected/twophase.out
index e5e0f96..e89dc74 100644
--- a/contrib/test_decoding/expected/twophase.out
+++ b/contrib/test_decoding/expected/twophase.out
@@ -137,7 +137,10 @@ WHERE locktype = 'relation'
 (3 rows)
 
 -- The above CLUSTER command shouldn't cause a timeout on 2pc decoding.
-SET statement_timeout = '180s';
+\set env_timeout ''
+\getenv env_timeout PG_TEST_TIMEOUT_DEFAULT
+SELECT COALESCE(NULLIF(:'env_timeout', ''), '180') || 's' AS timeout \gset
+SET statement_timeout = :'timeout';
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
                                    data                                    
 ---------------------------------------------------------------------------
diff --git a/contrib/test_decoding/sql/twophase.sql b/contrib/test_decoding/sql/twophase.sql
index 05f18e8..aff5114 100644
--- a/contrib/test_decoding/sql/twophase.sql
+++ b/contrib/test_decoding/sql/twophase.sql
@@ -69,7 +69,10 @@ FROM pg_locks
 WHERE locktype = 'relation'
   AND relation = 'test_prepared1'::regclass;
 -- The above CLUSTER command shouldn't cause a timeout on 2pc decoding.
-SET statement_timeout = '180s';
+\set env_timeout ''
+\getenv env_timeout PG_TEST_TIMEOUT_DEFAULT
+SELECT COALESCE(NULLIF(:'env_timeout', ''), '180') || 's' AS timeout \gset
+SET statement_timeout = :'timeout';
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 RESET statement_timeout;
 COMMIT PREPARED 'test_prepared_lock';
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 3e9ab09..b7416c8 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -34,7 +34,10 @@ declare
   updated3 bool;
   updated4 bool;
 begin
-  -- we don't want to wait forever; loop will exit after 30 seconds
+  -- We don't want to wait forever.  No timeout suffices if the OS drops our
+  -- stats traffic because an earlier test file left a full UDP buffer.
+  -- Hence, don't use PG_TEST_TIMEOUT_DEFAULT, which may be large for
+  -- can't-happen timeouts.  Exit after 30 seconds.
   for i in 1 .. 300 loop
 
     -- With parallel query, the seqscan and indexscan on tenk2 might be done
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 82e6f24..dbc2dd2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -33,7 +33,10 @@ declare
   updated3 bool;
   updated4 bool;
 begin
-  -- we don't want to wait forever; loop will exit after 30 seconds
+  -- We don't want to wait forever.  No timeout suffices if the OS drops our
+  -- stats traffic because an earlier test file left a full UDP buffer.
+  -- Hence, don't use PG_TEST_TIMEOUT_DEFAULT, which may be large for
+  -- can't-happen timeouts.  Exit after 30 seconds.
   for i in 1 .. 300 loop
 
     -- With parallel query, the seqscan and indexscan on tenk2 might be done
#2Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#1)
Re: Timeout control within tests

Hi,

On 2022-02-17 21:28:42 -0800, Noah Misch wrote:

I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
timeout used in the places that currently hard-code 180s.

Meson's test runner has the concept of a "timeout multiplier" for ways of
running tests. Meson's stuff is about entire tests (i.e. one tap test), so
doesn't apply here, but I wonder if we shouldn't do something similar? That
way we could adjust different timeouts with one setting, instead of many
different fobs to adjust?

Greetings,

Andres Freund

#3Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#2)
Re: Timeout control within tests

On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:

On 2022-02-17 21:28:42 -0800, Noah Misch wrote:

I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
timeout used in the places that currently hard-code 180s.

Meson's test runner has the concept of a "timeout multiplier" for ways of
running tests. Meson's stuff is about entire tests (i.e. one tap test), so
doesn't apply here, but I wonder if we shouldn't do something similar?

Hmmm. It is good if the user can express an intent that continues to make
sense if we change the default timeout. For the buildfarm use case, a
multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
beats PG_TEST_TIMEOUT_DEFAULT=18000). For the hacker use case, an absolute
value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
PG_TEST_TIMEOUT_MULTIPLIER=.016666).

That
way we could adjust different timeouts with one setting, instead of many
different fobs to adjust?

I expect multiplier vs. absolute value doesn't change the expected number of
settings. If this change proceeds, we'd have three: PG_TEST_TIMEOUT_DEFAULT,
PGCTLTIMEOUT, and PGISOLATIONTIMEOUT. PGCTLTIMEOUT is separate for conceptual
reasons, and PGISOLATIONTIMEOUT is separate for historical reasons. There's
little use case for setting them to unequal values. If Meson can pass down
the overall timeout in effect for the test file, we could compute all three
variables from the passed-down value. Orthogonal to Meson, as I mentioned, we
could eliminate PGISOLATIONTIMEOUT.

timeouts.spec used to have substantial timeouts that had to elapse for the
test to pass. (Commit 741d7f1 ended that era.) A multiplier would have been
a good fit for that use case. If a similar test came back, we'd likely want
two multipliers, a low one for elapsing timeouts and a high one for
non-elapsing timeouts. A multiplier of 10-100 is reasonable for non-elapsing
timeouts, with the exact value being irrelevant on the buildfarm. Setting an
elapsing timeout higher than necessary causes measurable waste.

One could argue for offering both a multiplier variable and an absolute-value
variable. If there's just one variable, I think the absolute-value variable
is more compelling, due to the aforementioned hacker use case. What do you
think?

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#3)
Re: Timeout control within tests

Noah Misch <noah@leadboat.com> writes:

On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:

Meson's test runner has the concept of a "timeout multiplier" for ways of
running tests. Meson's stuff is about entire tests (i.e. one tap test), so
doesn't apply here, but I wonder if we shouldn't do something similar?

Hmmm. It is good if the user can express an intent that continues to make
sense if we change the default timeout. For the buildfarm use case, a
multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
beats PG_TEST_TIMEOUT_DEFAULT=18000). For the hacker use case, an absolute
value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
PG_TEST_TIMEOUT_MULTIPLIER=.016666).

FWIW, I'm fairly sure that PGISOLATIONTIMEOUT=300 was selected after
finding that smaller values didn't work reliably in the buildfarm.
Now maybe 741d7f1 fixed that, but I wouldn't count on it. So while I
approve of the idea to remove PGISOLATIONTIMEOUT in favor of using this
centralized setting, I think that we might need to have a multiplier
there, or else we'll end up with PG_TEST_TIMEOUT_DEFAULT set to 300
across the board. Perhaps the latter is fine, but a multiplier seems a
bit more flexible.

On the other hand, I also support your point that an absolute setting
is easier to think about / adjust for special uses. So maybe we should
just KISS and use a single absolute setting until we find a hard reason
why that doesn't work well.

regards, tom lane

#5Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#4)
Re: Timeout control within tests

On Fri, Feb 18, 2022 at 10:26:52AM -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:

Meson's test runner has the concept of a "timeout multiplier" for ways of
running tests. Meson's stuff is about entire tests (i.e. one tap test), so
doesn't apply here, but I wonder if we shouldn't do something similar?

Hmmm. It is good if the user can express an intent that continues to make
sense if we change the default timeout. For the buildfarm use case, a
multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
beats PG_TEST_TIMEOUT_DEFAULT=18000). For the hacker use case, an absolute
value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
PG_TEST_TIMEOUT_MULTIPLIER=.016666).

FWIW, I'm fairly sure that PGISOLATIONTIMEOUT=300 was selected after
finding that smaller values didn't work reliably in the buildfarm.
Now maybe 741d7f1 fixed that, but I wouldn't count on it. So while I
approve of the idea to remove PGISOLATIONTIMEOUT in favor of using this
centralized setting, I think that we might need to have a multiplier
there, or else we'll end up with PG_TEST_TIMEOUT_DEFAULT set to 300
across the board. Perhaps the latter is fine, but a multiplier seems a
bit more flexible.

The PGISOLATIONTIMEOUT replacement was 2*timeout_default, so isolation suites
would get 2*180s=360s. (I don't want to lower any default timeouts, but I
don't mind raising them.) In a sense, PG_TEST_TIMEOUT_DEFAULT is a multiplier
with as many sites as possible multiplying it by 1. The patch has multiples
at two code sites.

Show quoted text

On the other hand, I also support your point that an absolute setting
is easier to think about / adjust for special uses. So maybe we should
just KISS and use a single absolute setting until we find a hard reason
why that doesn't work well.

#6Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#5)
1 attachment(s)
Re: Timeout control within tests

(I pushed the main patch as f2698ea, on 2022-03-04.)

On Fri, Feb 18, 2022 at 06:41:36PM -0800, Noah Misch wrote:

On Fri, Feb 18, 2022 at 10:26:52AM -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:

Meson's test runner has the concept of a "timeout multiplier" for ways of
running tests. Meson's stuff is about entire tests (i.e. one tap test), so
doesn't apply here, but I wonder if we shouldn't do something similar?

Hmmm. It is good if the user can express an intent that continues to make
sense if we change the default timeout. For the buildfarm use case, a
multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
beats PG_TEST_TIMEOUT_DEFAULT=18000). For the hacker use case, an absolute
value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
PG_TEST_TIMEOUT_MULTIPLIER=.016666).

FWIW, I'm fairly sure that PGISOLATIONTIMEOUT=300 was selected after
finding that smaller values didn't work reliably in the buildfarm.
Now maybe 741d7f1 fixed that, but I wouldn't count on it. So while I
approve of the idea to remove PGISOLATIONTIMEOUT in favor of using this
centralized setting, I think that we might need to have a multiplier
there, or else we'll end up with PG_TEST_TIMEOUT_DEFAULT set to 300
across the board. Perhaps the latter is fine, but a multiplier seems a
bit more flexible.

The PGISOLATIONTIMEOUT replacement was 2*timeout_default, so isolation suites
would get 2*180s=360s. (I don't want to lower any default timeouts, but I
don't mind raising them.) In a sense, PG_TEST_TIMEOUT_DEFAULT is a multiplier
with as many sites as possible multiplying it by 1. The patch has multiples
at two code sites.

Here's the PGISOLATIONTIMEOUT replacement patch. I waffled on whether to
back-patch. Since it affects only isolation suite testing, only on systems
too slow for the default timeout, it's not a major decision. I currently plan
not to back-patch, since slow systems that would have wanted a back-patch can
just set both variables.

Attachments:

PGISOLATIONTIMEOUT-decom-v1.patchtext/plain; charset=us-asciiDownload
Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Replace PGISOLATIONTIMEOUT with 2 * PG_TEST_TIMEOUT_DEFAULT.
    
    Now that the more-generic variable exists, use it.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/20220219024136.GA3670392@rfd.leadboat.com

diff --git a/src/test/isolation/README b/src/test/isolation/README
index 8457a56..5818ca5 100644
--- a/src/test/isolation/README
+++ b/src/test/isolation/README
@@ -47,10 +47,10 @@ pg_isolation_regress is a tool similar to pg_regress, but instead of using
 psql to execute a test, it uses isolationtester.  It accepts all the same
 command-line arguments as pg_regress.
 
-By default, isolationtester will wait at most 300 seconds (5 minutes)
+By default, isolationtester will wait at most 360 seconds (6 minutes)
 for any one test step to complete.  If you need to adjust this, set
-the environment variable PGISOLATIONTIMEOUT to the desired timeout
-in seconds.
+the environment variable PG_TEST_TIMEOUT_DEFAULT to half the desired
+timeout in seconds.
 
 
 Test specification
@@ -138,10 +138,11 @@ Each step may contain commands that block until further action has been taken
 deadlock).  A test that uses this ability must manually specify valid
 permutations, i.e. those that would not expect a blocked session to execute a
 command.  If a test fails to follow that rule, isolationtester will cancel it
-after PGISOLATIONTIMEOUT seconds.  If the cancel doesn't work, isolationtester
-will exit uncleanly after a total of twice PGISOLATIONTIMEOUT.  Testing
-invalid permutations should be avoided because they can make the isolation
-tests take a very long time to run, and they serve no useful testing purpose.
+after 2 * PG_TEST_TIMEOUT_DEFAULT seconds.  If the cancel doesn't work,
+isolationtester will exit uncleanly after a total of 4 *
+PG_TEST_TIMEOUT_DEFAULT.  Testing invalid permutations should be avoided
+because they can make the isolation tests take a very long time to run, and
+they serve no useful testing purpose.
 
 Note that isolationtester recognizes that a command has blocked by looking
 to see if it is shown as waiting in the pg_locks view; therefore, only
diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 12179f2..095db8f 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -46,7 +46,7 @@ static int	nconns = 0;
 static bool any_new_notice = false;
 
 /* Maximum time to wait before giving up on a step (in usec) */
-static int64 max_step_wait = 300 * USECS_PER_SEC;
+static int64 max_step_wait = 360 * USECS_PER_SEC;
 
 
 static void check_testspec(TestSpec *testspec);
@@ -128,12 +128,12 @@ main(int argc, char **argv)
 		conninfo = "dbname = postgres";
 
 	/*
-	 * If PGISOLATIONTIMEOUT is set in the environment, adopt its value (given
-	 * in seconds) as the max time to wait for any one step to complete.
+	 * If PG_TEST_TIMEOUT_DEFAULT is set, adopt its value (given in seconds)
+	 * as half the max time to wait for any one step to complete.
 	 */
-	env_wait = getenv("PGISOLATIONTIMEOUT");
+	env_wait = getenv("PG_TEST_TIMEOUT_DEFAULT");
 	if (env_wait != NULL)
-		max_step_wait = ((int64) atoi(env_wait)) * USECS_PER_SEC;
+		max_step_wait = 2 * ((int64) atoi(env_wait)) * USECS_PER_SEC;
 
 	/* Read the test spec from stdin */
 	spec_yyparse();