Too-short timeouts in test code

Started by Noah Mischabout 7 years ago4 messages

noah@leadboat.com

about 7 years ago

1 attachment(s)

We've had more buildfarm failures due to hard-coded, short timeouts:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2018-10-13%2021%3A06%3A58
10s timeout while running pg_recvlogical
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-12-03%2005%3A52%3A12
30s timeout while running pg_recvlogical
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2018-11-30%2014%3A31%3A18
60s timeout in isolationtester try_complete_step()

The 180s timeout in poll_query_until has been trouble-free since 2a0f89c
introduced it two years ago. I plan to raise the timeouts in question to
180s, as attached.

Attachments:

test-timeout-180s-v1.patchtext/plain; charset=us-asciiDownload

diff --git a/src/test/isolation/README b/src/test/isolation/README
index bea278a..780b6dc 100644
--- a/src/test/isolation/README
+++ b/src/test/isolation/README
@@ -108,8 +108,8 @@ Each step may contain commands that block until further action has been taken
 deadlock).  A test that uses this ability must manually specify valid
 permutations, i.e. those that would not expect a blocked session to execute a
 command.  If a test fails to follow that rule, isolationtester will cancel it
-after 60 seconds.  If the cancel doesn't work, isolationtester will exit
-uncleanly after a total of 75 seconds of wait time.  Testing invalid
+after 180 seconds.  If the cancel doesn't work, isolationtester will exit
+uncleanly after a total of 200 seconds of wait time.  Testing invalid
 permutations should be avoided because they can make the isolation tests take
 a very long time to run, and they serve no useful testing purpose.
 
diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 7df67da..9134b05 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -783,15 +783,15 @@ try_complete_step(Step *step, int flags)
 			td += (int64) current_time.tv_usec - (int64) start_time.tv_usec;
 
 			/*
-			 * After 60 seconds, try to cancel the query.
+			 * After 180 seconds, try to cancel the query.
 			 *
 			 * If the user tries to test an invalid permutation, we don't want
 			 * to hang forever, especially when this is running in the
-			 * buildfarm.  So try to cancel it after a minute.  This will
-			 * presumably lead to this permutation failing, but remaining
-			 * permutations and tests should still be OK.
+			 * buildfarm.  This will presumably lead to this permutation
+			 * failing, but remaining permutations and tests should still be
+			 * OK.
 			 */
-			if (td > 60 * USECS_PER_SEC && !canceled)
+			if (td > 180 * USECS_PER_SEC && !canceled)
 			{
 				PGcancel   *cancel = PQgetCancel(conn);
 
@@ -808,15 +808,15 @@ try_complete_step(Step *step, int flags)
 			}
 
 			/*
-			 * After 75 seconds, just give up and die.
+			 * After 200 seconds, just give up and die.
 			 *
 			 * Since cleanup steps won't be run in this case, this may cause
 			 * later tests to fail.  That stinks, but it's better than waiting
 			 * forever for the server to respond to the cancel.
 			 */
-			if (td > 75 * USECS_PER_SEC)
+			if (td > 200 * USECS_PER_SEC)
 			{
-				fprintf(stderr, "step %s timed out after 75 seconds\n",
+				fprintf(stderr, "step %s timed out after 200 seconds\n",
 						step->name);
 				exit_nicely();
 			}
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index 884b0ae..c23cc4d 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -72,7 +72,7 @@ my $endpos = $node_master->safe_psql('postgres',
 print "waiting to replay $endpos\n";
 
 my $stdout_recv = $node_master->pg_recvlogical_upto(
-	'postgres', 'test_slot', $endpos, 10,
+	'postgres', 'test_slot', $endpos, 180,
 	'include-xids'     => '0',
 	'skip-empty-xacts' => '1');
 chomp($stdout_recv);
@@ -84,7 +84,7 @@ $node_master->poll_query_until('postgres',
 ) or die "slot never became inactive";
 
 $stdout_recv = $node_master->pg_recvlogical_upto(
-	'postgres', 'test_slot', $endpos, 10,
+	'postgres', 'test_slot', $endpos, 180,
 	'include-xids'     => '0',
 	'skip-empty-xacts' => '1');
 chomp($stdout_recv);
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 4fbd386..e582b20 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -183,7 +183,7 @@ my $endpos = $node_replica->safe_psql('postgres',
 
 $stdout = $node_replica->pg_recvlogical_upto(
 	'postgres', 'before_basebackup',
-	$endpos,    30,
+	$endpos,    180,
 	'include-xids'     => '0',
 	'skip-empty-xacts' => '1');

Michael Paquier

michael@paquier.xyz

about 7 years ago

In reply to: Noah Misch (#1)

Re: Too-short timeouts in test code

On Sat, Dec 08, 2018 at 04:16:01PM -0800, Noah Misch wrote:

The 180s timeout in poll_query_until has been trouble-free since 2a0f89c
introduced it two years ago. I plan to raise the timeouts in question to
180s, as attached.

That's annoying, thanks for tracking those down.

@@ -72,7 +72,7 @@ my $endpos = $node_master->safe_psql('postgres',
print "waiting to replay $endpos\n";

my $stdout_recv = $node_master->pg_recvlogical_upto(
- 'postgres', 'test_slot', $endpos, 10,
+ 'postgres', 'test_slot', $endpos, 180,
'include-xids' => '0',
'skip-empty-xacts' => '1');

Instead of allowing callers of pg_recvlogical_upto() to define the
timeout they want, perhaps it would be better to hardcode the timeout
limit within the routine?
--
Michael

Noah Misch

noah@leadboat.com

about 7 years ago

In reply to: Michael Paquier (#2)

Re: Too-short timeouts in test code

On Sun, Dec 09, 2018 at 11:14:12AM +0900, Michael Paquier wrote:

On Sat, Dec 08, 2018 at 04:16:01PM -0800, Noah Misch wrote:

@@ -72,7 +72,7 @@ my $endpos = $node_master->safe_psql('postgres',
print "waiting to replay $endpos\n";

my $stdout_recv = $node_master->pg_recvlogical_upto(
- 'postgres', 'test_slot', $endpos, 10,
+ 'postgres', 'test_slot', $endpos, 180,
'include-xids' => '0',
'skip-empty-xacts' => '1');

Instead of allowing callers of pg_recvlogical_upto() to define the
timeout they want, perhaps it would be better to hardcode the timeout
limit within the routine?

pg_recvlogical_upto() has the ability to make timeout non-fatal, by calling it
in an array context. No in-tree test uses that. Out-of-tree code using that
feature would likely benefit from the ability to set timeout duration. Hence,
I'm inclined not remove the timeout duration parameter.

Michael Paquier

michael@paquier.xyz

about 7 years ago

In reply to: Noah Misch (#3)

Re: Too-short timeouts in test code

On Sat, Dec 08, 2018 at 06:24:35PM -0800, Noah Misch wrote:

pg_recvlogical_upto() has the ability to make timeout non-fatal, by calling it
in an array context. No in-tree test uses that. Out-of-tree code using that
feature would likely benefit from the ability to set timeout duration. Hence,
I'm inclined not remove the timeout duration parameter.

No objections with this line of thoughts. The rest of the patch looks
fine to me.
--
Michael