Increasing timeout of poll_query_until for TAP tests
Hi all,
Lately hamster has been failing every 4-5 days on the recovery regression
tests in 003 covering the recovery targets, with this:
# Postmaster PID for node "standby_2" is 20510
#
Timed out while waiting for standby to catch up at
t/003_recovery_targets.pl line 36.
Which means that poll_query_until timed out while waiting for the standby to catch up.
Here is an example of test that failed:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-24%2016%3A00%3A07
This buildfarm machine is legendarily known for its slowness, and I
don't see a better answer to that than increasing the max timeout of
poll_query_until to put it back to green.
Thoughts?
--
Michael
Attachments:
fix-recovery-tap-failures.patch (application/octet-stream)
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 5767919..e629373 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1192,7 +1192,7 @@ sub poll_query_until
{
my ($self, $dbname, $query) = @_;
- my $max_attempts = 90;
+ my $max_attempts = 180;
my $attempts = 0;
my ($stdout, $stderr);
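For context, poll_query_until retries its query roughly once per second until
the query returns true, so bumping max_attempts from 90 to 180 doubles the
effective timeout from about 90s to about 180s. A simplified sketch of the
loop (not the actual implementation, which runs psql through IPC::Run and
captures stdout/stderr itself):

sub poll_query_until_sketch
{
	my ($self, $dbname, $query) = @_;
	my $max_attempts = 180;    # was 90 before this patch
	my $attempts     = 0;

	while ($attempts < $max_attempts)
	{
		# The polled query is expected to return 't' once the condition holds.
		my $result = $self->safe_psql($dbname, $query);
		return 1 if $result eq 't';

		sleep 1;    # one-second pause between attempts
		$attempts++;
	}
	return 0;    # timed out
}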
Michael Paquier wrote:
Lately hamster has been failing every 4-5 days on the recovery regression
tests in 003 covering the recovery targets, with this:
# Postmaster PID for node "standby_2" is 20510
#
Timed out while waiting for standby to catch up at
t/003_recovery_targets.pl line 36.
Which means that poll_query_until timed out while waiting for the standby to catch up.
Here is an example of test that failed:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-24%2016%3A00%3A07
Yeah, thanks, pushed. However this doesn't explain all the failures we see:
1) In
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-14%2016%3A00%3A06
we see the pg_basebackup test failing. I suppose that failure is also
because of slowness, though of course this patch won't fix it.
2) In
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamster&dt=2016-06-29%2016%3A00%3A06&stg=recovery-check
we see a completely different failure:
error running SQL: 'psql:<stdin>:1: ERROR: relation "tab_int" does not exist
LINE 1: SELECT count(*) FROM tab_int
^'
while running 'psql -XAtq -d port=52824 host=/tmp/or2xHglniM dbname=postgres -f - -v ON_ERROR_STOP=1' at /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1166.
Do we have an explanation for this one?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 25, 2016 at 2:38 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Yeah, thanks, pushed. However this doesn't explain all the failures we see:
I missed those ones, thanks for the reminder.
1) In
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-14%2016%3A00%3A06
we see the pg_basebackup test failing. I suppose that failure is also
because of slowness, though of course this patch won't fix it.
That's from 010_pg_basebackup... And I am not sure what's behind that.
Would it be possible to add --verbose to the pg_basebackup commands? We
may be able to catch the problem if it shows up again.
2) In
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamster&dt=2016-06-29%2016%3A00%3A06&stg=recovery-check
we see a completely different failure:
error running SQL: 'psql:<stdin>:1: ERROR: relation "tab_int" does not exist
LINE 1: SELECT count(*) FROM tab_int
^'
while running 'psql -XAtq -d port=52824 host=/tmp/or2xHglniM dbname=postgres -f - -v ON_ERROR_STOP=1' at /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1166.
Do we have an explanation for this one?
Ah, yes, and that's a stupid mistake. We had better use
replay_location instead of write_location. There is a risk that
records have not been replayed yet even if they have been written on
the standby, so it is possible that the query looking at tab_int may
not see this relation.
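For reference, the catch-up poll in 001_stream_rep.pl would then look roughly
like this (a sketch only; a proper patch follows in my next message, and
$applname_1 is the standby's application_name as used in that test):

# Wait until everything written on the master has been *replayed* on the
# standby, not merely received, so that queries on the standby can see it.
my $caughtup_query =
  "SELECT pg_current_xlog_location() <= replay_location"
  . " FROM pg_stat_replication WHERE application_name = '$applname_1';";
$node_master->poll_query_until('postgres', $caughtup_query)
  or die "Timed out while waiting for standby 1 to catch up";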
--
Michael
On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Ah, yes, and that's a stupid mistake. We had better use
replay_location instead of write_location. There is a risk that
records have not been replayed yet even if they have been written on
the standby, so it is possible that the query looking at tab_int may
not see this relation.
Or in short, the attached fixes 2) and will help provide input for 1).
--
Michael
Attachments:
tap-fixes.patch (application/x-patch)
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 6c33936..4b301d0 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -47,7 +47,7 @@ print CONF "wal_level = replica\n";
close CONF;
$node->restart;
-$node->command_ok([ 'pg_basebackup', '-D', "$tempdir/backup" ],
+$node->command_ok([ 'pg_basebackup', '--verbose', '-D', "$tempdir/backup" ],
'pg_basebackup runs');
ok(-f "$tempdir/backup/PG_VERSION", 'backup was created');
@@ -57,13 +57,14 @@ is_deeply(
'no WAL files copied');
$node->command_ok(
- [ 'pg_basebackup', '-D', "$tempdir/backup2", '--xlogdir',
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/backup2", '--xlogdir',
"$tempdir/xlog2" ],
'separate xlog directory');
ok(-f "$tempdir/backup2/PG_VERSION", 'backup was created');
ok(-d "$tempdir/xlog2/", 'xlog directory was created');
-$node->command_ok([ 'pg_basebackup', '-D', "$tempdir/tarbackup", '-Ft' ],
+$node->command_ok(
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/tarbackup", '-Ft' ],
'tar format');
ok(-f "$tempdir/tarbackup/base.tar", 'backup tar was created');
@@ -116,7 +117,8 @@ SKIP:
"CREATE TABLESPACE tblspc1 LOCATION '$shorter_tempdir/tblspc1';");
$node->safe_psql('postgres',
"CREATE TABLE test1 (a int) TABLESPACE tblspc1;");
- $node->command_ok([ 'pg_basebackup', '-D', "$tempdir/tarbackup2", '-Ft' ],
+ $node->command_ok(
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/tarbackup2", '-Ft' ],
'tar format with tablespaces');
ok(-f "$tempdir/tarbackup2/base.tar", 'backup tar was created');
my @tblspc_tars = glob "$tempdir/tarbackup2/[0-9]*.tar";
@@ -127,7 +129,7 @@ SKIP:
'plain format with tablespaces fails without tablespace mapping');
$node->command_ok(
- [ 'pg_basebackup', '-D', "$tempdir/backup1", '-Fp',
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/backup1", '-Fp',
"-T$shorter_tempdir/tblspc1=$tempdir/tbackup/tblspc1" ],
'plain format with tablespaces succeeds with tablespace mapping');
ok(-d "$tempdir/tbackup/tblspc1", 'tablespace was relocated');
@@ -146,7 +148,7 @@ SKIP:
$node->safe_psql('postgres',
"CREATE TABLESPACE tblspc2 LOCATION '$shorter_tempdir/tbl=spc2';");
$node->command_ok(
- [ 'pg_basebackup', '-D', "$tempdir/backup3", '-Fp',
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/backup3", '-Fp',
"-T$shorter_tempdir/tbl\\=spc2=$tempdir/tbackup/tbl\\=spc2" ],
'mapping tablespace with = sign in path');
ok(-d "$tempdir/tbackup/tbl=spc2",
@@ -157,12 +159,13 @@ SKIP:
$node->safe_psql('postgres',
"CREATE TABLESPACE tblspc3 LOCATION '$tempdir/$superlongname';");
$node->command_ok(
- [ 'pg_basebackup', '-D', "$tempdir/tarbackup_l3", '-Ft' ],
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/tarbackup_l3", '-Ft' ],
'pg_basebackup tar with long symlink target');
$node->safe_psql('postgres', "DROP TABLESPACE tblspc3;");
}
-$node->command_ok([ 'pg_basebackup', '-D', "$tempdir/backupR", '-R' ],
+$node->command_ok(
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/backupR", '-R' ],
'pg_basebackup -R runs');
ok(-f "$tempdir/backupR/recovery.conf", 'recovery.conf was created');
my $recovery_conf = slurp_file "$tempdir/backupR/recovery.conf";
@@ -180,18 +183,19 @@ like(
'recovery.conf sets primary_conninfo');
$node->command_ok(
- [ 'pg_basebackup', '-D', "$tempdir/backupxf", '-X', 'fetch' ],
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/backupxf", '-X', 'fetch' ],
'pg_basebackup -X fetch runs');
ok(grep(/^[0-9A-F]{24}$/, slurp_dir("$tempdir/backupxf/pg_xlog")),
'WAL files copied');
$node->command_ok(
- [ 'pg_basebackup', '-D', "$tempdir/backupxs", '-X', 'stream' ],
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/backupxs", '-X',
+ 'stream' ],
'pg_basebackup -X stream runs');
ok(grep(/^[0-9A-F]{24}$/, slurp_dir("$tempdir/backupxf/pg_xlog")),
'WAL files copied');
$node->command_fails(
- [ 'pg_basebackup', '-D', "$tempdir/fail", '-S', 'slot1' ],
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/fail", '-S', 'slot1' ],
'pg_basebackup with replication slot fails without -X stream');
$node->command_fails(
[ 'pg_basebackup', '-D',
@@ -207,8 +211,8 @@ my $lsn = $node->safe_psql('postgres',
);
is($lsn, '', 'restart LSN of new slot is null');
$node->command_ok(
- [ 'pg_basebackup', '-D', "$tempdir/backupxs_sl", '-X',
- 'stream', '-S', 'slot1' ],
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/backupxs_sl", '-X',
+ 'stream', '-S', 'slot1' ],
'pg_basebackup -X stream with replication slot runs');
$lsn = $node->safe_psql('postgres',
q{SELECT restart_lsn FROM pg_replication_slots WHERE slot_name = 'slot1'}
@@ -216,8 +220,8 @@ $lsn = $node->safe_psql('postgres',
like($lsn, qr!^0/[0-9A-Z]{7,8}$!, 'restart LSN of slot has advanced');
$node->command_ok(
- [ 'pg_basebackup', '-D', "$tempdir/backupxs_sl_R", '-X',
- 'stream', '-S', 'slot1', '-R' ],
+ [ 'pg_basebackup', '--verbose', '-D', "$tempdir/backupxs_sl_R", '-X',
+ 'stream', '-S', 'slot1', '-R' ],
'pg_basebackup with replication slot and -R runs');
like(
slurp_file("$tempdir/backupxs_sl_R/recovery.conf"),
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 7b42f21..e707d94 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -38,11 +38,11 @@ $node_master->safe_psql('postgres',
my $applname_1 = $node_standby_1->name;
my $applname_2 = $node_standby_2->name;
my $caughtup_query =
-"SELECT pg_current_xlog_location() <= write_location FROM pg_stat_replication WHERE application_name = '$applname_1';";
+"SELECT pg_current_xlog_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_1';";
$node_master->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby 1 to catch up";
$caughtup_query =
-"SELECT pg_last_xlog_replay_location() <= write_location FROM pg_stat_replication WHERE application_name = '$applname_2';";
+"SELECT pg_last_xlog_replay_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_2';";
$node_standby_1->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby 2 to catch up";
On Mon, Jul 25, 2016 at 10:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Ah, yes, and that's a stupid mistake. We had better use
replay_location instead of write_location. There is a risk that
records have not been replayed yet even if they have been written on
the standby, so it is possible that the query looking at tab_int may
not see this relation.
Or in short, the attached fixes 2) and will help provide input for 1).
Increasing the timeout had zero effect for test 003:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-26%2016%3A00%3A07
So we're facing something else. I'll dig deeper into that manually on
hamster.
--
Michael
On Wed, Jul 27, 2016 at 10:00 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Jul 25, 2016 at 10:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Ah, yes, and that's a stupid mistake. We had better use
replay_location instead of write_location. There is a risk that
records have not been replayed yet even if they have been written on
the standby, so it is possible that the query looking at tab_int may
not see this relation.
Or in short, the attached fixes 2) and will help provide input for 1).
Increasing the timeout had zero effect for test 003:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-26%2016%3A00%3A07
So we're facing something else. I'll dig deeper into that manually on
hamster.
And so I have finally been able to reproduce this one, and it is a
timing issue in the test.
First see the failure itself:
LOG: recovery stopping after commit of transaction 548, time
2016-08-01 21:14:37.647104+09
LOG: recovery has paused
HINT: Execute pg_xlog_replay_resume() to continue.
LOG: statement: SELECT '0/30122D0'::pg_lsn <= pg_last_xlog_replay_location()
[keeps waiting for this LSN to be replayed]
But by looking at the WAL records of this failed test I could see the following
rmgr: Transaction len (rec/tot): 8/34, tx: 547, lsn: 0/03012248, prev 0/03012208, desc: COMMIT 2016-08-01 21:14:37.514805 JST
rmgr: Transaction len (rec/tot): 8/34, tx: 548, lsn: 0/03012270, prev 0/03012248, desc: COMMIT 2016-08-01 21:14:37.647104 JST
rmgr: Standby len (rec/tot): 24/50, tx: 0, lsn: 0/03012298, prev 0/03012270, desc: RUNNING_XACTS nextXid 549 latestCompletedXid 548 oldestRunningXid 549
rmgr: Heap len (rec/tot): 3/59, tx: 549, lsn: 0/030122D0, prev 0/03012298, desc: INSERT off 193, blkref #0: rel 1663/12404/16384 blk 8
It is not a surprise that this keeps waiting forever: as recovery is
paused before that point, it cannot replay the wanted LSN.
Here using pg_xlog_replay_resume() is not the correct solution because
this would cause the node to finish recovery before we want it to, and
neither is recovery_target_action = 'promote'. If we look at the test, it
is doing the following when getting the TXID that is used as recovery
target:
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
my $recovery_txid =
$node_master->safe_psql('postgres', "SELECT txid_current()");
my $lsn2 =
$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
What I think we had better do is reverse the calls to
pg_current_xlog_location() and txid_current(), so that we are sure that
the LSN we track for replay is lower than the real target LSN. The
same problem exists when defining the timestamp target.
The attached patch does that, and it also fixes the issue in test
001 where write_location should not be used. With this patch
I have let test 003 run for two hours in a row and it did not
complain. Previously I was able to see a failure by the 3rd or 4th
attempt, within 30 minutes.
There is still an issue with pg_basebackup when testing stream mode
and replication slots. I am digging into this one now..
--
Michael
Attachments:
recovery-test-fixes.patch (application/x-download)
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 7b42f21..e707d94 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -38,11 +38,11 @@ $node_master->safe_psql('postgres',
my $applname_1 = $node_standby_1->name;
my $applname_2 = $node_standby_2->name;
my $caughtup_query =
-"SELECT pg_current_xlog_location() <= write_location FROM pg_stat_replication WHERE application_name = '$applname_1';";
+"SELECT pg_current_xlog_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_1';";
$node_master->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby 1 to catch up";
$caughtup_query =
-"SELECT pg_last_xlog_replay_location() <= write_location FROM pg_stat_replication WHERE application_name = '$applname_2';";
+"SELECT pg_last_xlog_replay_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$applname_2';";
$node_standby_1->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby 2 to catch up";
diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 20b878e..6fd8656 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -66,17 +66,17 @@ $node_master->backup('my_backup');
# target TXID.
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-my $recovery_txid =
- $node_master->safe_psql('postgres', "SELECT txid_current()");
my $lsn2 =
$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+my $recovery_txid =
+ $node_master->safe_psql('postgres', "SELECT txid_current()");
# More data, with recovery target timestamp
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(2001,3000))");
-my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");
my $lsn3 =
$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");
# Even more data, this time with a recovery target name
$node_master->safe_psql('postgres',
On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
There is still an issue with pg_basebackup when testing stream mode
and replication slots. I am digging into this one now..
After 5 hours of running this test in a row and 30 attempts at torturing
hamster with a script running make check in an infinite loop with set
-e, I am giving up on that for the time being... I have added the patch
that makes the tests more stable to the next CF so it is not forgotten:
https://commitfest.postgresql.org/10/693/
--
Michael
Michael Paquier wrote:
On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
There is still an issue with pg_basebackup when testing stream mode
and replication slots. I am digging into this one now..
After 5 hours of running this test in a row and 30 attempts at torturing
hamster with a script running make check in an infinite loop with set
-e, I am giving up on that for the time being... I have added the patch
that makes the tests more stable to the next CF so it is not forgotten:
https://commitfest.postgresql.org/10/693/
Great, thanks for the effort, will push.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Michael Paquier wrote:
Here using pg_xlog_replay_resume() is not the correct solution because
this would cause the node to finish recovery before we want it to, and
neither is recovery_target_action = 'promote'. If we look at the test, it
is doing the following when getting the TXID that is used as recovery
target:
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
my $recovery_txid =
$node_master->safe_psql('postgres', "SELECT txid_current()");
my $lsn2 =
$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
What I think we had better do is reverse the calls to
pg_current_xlog_location() and txid_current(), so that we are sure that
the LSN we track for replay is lower than the real target LSN. The
same problem exists when defining the timestamp target.
The patch attached does that,
Why not capture both items in a single select, such as in the attached
patch?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
recovery-test-fixes-2.patch (text/x-diff)
diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 20b878e..3864e60 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -66,17 +66,16 @@ $node_master->backup('my_backup');
# target TXID.
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-my $recovery_txid =
- $node_master->safe_psql('postgres', "SELECT txid_current()");
-my $lsn2 =
- $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+my $ret =
+ $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), txid_current();");
+my ($lsn2, $recovery_txid) = split /\|/, $ret;
# More data, with recovery target timestamp
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(2001,3000))");
-my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");
-my $lsn3 =
- $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+$ret =
+ $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), now();");
+my ($lsn3, $recovery_time) = split /\|/, $ret;
# Even more data, this time with a recovery target name
$node_master->safe_psql('postgres',
On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Michael Paquier wrote:
Here using pg_xlog_replay_resume() is not the correct solution because
this would cause the node to finish recovery before we want it to, and
neither is recovery_target_action = 'promote'. If we look at the test, it
is doing the following when getting the TXID that is used as recovery
target:
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
my $recovery_txid =
$node_master->safe_psql('postgres', "SELECT txid_current()");
my $lsn2 =
$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
What I think we had better do is reverse the calls to
pg_current_xlog_location() and txid_current(), so that we are sure that
the LSN we track for replay is lower than the real target LSN. The
same problem exists when defining the timestamp target.
The patch attached does that,
Why not capture both items in a single select, such as in the attached
patch?
Let me test this....
[... A while after ...]
This looks to work properly. 12 runs in a row have passed.
--
Michael
Michael Paquier wrote:
On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Why not capture both items in a single select, such as in the attached
patch?
Let me test this....
[... A while after ...]
This looks to work properly. 12 runs in a row have passed.
Okay, applied that way.
BTW, long one-line queries look awful in that Perl code. I don't
propose to change anything now, but I propose that long queries be
split using here-docs in new code:
$node->safe_psql('postgres', <<EQ);
SELECT foo
FROM bar
EQ
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 4, 2016 at 2:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Michael Paquier wrote:
On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Why not capture both items in a single select, such as in the attached
patch?
Let me test this....
[... A while after ...]
This looks to work properly. 12 runs in a row have passed.
Okay, applied that way.
BTW, long one-line queries look awful in that Perl code. I don't
propose to change anything now, but I propose that long queries be
split using here-docs in new code:
$node->safe_psql('postgres', <<EQ);
SELECT foo
FROM bar
EQ
Yep, that would be a good idea. I didn't know this syntax existed. Or
use qq() directly.
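To make that concrete, here is a small illustrative sketch of both styles
(the table name and values are just placeholders, not taken from the actual
tests):

# here-doc form: the query body is written verbatim across several lines
my $count = $node->safe_psql('postgres', <<'EOSQL');
SELECT count(*)
FROM tab_int
WHERE a > 1000;
EOSQL

# qq() form: also multi-line, and it interpolates Perl variables
my $min = 1000;
my $count2 = $node->safe_psql('postgres',
	qq(SELECT count(*) FROM tab_int WHERE a > $min));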
--
Michael
On Thu, Aug 4, 2016 at 6:56 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Thu, Aug 4, 2016 at 2:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Michael Paquier wrote:
On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Why not capture both items in a single select, such as in the attached
patch?
Let me test this....
[... A while after ...]
This looks to work properly. 12 runs in a row have passed.
Okay, applied that way.
BTW, long one-line queries look awful in that Perl code. I don't
propose to change anything now, but I propose that long queries be
split using here-docs in new code:
$node->safe_psql('postgres', <<EQ);
SELECT foo
FROM bar
EQ
Yep, that would be a good idea. I didn't know this syntax existed. Or
use qq() directly.
hamster has not failed even once for two weeks now, so I think that
we're good. Let's see if the problem with pg_basebackup pops up once
again.
--
Michael
On Tue, Aug 02, 2016 at 06:21:03PM -0400, Alvaro Herrera wrote:
Michael Paquier wrote:
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
my $recovery_txid =
$node_master->safe_psql('postgres', "SELECT txid_current()");
my $lsn2 =
$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
What I think we had better do is reverse the calls to
pg_current_xlog_location() and txid_current(), so that we are sure that
the LSN we track for replay is lower than the real target LSN. The
same problem exists when defining the timestamp target.
The patch attached does that,
Why not capture both items in a single select, such as in the attached
patch?
-my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");
-my $lsn3 =
-  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+$ret =
+  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), now();");
+my ($lsn3, $recovery_time) = split /\|/, $ret;
Since now() is transaction_timestamp(), $recovery_time precedes or equals
$lsn3, and this didn't close the race. Using clock_timestamp() here would
work, as does using separate transactions like recovery-test-fixes.patch did.
I'll shortly push a fix for this and a similar ordering problem in the
standby_5 test, which first appeared subsequent to this thread.
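For illustration only (a sketch of the clock_timestamp() variant, not
necessarily the exact fix I will push), keeping the variable names the test
already uses:

# clock_timestamp() advances during the transaction, unlike now(), which
# is transaction_timestamp(); the recorded time therefore tracks the
# moment the LSN is captured rather than the transaction's start.
$ret = $node_master->safe_psql('postgres',
	"SELECT pg_current_wal_lsn(), clock_timestamp();");
my ($lsn3, $recovery_time) = split /\|/, $ret;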
On Sun, Dec 31, 2017 at 09:52:27PM -0800, Noah Misch wrote:
Since now() is transaction_timestamp(), $recovery_time precedes or equals
$lsn3, and this didn't close the race. Using clock_timestamp() here would
work, as does using separate transactions like recovery-test-fixes.patch did.
I'll shortly push a fix for this and a similar ordering problem in the
standby_5 test, which first appeared subsequent to this thread.
As recovery_target_inclusive is true by default, my conclusion on the
matter, which my tests on hamster (the now-dead buildfarm animal) seemed
to confirm, was that just getting a timestamp at least as recent as the
LSN from the same transaction was enough to fix all the failures. And
hamster was really slow. I can follow why your patch logically makes
sense, so I agree that this is sane. Have you spotted failures on the
buildfarm? This is hard to spot for people who only have access to what
the buildfarm UI offers, so any reference to failures on this thread
would be welcome.
--
Michael
On Mon, Jan 01, 2018 at 07:55:37PM +0900, Michael Paquier wrote:
On Sun, Dec 31, 2017 at 09:52:27PM -0800, Noah Misch wrote:
Since now() is transaction_timestamp(), $recovery_time precedes or equals
$lsn3, and this didn't close the race. Using clock_timestamp() here would
work, as does using separate transactions like recovery-test-fixes.patch did.
I'll shortly push a fix for this and a similar ordering problem in the
standby_5 test, which first appeared subsequent to this thread.
As recovery_target_inclusive is true by default, my conclusion on the
matter, which my tests on hamster (the now-dead buildfarm animal) seemed
to confirm, was that just getting a timestamp at least as recent as the
LSN from the same transaction was enough to fix all the failures. And
hamster was really slow. I can follow why your patch logically makes
sense, so I agree that this is sane. Have you spotted failures on the
buildfarm?
No, but I checked only the last 90 days. Earlier master (e.g. git checkout
6078770^) with the following patch reproduces the failures on every run:
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -71,8 +71,8 @@ my ($lsn2, $recovery_txid) = split /\|/, $ret;
$node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(2001,3000))");
$ret =
- $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn(), now();");
-my ($lsn3, $recovery_time) = split /\|/, $ret;
+ $node_master->safe_psql('postgres', "SELECT pg_sleep(80), pg_current_wal_lsn(), now();");
+my ($delay_for_autovacuum, $lsn3, $recovery_time) = split /\|/, $ret;
# Even more data, this time with a recovery target name
$node_master->safe_psql('postgres',
@@ -88,6 +88,7 @@ $node_master->safe_psql('postgres',
"INSERT INTO tab_int VALUES (generate_series(4001,5000))");
my $recovery_lsn =
$node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_master->safe_psql('postgres', 'VACUUM'); # write some WAL
my $lsn5 =
$node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn();");