TAP test to cover "EndOfLogTLI != replayTLI" case

Started by Amul Sul, about 4 years ago; 14 messages
#1Amul Sul
sulamul@gmail.com
1 attachment(s)

Hi,

The attached patch covers the case where the TLI in the filename of the
segment a record is being read from differs from the timeline the record
nominally belongs to. In other words, it covers the following case noted
in StartupXLOG():

/*
* EndOfLogTLI is the TLI in the filename of the XLOG segment containing
* the end-of-log. It could be different from the timeline that EndOfLog
* nominally belongs to, if there was a timeline switch in that segment,
* and we were reading the old WAL from a segment belonging to a higher
* timeline.
*/
EndOfLogTLI = xlogreader->seg.ws_tli;
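To make the scenario concrete: the first 8 hex digits of a WAL segment file name are the TLI. Here is a standalone sketch (not part of the patch) mirroring how XLogFileName() in xlog_internal.h builds a segment name, assuming the default 16MB wal_segment_size:

```perl
#!/usr/bin/perl
# Sketch of how a WAL segment file name encodes the timeline ID,
# mirroring XLogFileName() (assumption: default 16MB segments).
use strict;
use warnings;

my $wal_segsz       = 16 * 1024 * 1024;
my $segs_per_xlogid = int(0x100000000 / $wal_segsz);    # 256 for 16MB

sub xlog_file_name
{
	my ($tli, $lsn) = @_;
	my $segno = int($lsn / $wal_segsz);
	# TLI first, then the segment number split across two 8-digit fields
	return sprintf("%08X%08X%08X",
		$tli,
		int($segno / $segs_per_xlogid),
		$segno % $segs_per_xlogid);
}

# The same record location, read from segments of two different timelines:
print xlog_file_name(1, 0x5000000), "\n";    # 000000010000000000000005
print xlog_file_name(2, 0x5000000), "\n";    # 000000020000000000000005
```

The two printed names hold the same byte range; only the leading TLI field differs, which is exactly the mismatch the test tries to exercise.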

The test tries to set a recovery target just before the end-of-recovery WAL
record. Unfortunately, the test can't directly verify the EndOfLogTLI !=
replayTLI case, and I am not sure how to do that -- any suggestions would
be greatly appreciated. Even so, having this test should improve xlog.c
test coverage. Also, seen from another angle, the test covers the
combination of recovery_target_lsn with recovery_target_inclusive = off,
which isn't exercised anywhere as of now; that is why I added this test to
the 003_recovery_targets.pl file.

Thoughts? Suggestions?

--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-TAP-test-for-EndOfLogTLI.patch (application/x-patch)
From 4e2bbb37d4874c910494ba221c7be7e02a29c8c1 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 22 Nov 2021 09:16:51 -0500
Subject: [PATCH v1] TAP test for EndOfLogTLI

---
 src/test/recovery/t/003_recovery_targets.pl | 55 ++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 0d0636b85c0..b0d59a181e4 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
-use Test::More tests => 9;
+use Test::More tests => 10;
 use Time::HiRes qw(usleep);
 
 # Create and test a standby from given backup, with a certain recovery target.
@@ -182,3 +182,56 @@ $logfile = slurp_file($node_standby->logfile());
 ok( $logfile =~
 	  qr/FATAL: .* recovery ended before configured recovery target was reached/,
 	'recovery end before target reached is a fatal error');
+
+# Test to cover the case where the WAL record we are looking for ought to be
+# in, e.g., 000000010000000000000001, but we don't find it there; instead we
+# find it in 000000020000000000000003, for reasons such as: there was a
+# timeline switch in that segment and we were reading the old WAL from a
+# segment belonging to a higher timeline, or our recovery target timeline is
+# 2 or something that has 2 in its history.
+
+# Insert some more data into the primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(6001,7000))");
+my $lsn6 = $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_lsn()");
+
+# Set up a new standby and enable WAL archiving to archive WAL files at the
+# same location as the primary.
+my $archive_cmd = $node_primary->safe_psql('postgres',
+	"SELECT current_setting('archive_command')");
+$node_standby = PostgreSQL::Test::Cluster->new('standby_9');
+$node_standby->init_from_backup(
+	$node_primary, 'my_backup',
+	has_streaming => 1);
+$node_standby->append_conf(
+        'postgresql.conf', qq(
+archive_mode = on
+archive_command = '$archive_cmd'
+));
+$node_standby->start;
+# Wait until necessary replay has been done on standby
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('write'));
+$node_standby->promote;
+$node_standby->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(7001,8000))");
+# Force archiving of WAL file
+$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
+$node_standby->stop;
+
+# Another standby, whose recovery target LSN will be in a WAL file with
+# a different TLI than the one the target LSN belongs to.
+$node_standby = PostgreSQL::Test::Cluster->new('standby_10');
+$node_standby->init_from_backup(
+	$node_primary, 'my_backup',
+	has_restoring => 1);
+$node_standby->append_conf(
+        'postgresql.conf', qq(
+recovery_target_lsn = '$lsn6'
+recovery_target_inclusive = off
+));
+$node_standby->start;
+my $result = $node_standby->safe_psql('postgres',
+	"SELECT count(*) FROM tab_int");
+is($result, '7000', "check standby content before timeline switch $lsn6");
-- 
2.18.0

#2Andres Freund
andres@anarazel.de
In reply to: Amul Sul (#1)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

Hi,

On 2021-11-23 11:43:21 +0530, Amul Sul wrote:

The attached patch covers the case where the TLI in the filename of the
segment a record is being read from differs from the timeline the record
nominally belongs to. In other words, it covers the following case noted
in StartupXLOG():

Thoughts? Suggestions?

It seems the test isn't quite reliable. It occasionally fails on FreeBSD,
macOS, and Linux, and always fails on Windows (starting with the new CI
stuff; before that, the test wasn't run).

See https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3427

Greetings,

Andres Freund

#3Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#2)
1 attachment(s)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

On Mon, Jan 10, 2022 at 8:25 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-11-23 11:43:21 +0530, Amul Sul wrote:

The attached patch covers the case where the TLI in the filename of the
segment a record is being read from differs from the timeline the record
nominally belongs to. In other words, it covers the following case noted
in StartupXLOG():

Thoughts? Suggestions?

It seems the test isn't quite reliable. It occasionally fails on freebsd,
macos, linux and always on windows (starting with the new CI stuff, before the
test wasn't run).

See https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3427

Thanks for the note. I can see the same test failing on my CentOS VM too
with the latest master head (376ce3e404b). The cause of the failure is the
"recovery_target_inclusive = off" setting, which is unnecessary for this
test; the attached patch removes it.

Regards,
Amul

Attachments:

v2-0001-TAP-test-for-EndOfLogTLI.patch (application/x-patch)
From 88ae9ea5a33c1ecc5b5493dae9c016ef19fbf88f Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Sun, 9 Jan 2022 23:10:07 -0500
Subject: [PATCH v2] TAP test for EndOfLogTLI

---
 src/test/recovery/t/003_recovery_targets.pl | 52 ++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 24da78c0bcd..cf72b5d9343 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
-use Test::More tests => 9;
+use Test::More tests => 10;
 use Time::HiRes qw(usleep);
 
 # Create and test a standby from given backup, with a certain recovery target.
@@ -182,3 +182,53 @@ $logfile = slurp_file($node_standby->logfile());
 ok( $logfile =~
 	  qr/FATAL: .* recovery ended before configured recovery target was reached/,
 	'recovery end before target reached is a fatal error');
+
+# Test to cover the case where the WAL record we are looking for ought to be
+# in, e.g., 000000010000000000000001, but we don't find it there; instead we
+# find it in 000000020000000000000003, for reasons such as: there was a
+# timeline switch in that segment and we were reading the old WAL from a
+# segment belonging to a higher timeline, or our recovery target timeline is
+# 2 or something that has 2 in its history.
+
+# Insert some more data into the primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(6001,7000))");
+my $lsn6 = $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_lsn()");
+
+# Set up a new standby and enable WAL archiving to archive WAL files at the
+# same location as the primary.
+my $archive_cmd = $node_primary->safe_psql('postgres',
+	"SELECT current_setting('archive_command')");
+$node_standby = PostgreSQL::Test::Cluster->new('standby_9');
+$node_standby->init_from_backup(
+	$node_primary, 'my_backup',
+	has_streaming => 1);
+$node_standby->append_conf(
+        'postgresql.conf', qq(
+archive_mode = on
+archive_command = '$archive_cmd'
+));
+$node_standby->start;
+# Wait until necessary replay has been done on standby
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('write'));
+$node_standby->promote;
+$node_standby->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(7001,8000))");
+# Force archiving of WAL file
+$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
+$node_standby->stop;
+
+# Another standby, whose recovery target LSN will be in a WAL file with
+# a different TLI than the one the target LSN belongs to.
+$node_standby = PostgreSQL::Test::Cluster->new('standby_10');
+$node_standby->init_from_backup(
+	$node_primary, 'my_backup',
+	has_restoring => 1);
+$node_standby->append_conf(
+        'postgresql.conf', qq(recovery_target_lsn = '$lsn6'));
+$node_standby->start;
+my $result = $node_standby->safe_psql('postgres',
+	"SELECT count(*) FROM tab_int");
+is($result, '7000', "check standby content before timeline switch $lsn6");
-- 
2.18.0

#4Julien Rouhaud
rjuju123@gmail.com
In reply to: Amul Sul (#3)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

Hi,

On Mon, Jan 10, 2022 at 09:46:23AM +0530, Amul Sul wrote:

Thanks for the note. I can see the same test failing on my CentOS VM too
with the latest master head (376ce3e404b). The cause of the failure is the
"recovery_target_inclusive = off" setting, which is unnecessary for this
test; the attached patch removes it.

This version still fails on windows according to the cfbot:

https://cirrus-ci.com/task/5882621321281536

[18:14:02.639] c:\cirrus>call perl src/tools/msvc/vcregress.pl recoverycheck
[18:14:56.114]
[18:14:56.122] # Failed test 'check standby content before timeline switch 0/500FB30'
[18:14:56.122] # at t/003_recovery_targets.pl line 234.
[18:14:56.122] # got: '6000'
[18:14:56.122] # expected: '7000'

I'm switching the cf entry to Waiting on Author.

#5Amul Sul
sulamul@gmail.com
In reply to: Julien Rouhaud (#4)
1 attachment(s)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

On Sat, Jan 15, 2022 at 11:35 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hi,

On Mon, Jan 10, 2022 at 09:46:23AM +0530, Amul Sul wrote:

Thanks for the note. I can see the same test failing on my CentOS VM too
with the latest master head (376ce3e404b). The cause of the failure is the
"recovery_target_inclusive = off" setting, which is unnecessary for this
test; the attached patch removes it.

This version still fails on windows according to the cfbot:

https://cirrus-ci.com/task/5882621321281536

[18:14:02.639] c:\cirrus>call perl src/tools/msvc/vcregress.pl recoverycheck
[18:14:56.114]
[18:14:56.122] # Failed test 'check standby content before timeline switch 0/500FB30'
[18:14:56.122] # at t/003_recovery_targets.pl line 234.
[18:14:56.122] # got: '6000'
[18:14:56.122] # expected: '7000'

I'm switching the cf entry to Waiting on Author.

Thanks for the note.

I am not sure what really went wrong, but I think the 'standby_9' server
shut down too early, before it got a chance to archive a required WAL
file. The attached patch shuts that server down only after the required
WAL has been archived. Unfortunately, I couldn't test this on Windows;
let's see how the cfbot reacts to this version.
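As a side note, the polling condition in the patch works because WAL segment file names are fixed-width hex, so a plain string comparison orders them the same way as their (timeline, segment number) pairs. A small standalone illustration of that property (not part of the patch):

```perl
#!/usr/bin/perl
# Illustration: fixed-width hex WAL file names sort lexicographically in
# the same order as their timeline/segment numbers, which is what lets
# the test poll "last_archived_wal >= '$last_wal'" in pg_stat_archiver.
use strict;
use warnings;

my @archived = qw(
	000000020000000000000005
	000000020000000000000003
	000000020000000000000004
);
my @sorted = sort @archived;    # plain string sort == segment order here
print "$_\n" for @sorted;
```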

Regards,
Amul

Attachments:

v3-0001-TAP-test-for-EndOfLogTLI.patch (application/x-patch)
From 08d19c0ef2f464e8bf722ce13457acf3f9be47e8 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 17 Jan 2022 06:25:30 -0500
Subject: [PATCH v3] TAP test for EndOfLogTLI

---
 src/test/recovery/t/003_recovery_targets.pl | 57 ++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 24da78c0bcd..928799b9490 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
-use Test::More tests => 9;
+use Test::More tests => 10;
 use Time::HiRes qw(usleep);
 
 # Create and test a standby from given backup, with a certain recovery target.
@@ -182,3 +182,58 @@ $logfile = slurp_file($node_standby->logfile());
 ok( $logfile =~
 	  qr/FATAL: .* recovery ended before configured recovery target was reached/,
 	'recovery end before target reached is a fatal error');
+
+# Test to cover the case where the WAL record we are looking for ought to be
+# in, e.g., 000000010000000000000001, but we don't find it there; instead we
+# find it in 000000020000000000000003, for reasons such as: there was a
+# timeline switch in that segment and we were reading the old WAL from a
+# segment belonging to a higher timeline, or our recovery target timeline is
+# 2 or something that has 2 in its history.
+
+# Insert some more data into the primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(6001,7000))");
+my $lsn6 = $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_lsn()");
+
+# Set up a new standby and enable WAL archiving to archive WAL files at the
+# same location as the primary.
+my $archive_cmd = $node_primary->safe_psql('postgres',
+	"SELECT current_setting('archive_command')");
+$node_standby = PostgreSQL::Test::Cluster->new('standby_9');
+$node_standby->init_from_backup(
+	$node_primary, 'my_backup',
+	has_streaming => 1);
+$node_standby->append_conf(
+        'postgresql.conf', qq(
+archive_mode = on
+archive_command = '$archive_cmd'
+));
+$node_standby->start;
+# Wait until necessary replay has been done on standby
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('write'));
+$node_standby->promote;
+$node_standby->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(7001,8000))");
+# Force archiving of WAL file
+my $last_wal = $node_standby->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_switch_wal())");
+# Wait until this WAL file has been archived
+my $check_archive = "SELECT last_archived_wal >= '$last_wal' FROM pg_stat_archiver";
+$node_standby->poll_query_until('postgres', $check_archive)
+	or die "Timed out while waiting for $last_wal file archive";
+$node_standby->stop;
+
+# Another standby, whose recovery target LSN will be in a WAL file with
+# a different TLI than the one the target LSN belongs to.
+$node_standby = PostgreSQL::Test::Cluster->new('standby_10');
+$node_standby->init_from_backup(
+	$node_primary, 'my_backup',
+	has_restoring => 1);
+$node_standby->append_conf(
+        'postgresql.conf', qq(recovery_target_lsn = '$lsn6'));
+$node_standby->start;
+my $result = $node_standby->safe_psql('postgres',
+	"SELECT count(*) FROM tab_int");
+is($result, '7000', "check standby content before timeline switch $lsn6");
-- 
2.18.0

#6Julien Rouhaud
rjuju123@gmail.com
In reply to: Amul Sul (#5)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

Hi,

On Mon, Jan 17, 2022 at 05:07:48PM +0530, Amul Sul wrote:

I am not sure what really went wrong, but I think the 'standby_9' server
shut down too early, before it got a chance to archive a required WAL
file. The attached patch shuts that server down only after the required
WAL has been archived. Unfortunately, I couldn't test this on Windows;
let's see how the cfbot reacts to this version.

Thanks for the updated patch! Note that thanks to Andres' and Thomas'
work, you can now easily run the exact same CI as the cfbot on your own
GitHub repository, if you need to debug something on a platform you don't
have access to. You can check the documentation at [1] for more detail on
how to set up the CI.

[1]: https://github.com/postgres/postgres/blob/master/src/tools/ci/README

#7Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#6)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

Hi,

On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:

Thanks for the updated patch! Note that thanks to Andres' and Thomas'
work, you can now easily run the exact same CI as the cfbot on your own
GitHub repository, if you need to debug something on a platform you don't
have access to. You can check the documentation at [1] for more detail on
how to set up the CI.

The cfbot reports that the patch still fails on Windows but also failed on
macos with the same error: https://cirrus-ci.com/task/5655777858813952:

[14:20:43.950] # Failed test 'check standby content before timeline switch 0/500FAF8'
[14:20:43.950] # at t/003_recovery_targets.pl line 239.
[14:20:43.950] # got: '6000'
[14:20:43.950] # expected: '7000'
[14:20:43.950] # Looks like you failed 1 test of 10.

I'm switching the CF entry to Waiting on Author.

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Julien Rouhaud (#7)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

At Tue, 18 Jan 2022 12:25:15 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in

Hi,

On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:

Thanks for the updated patch! Note that thanks to Andres' and Thomas'
work, you can now easily run the exact same CI as the cfbot on your own
GitHub repository, if you need to debug something on a platform you don't
have access to. You can check the documentation at [1] for more detail on
how to set up the CI.

The cfbot reports that the patch still fails on Windows but also failed on
macos with the same error: https://cirrus-ci.com/task/5655777858813952:

[14:20:43.950] # Failed test 'check standby content before timeline switch 0/500FAF8'
[14:20:43.950] # at t/003_recovery_targets.pl line 239.
[14:20:43.950] # got: '6000'
[14:20:43.950] # expected: '7000'
[14:20:43.950] # Looks like you failed 1 test of 10.

I'm switching the CF entry to Waiting on Author.

The most significant advantages of the local CI setup are:

- CI responds immediately to a push.

- You can dirty the code with as much additional logging as you like, to
see closely what is going on. We are hesitant to do the same on this
ML :p

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9Amul Sul
sulamul@gmail.com
In reply to: Kyotaro Horiguchi (#8)
2 attachment(s)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

On Tue, Jan 18, 2022 at 10:31 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 18 Jan 2022 12:25:15 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in

Hi,

On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:

Thanks for the updated patch! Note that thanks to Andres' and Thomas'
work, you can now easily run the exact same CI as the cfbot on your own
GitHub repository, if you need to debug something on a platform you don't
have access to. You can check the documentation at [1] for more detail on
how to set up the CI.

The cfbot reports that the patch still fails on Windows but also failed on
macos with the same error: https://cirrus-ci.com/task/5655777858813952:

[14:20:43.950] # Failed test 'check standby content before timeline switch 0/500FAF8'
[14:20:43.950] # at t/003_recovery_targets.pl line 239.
[14:20:43.950] # got: '6000'
[14:20:43.950] # expected: '7000'
[14:20:43.950] # Looks like you failed 1 test of 10.

I'm switching the CF entry to Waiting on Author.

The most significant advantages of the local CI setup are:

- CI responds immediately to a push.

- You can dirty the code with as much additional logging as you like, to
see closely what is going on. We are hesitant to do the same on this
ML :p

Indeed :)

I found the cause of the test failure on Windows -- it is due to the
custom archive_command setting, which wasn't using the correct Windows
archive directory path.

There is no option to choose a custom WAL archive and restore path in the
TAP test framework. The attached 0001 patch proposes one, enabling you to
choose a custom WAL archive and restore directory. The only concern I have
is that $node->info() will print the wrong archive directory path in that
case; do we need to fix that? We might need to store archive_dir like
base_dir, but I wasn't sure about that. Thoughts? Comments?

Regards,
Amul

Attachments:

v4-0001-Allow-TAP-tests-to-choose-custom-WAL-archive-and-.patch (application/x-patch)
From 4bc968d3508e9f608fd572f40f39e2cb374d53f4 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Wed, 19 Jan 2022 23:22:02 -0500
Subject: [PATCH v4 1/3] Allow TAP tests to choose custom WAL archive and
 restore path.

---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index b7d4c24553..6a562db2ce 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1033,10 +1033,13 @@ primary_conninfo='$root_connstr'
 # Internal routine to enable archive recovery command on a standby node
 sub enable_restoring
 {
-	my ($self, $root_node, $standby) = @_;
-	my $path = PostgreSQL::Test::Utils::perl2host($root_node->archive_dir);
+	my ($self, $root_node, $standby, $path) = @_;
 	my $name = $self->name;
 
+	# Default archive directory will be used if not specified.
+	$path = $root_node->archive_dir if (!defined($path));
+	$path = PostgreSQL::Test::Utils::perl2host($path);
+
 	print "### Enabling WAL restore for node \"$name\"\n";
 
 	# On Windows, the path specified in the restore command needs to use
@@ -1101,10 +1104,13 @@ sub set_standby_mode
 # Internal routine to enable archiving
 sub enable_archiving
 {
-	my ($self) = @_;
-	my $path   = PostgreSQL::Test::Utils::perl2host($self->archive_dir);
+	my ($self, $path) = @_;
 	my $name   = $self->name;
 
+	# Default archive directory will be used if not specified.
+	$path = $self->archive_dir if (!defined($path));
+	$path = PostgreSQL::Test::Utils::perl2host($path);
+
 	print "### Enabling WAL archiving for node \"$name\"\n";
 
 	# On Windows, the path specified in the restore command needs to use
-- 
2.18.0

v4-0002-TAP-test-for-EndOfLogTLI.patch (application/x-patch)
From 6c730ee10b9e64d90982aa555283c9094a5fdb87 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Wed, 19 Jan 2022 23:25:07 -0500
Subject: [PATCH v4 2/3] TAP test for EndOfLogTLI

---
 src/test/recovery/t/003_recovery_targets.pl | 47 ++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 24da78c0bc..b3f7076540 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
-use Test::More tests => 9;
+use Test::More tests => 10;
 use Time::HiRes qw(usleep);
 
 # Create and test a standby from given backup, with a certain recovery target.
@@ -182,3 +182,48 @@ $logfile = slurp_file($node_standby->logfile());
 ok( $logfile =~
 	  qr/FATAL: .* recovery ended before configured recovery target was reached/,
 	'recovery end before target reached is a fatal error');
+
+# Test to cover the case where the WAL record we are looking for ought to be
+# in, e.g., 000000010000000000000005, but we don't find it there; instead we
+# find it in 000000020000000000000005, for reasons such as: there was a
+# timeline switch in that segment and we were reading the old WAL from a
+# segment belonging to a higher timeline, or our recovery target timeline is
+# 2 or something that has 2 in its history.
+
+# Insert some more data into the primary
+$node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(6001,7000))");
+my $lsn6 = $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_lsn()");
+
+# Set up a new standby and enable WAL archiving to archive WAL files at the
+# same location as the primary.
+my $archive_dir = $node_primary->archive_dir;
+$node_standby = PostgreSQL::Test::Cluster->new('standby_9');
+$node_standby->init_from_backup(
+	$node_primary, 'my_backup',
+	has_streaming => 1);
+$node_standby->enable_archiving($archive_dir);
+$node_standby->start;
+# Wait until necessary replay has been done on standby
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->promote;
+$node_standby->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(7001,8000))");
+# Force archiving of WAL file
+my $last_wal = $node_standby->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_switch_wal())");
+# Wait until this WAL file has been archived
+my $check_archive = "SELECT last_archived_wal >= '$last_wal' FROM pg_stat_archiver";
+$node_standby->poll_query_until('postgres', $check_archive)
+	or die "Timed out while waiting for $last_wal file archive";
+$node_standby->stop;
+
+# Another standby, whose recovery target LSN will be in a WAL file with
+# a different TLI than the one the target LSN belongs to.
+@recovery_params = (
+	"recovery_target_lsn = '$lsn6'",
+	"recovery_target_inclusive = off");
+test_recovery_standby("target LSN belong to different TLI",
+	'standby_10', $node_primary, \@recovery_params, "7000", $lsn6);
-- 
2.18.0

#10Michael Paquier
michael@paquier.xyz
In reply to: Amul Sul (#9)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

On Thu, Jan 20, 2022 at 12:13:08PM +0530, Amul Sul wrote:

I found the cause of the test failure on Windows -- it is due to the
custom archive_command setting, which wasn't using the correct Windows
archive directory path.

After reading this patch and this thread, I have noticed that you are
testing the same thing as Heikki here, patch 0001:
/messages/by-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi

The patch sent on the other thread has a better description and shape,
so perhaps we'd better drop what is posted here in favor of the other
version. Thoughts?
--
Michael

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#10)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

At Thu, 27 Jan 2022 15:31:37 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Thu, Jan 20, 2022 at 12:13:08PM +0530, Amul Sul wrote:

I found the cause of the test failure on Windows -- it is due to the
custom archive_command setting, which wasn't using the correct Windows
archive directory path.

After reading this patch and this thread, I have noticed that you are
testing the same thing as Heikki here, patch 0001:
/messages/by-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi

The patch sent on the other thread has a better description and shape,
so perhaps we'd better drop what is posted here in favor of the other
version. Thoughts?

pg_switch_wal() is preferable to filling in a large number of records as
the means of advancing segments, because it is stable against changes of
wal_segment_size.

As you said, using has_restoring when initializing node_ptr is simpler
than explicitly setting archive_dir from that of another node.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Amul Sul
sulamul@gmail.com
In reply to: Michael Paquier (#10)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

On Thu, Jan 27, 2022 at 12:01 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Jan 20, 2022 at 12:13:08PM +0530, Amul Sul wrote:

I found the cause of the test failure on Windows -- it is due to the
custom archive_command setting, which wasn't using the correct Windows
archive directory path.

After reading this patch and this thread, I have noticed that you are
testing the same thing as Heikki here, patch 0001:
/messages/by-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi

The patch sent on the other thread has a better description and shape,
so perhaps we'd better drop what is posted here in favor of the other
version. Thoughts?

Yes, I do agree with you. Thanks for the comparison.

Regards,
Amul

#13Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Amul Sul (#12)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

On 28/01/2022 12:10, Amul Sul wrote:

On Thu, Jan 27, 2022 at 12:01 PM Michael Paquier <michael@paquier.xyz> wrote:

After reading this patch and this thread, I have noticed that you are
testing the same thing as Heikki here, patch 0001:
/messages/by-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi

The patch sent on the other thread has a better description and shape,
so perhaps we'd better drop what is posted here in favor of the other
version. Thoughts?

Yes, I do agree with you. Thanks for the comparison.

Pushed the test from that other thread now. Thanks!

- Heikki

#14Amul Sul
sulamul@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: TAP test to cover "EndOfLogTLI != replayTLI" case

On Mon, Feb 14, 2022 at 3:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 28/01/2022 12:10, Amul Sul wrote:

On Thu, Jan 27, 2022 at 12:01 PM Michael Paquier <michael@paquier.xyz> wrote:

After reading this patch and this thread, I have noticed that you are
testing the same thing as Heikki here, patch 0001:
/messages/by-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi

The patch sent on the other thread has a better description and shape,
so perhaps we'd better drop what is posted here in favor of the other
version. Thoughts?

Yes, I do agree with you. Thanks for the comparison.

Pushed the test from that other thread now. Thanks!

Thank you!

Regards,
Amul