Adding a TAP test checking data consistency on standby with minRecoveryPoint

Started by Michael Paquierabout 7 years ago19 messages

michael@paquier.xyz

about 7 years ago

1 attachment(s)

Hi all,

While working on a fix for c186ba13 which fixes the way minRecoveryPoint
is updated for other processes than the startup process, I was
struggling about being able to get that into a reproducible test case.

I have been thinking what Andrew Gierth has mentioned yesterday, and
roughly designed a test case mentioned here which is able to see the
problem:
/messages/by-id/20181107044915.GF1677@paquier.xyz

I have also been trying to shape that into a TAP test which can be added
into the in-core recovery test suite, and it happens that the part which
scans if all the pages of a relation are not newer than what
minRecoveryPoint is set to in the control file can be easily calculated
by using pageinspect and pg_control_recovery() with a simple SQL query.
So, digging into it, I have been able to get a reproducible TAP test
case which is in the patch attached.

On HEAD, if you revert c186ba13 and then trigger the test the
inconsistency shows up immediately. Keeping the fix makes the test
pass.

This test suite will make sure that we don't break again how
minRecoveryPoint is handled across multiple processes, so I think that
it would be a good addition for HEAD and the future.

Thoughts?
--
Michael

Attachments:

recovery-min-lsn-tap.patchtext/x-diff; charset=us-asciiDownload

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index daf79a0b1f..0364a927c3 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-------------------------------------------------------------------------
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pageinspect
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/016_min_consistency.pl b/src/test/recovery/t/016_min_consistency.pl
new file mode 100644
index 0000000000..b6d06b2ad3
--- /dev/null
+++ b/src/test/recovery/t/016_min_consistency.pl
@@ -0,0 +1,92 @@
+# Test for checking consistency of on-disk pages for a cluster with
+# the minimum recovery LSN, ensuring that the updates happen across
+# all processes.  In this test, the startup process and the checkpoint,
+# triggering the non-startup code paths, are both checked.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $primary = get_new_node('primary');
+$primary->init(allows_streaming => 1);
+
+# Set shared_buffers to a very low value to enforce discard and flush
+# of PostgreSQL buffers on standby, enforcing other processes than the
+# startup process to update the minimum recovery LSN in the control
+# file.
+$primary->append_conf("postgresql.conf", <<EOF);
+shared_buffers = 128kB
+EOF
+
+# Start the primary
+$primary->start;
+
+# setup/start a standby
+$primary->backup('bkp');
+my $standby = get_new_node('standby');
+$standby->init_from_backup($primary, 'bkp', has_streaming => 1);
+$standby->start;
+
+# Dummy table and extension creation for the upcoming tests.
+$primary->safe_psql('postgres',
+	'create extension pageinspect');
+$primary->safe_psql('postgres',
+	'create table test1 (a int) with (fillfactor = 10)');
+$primary->safe_psql('postgres', 'insert into test1 select generate_series(1, 10000)');
+
+# Take a checkpoint and enforce post-checkpoint full page writes
+# which makes the startup process replay those pages, updating
+# minRecoveryPoint.
+$primary->safe_psql('postgres', 'checkpoint');
+$primary->safe_psql('postgres', 'update test1 set a = a + 1');
+
+# Fill in the standby's shared buffers with the data filled in
+# previously.
+$standby->safe_psql('postgres', 'select count(*) from test1');
+
+# Update the table again, this does not generate full page writes so
+# the standby will replay records associated with it, but the startup
+# process will not flush those pages.
+$primary->safe_psql('postgres', 'update test1 set a = a + 1');
+
+# Extract from the relation the last block created, this will be
+# used at the end of the test for sanity checks.
+my $last_block;
+$primary->psql('postgres',
+	"select pg_relation_size('test1')::int / setting::int - 1 from pg_settings where name = 'block_size';",
+	stdout => \$last_block);
+
+# Wait for last record to have been replayed on the standby.
+$primary->wait_for_catchup($standby, 'replay',
+						   $primary->lsn('insert'));
+
+# Issue a restart point on the standby now, which makes the checkpointer
+# update minRecoveryPoint.
+$standby->safe_psql('postgres', 'checkpoint');
+
+# Now shut down the primary violently so as the standby does not
+# receive the shutdown checkpoint, making sure that the startup
+# process does not flush any pages on its side.  The standby is
+# cleanly stopped, which makes the checkpointer update minRecoveryPoint
+# with the restart point created at shutdown.
+$primary->stop('immediate');
+$standby->stop('fast');
+
+# Now restart the standby and check the state of the instance. All
+# the pages of the relation previously created should not have a
+# LSN newer than what minRecoveryPoint has.
+$standby->start;
+
+# Check that the last page of the table, which is the last one which
+# has been flushed by the previous checkpoint on the standby, does not
+# have a LSN newer than minRecoveryPoint.
+my $psql_out;
+$standby->psql(
+	'postgres',
+	"SELECT lsn <= min_recovery_end_lsn from page_header(get_raw_page('test1', $last_block)), pg_control_recovery()",
+	stdout => \$psql_out);
+is($psql_out, 't',
+   "Check that table data is consistent with minRecoveryPoint");

Andrew Gierth

andrew@tao11.riddles.org.uk

about 7 years ago

In reply to: Michael Paquier (#1)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

"Michael" == Michael Paquier <michael@paquier.xyz> writes:

Michael> I have also been trying to shape that into a TAP test which
Michael> can be added into the in-core recovery test suite, and it
Michael> happens that the part which scans if all the pages of a
Michael> relation are not newer than what minRecoveryPoint is set to in
Michael> the control file can be easily calculated by using pageinspect
Michael> and pg_control_recovery() with a simple SQL query.

How? It's OK (and normal) for in-core pages to have newer LSNs than
minRecoveryPoint, it's only on-disk pages that must not be more recent
than that. And pageinspect will show the in-core page...

--
Andrew (irc:RhodiumToad)

Michael Paquier

michael@paquier.xyz

about 7 years ago

In reply to: Andrew Gierth (#2)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Thu, Nov 08, 2018 at 05:04:35AM +0000, Andrew Gierth wrote:

How? It's OK (and normal) for in-core pages to have newer LSNs than
minRecoveryPoint, it's only on-disk pages that must not be more recent
than that. And pageinspect will show the in-core page...

If the standby is stopped cleanly once, what's in shared buffers gets
flushed to disk by the checkpointer once at shutdown, forcing
minRecoveryPoint to be updated by the checkpointer, and that update was
not happening beforethe fix. And once the standby is started again,
what's on disk gets reloaded, showing the inconsistency.
--
Michael

Andrew Gierth

andrew@tao11.riddles.org.uk

about 7 years ago

In reply to: Michael Paquier (#3)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

"Michael" == Michael Paquier <michael@paquier.xyz> writes:

How? It's OK (and normal) for in-core pages to have newer LSNs than
minRecoveryPoint, it's only on-disk pages that must not be more
recent than that. And pageinspect will show the in-core page...

Michael> If the standby is stopped cleanly once, what's in shared
Michael> buffers gets flushed to disk by the checkpointer once at
Michael> shutdown, forcing minRecoveryPoint to be updated by the
Michael> checkpointer, and that update was not happening beforethe fix.
Michael> And once the standby is started again, what's on disk gets
Michael> reloaded, showing the inconsistency.

Ah, I missed that you were shutting down.

I think it would be nice to have a test that didn't rely on that, too.

--
Andrew (irc:RhodiumToad)

Michael Paquier

michael@paquier.xyz

about 7 years ago

In reply to: Andrew Gierth (#4)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Thu, Nov 08, 2018 at 06:00:29AM +0000, Andrew Gierth wrote:

I think it would be nice to have a test that didn't rely on that, too.

Yes, I don't disagree with you and I thought about it. Fetching the
value from the control file is easy, doing the comparison between two
LSNs is also simple by doing it directly with pg_lsn in the database
(and I don't want to add math logics about LSNs as a TAP API). Now I am
less sure about how portable it is possible to make the read of 8 bytes
on the page header for the last page of a relation portable across many
architectures in perl.

And I am not sure that we actually need this addition that as the
standby spawned in the TAP test will not have any clients scanning the
pages and generating WAL, except autovacuum which would be better
switched off in the test.
--
Michael

Michael Paquier

michael@paquier.xyz

about 7 years ago

In reply to: Michael Paquier (#5)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Thu, Nov 08, 2018 at 04:00:26PM +0900, Michael Paquier wrote:

Yes, I don't disagree with you and I thought about it. Fetching the
value from the control file is easy, doing the comparison between two
LSNs is also simple by doing it directly with pg_lsn in the database
(and I don't want to add math logics about LSNs as a TAP API). Now I am
less sure about how portable it is possible to make the read of 8 bytes
on the page header for the last page of a relation portable across many
architectures in perl.

Has a perl monk a specific idea here? It seems to me that the amount of
specific facility which would be needed overweights the current
simplicity of the test.
--
Michael

Andrew Gierth

andrew@tao11.riddles.org.uk

about 7 years ago

In reply to: Michael Paquier (#6)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

"Michael" == Michael Paquier <michael@paquier.xyz> writes:

On Thu, Nov 08, 2018 at 04:00:26PM +0900, Michael Paquier wrote:

Yes, I don't disagree with you and I thought about it. Fetching the
value from the control file is easy, doing the comparison between
two LSNs is also simple by doing it directly with pg_lsn in the
database (and I don't want to add math logics about LSNs as a TAP
API). Now I am less sure about how portable it is possible to make
the read of 8 bytes on the page header for the last page of a
relation portable across many architectures in perl.

Michael> Has a perl monk a specific idea here? It seems to me that the
Michael> amount of specific facility which would be needed overweights
Michael> the current simplicity of the test.

Seems simple enough - the LSN on the page is actually two 4-byte values
with the most significant one first, regardless of platform byte order
(though each individual word is in native order), so

my ($hi,$lo) = unpack("LL", $buf);

should suffice. ("L" is always 32 bits regardless of platform, and it
has the platform's endianness.)

Looking only at the last page seems questionable. Something like this
should work to return the largest LSN of any page in the specified list
of files:

# find_largest_lsn(blocksize,filenames...)
sub find_largest_lsn
{
my $blocksize = shift;
my @filenames = ref($_[0]) eq 'ARRAY' ? @{$_[0]} : @_;
my ($max_hi,$max_lo) = (0,0);
for my $filename (@filenames)
{
open(my $fh, "<:raw", $filename)
or die "failed to open $filename: $!";
my ($buf,$len);
while ($len = read($fh, $buf, $blocksize))
{
$len == $blocksize
or die "read only $len of $blocksize bytes from $filename";
my ($hi,$lo) = unpack("LL", $buf);
($max_hi,$max_lo) = ($hi,$lo)
if ($hi > $max_hi or ($hi == $max_hi and $lo > $max_lo));
}
defined($len) or die "read error on $filename: $!";
close($fh);
}
return sprintf("%X/%X", $max_hi, $max_lo);
}

--
Andrew (irc:RhodiumToad)

Michael Paquier

michael@paquier.xyz

about 7 years ago

In reply to: Andrew Gierth (#7)

1 attachment(s)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Fri, Nov 09, 2018 at 12:42:27PM +0000, Andrew Gierth wrote:

Seems simple enough - the LSN on the page is actually two 4-byte values
with the most significant one first, regardless of platform byte order
(though each individual word is in native order), so

my ($hi,$lo) = unpack("LL", $buf);

should suffice. ("L" is always 32 bits regardless of platform, and it
has the platform's endianness.)

Thanks for the review, Andrew. And I completely forgot that this is at
the beginning of the page.

Looking only at the last page seems questionable.

I have switched also the online check so as it also looks at the full
range of blocks instead of only the last one.

Something like this should work to return the largest LSN of any page
in the specified list of files:

# find_largest_lsn(blocksize,filenames...)
sub find_largest_lsn

Thanks! I am stealing that stuff, and I have added an offline check by
comparing the value of minRecoveryPoint in pg_controldata. Again, if
you revert c186ba13 and run the tests, both online and offline failures
are showing up.

What do you think?
--
Michael

Attachments:

recovery-min-lsn-tap-v2.patchtext/x-diff; charset=us-asciiDownload

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index b9cb51b9d3..ce59401cef 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -36,6 +36,7 @@ our @EXPORT = qw(
   system_or_bail
   system_log
   run_log
+  run_command
 
   command_ok
   command_fails
@@ -203,6 +204,16 @@ sub run_log
 	return IPC::Run::run(@_);
 }
 
+sub run_command
+{
+	my ($cmd) = @_;
+	my ($stdout, $stderr);
+	my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+	chomp($stdout);
+	chomp($stderr);
+	return ($stdout, $stderr);
+}
+
 # Generate a string made of the given range of ASCII characters
 sub generate_ascii_string
 {
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index daf79a0b1f..0364a927c3 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-------------------------------------------------------------------------
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pageinspect
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/016_min_consistency.pl b/src/test/recovery/t/016_min_consistency.pl
new file mode 100644
index 0000000000..8909226bf5
--- /dev/null
+++ b/src/test/recovery/t/016_min_consistency.pl
@@ -0,0 +1,174 @@
+# Test for checking consistency of on-disk pages for a cluster with
+# the minimum recovery LSN, ensuring that the updates happen across
+# all processes.  In this test, the updates from the startup process
+# and the checkpointer (which triggers non-startup code paths) are
+# both checked.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Find the largest LSN in the set of pages part of the given relation
+# file.  This is used for offline checks of page consistency.  The LSN
+# is historically stored as a set of two numbers of 4 byte-length
+# located at the beginning of each page.
+sub find_largest_lsn
+{
+	my $blocksize = int(shift);
+	my $filename = shift;
+	my ($max_hi,$max_lo) = (0,0);
+	open(my $fh, "<:raw", $filename)
+	  or die "failed to open $filename: $!";
+	my ($buf,$len);
+	while ($len = read($fh, $buf, $blocksize))
+	{
+		$len == $blocksize
+		  or die "read only $len of $blocksize bytes from $filename";
+		my ($hi,$lo) = unpack("LL", $buf);
+
+		if ($hi > $max_hi or ($hi == $max_hi and $lo > $max_lo))
+		{
+			($max_hi,$max_lo) = ($hi,$lo);
+		}
+	}
+	defined($len) or die "read error on $filename: $!";
+	close($fh);
+
+	return sprintf("%X/%X", $max_hi, $max_lo);
+}
+
+# Initialize primary node
+my $primary = get_new_node('primary');
+$primary->init(allows_streaming => 1);
+
+# Set shared_buffers to a very low value to enforce discard and flush
+# of PostgreSQL buffers on standby, enforcing other processes than the
+# startup process to update the minimum recovery LSN in the control
+# file.  Autovacuum is disabled so as there is no risk of having other
+# processes than the checkpointer doing page flushes.
+$primary->append_conf("postgresql.conf", <<EOF);
+shared_buffers = 128kB
+autovacuum = off
+EOF
+
+# Start the primary
+$primary->start;
+
+# setup/start a standby
+$primary->backup('bkp');
+my $standby = get_new_node('standby');
+$standby->init_from_backup($primary, 'bkp', has_streaming => 1);
+$standby->start;
+
+# Object creations for the upcoming tests:
+# - Base table whose data consistency is checked.
+# - pageinspect to look at the page-level contents.
+# - Function wrapper on top of pageinspect to scan a range of pages and
+#   get the maximum LSN present.
+$primary->safe_psql('postgres', "
+create extension pageinspect;
+-- Function wrapper on top of pageinspect which fetches the largest LSN
+-- present in the given page range.
+create or replace function max_lsn_range(relname text,
+  start_blk int,
+  end_blk int)
+returns pg_lsn as \$\$
+declare
+  max_lsn pg_lsn = '0/0'::pg_lsn;
+  cur_lsn pg_lsn;
+begin
+  for i in start_blk..end_blk loop
+    execute 'select lsn from page_header(get_raw_page(''' || relname || ''',' || i || '));' into cur_lsn;
+    if max_lsn < cur_lsn then
+      max_lsn = cur_lsn;
+    end if;
+  end loop;
+  return max_lsn;
+end;
+\$\$ language plpgsql;
+create table test1 (a int) with (fillfactor = 10);
+insert into test1 select generate_series(1, 10000);");
+
+# Take a checkpoint and enforce post-checkpoint full page writes
+# which makes the startup process replay those pages, updating
+# minRecoveryPoint.
+$primary->safe_psql('postgres', 'checkpoint');
+$primary->safe_psql('postgres', 'update test1 set a = a + 1');
+
+# Fill in the standby's shared buffers with the data filled in
+# previously.
+$standby->safe_psql('postgres', 'select count(*) from test1');
+
+# Update the table again, this does not generate full page writes so
+# the standby will replay records associated with it, but the startup
+# process will not flush those pages.
+$primary->safe_psql('postgres', 'update test1 set a = a + 1');
+
+# Extract from the relation the last block created and its relation
+# file, this will be used at the end of the test for sanity checks.
+my $blocksize = $primary->safe_psql('postgres',
+	"select setting::int from pg_settings where name = 'block_size';");
+my $last_block = $primary->safe_psql('postgres',
+	"select pg_relation_size('test1')::int / $blocksize - 1");
+my $relfilenode = $primary->safe_psql('postgres',
+	"select pg_relation_filepath('test1'::regclass);");
+
+# Wait for last record to have been replayed on the standby.
+$primary->wait_for_catchup($standby, 'replay',
+						   $primary->lsn('insert'));
+
+# Issue a restart point on the standby now, which makes the checkpointer
+# update minRecoveryPoint.
+$standby->safe_psql('postgres', 'checkpoint');
+
+# Now shut down the primary violently so as the standby does not
+# receive the shutdown checkpoint, making sure that the startup
+# process does not flush any pages on its side.  The standby is
+# cleanly stopped, which makes the checkpointer update minRecoveryPoint
+# with the restart point created at shutdown.
+$primary->stop('immediate');
+$standby->stop('fast');
+
+# Check the data consistency of the instance while offline.  This is
+# done by directly scanning the on-disk relation blocks and what
+# pg_controldata lets know.
+my $standby_data = $standby->data_dir;
+my $offline_max_lsn = find_largest_lsn($blocksize,
+									   "$standby_data/$relfilenode");
+
+# Fetch minRecoveryPoint from the control file itself
+my ($stdout, $stderr) = run_command(['pg_controldata', $standby_data]);
+my @control_data = split("\n", $stdout);
+my $offline_recovery_lsn = undef;
+foreach (@control_data)
+{
+	if ($_ =~ /^Minimum recovery ending location:\s*(.*)$/mg)
+	{
+		$offline_recovery_lsn = $1;
+		last;
+	}
+}
+die "No minRecoveryPoint in control file found\n"
+	unless defined($offline_recovery_lsn);
+
+# minRecoveryPoint should never be
+ok($offline_recovery_lsn ge $offline_max_lsn,
+   "Check offline that table data is consistent with minRecoveryPoint");
+
+# Now restart the standby and check the state of the instance while
+# online.  Again, all the pages of the relation previously created
+# should not have a LSN newer than what minRecoveryPoint has.
+$standby->start;
+
+# Check that the last page of the table, which is the last one which
+# has been flushed by the previous checkpoint on the standby, does not
+# have a LSN newer than minRecoveryPoint.
+my $psql_out;
+$standby->psql(
+	'postgres',
+	"SELECT max_lsn_range('test1', 0, $last_block) <= min_recovery_end_lsn from pg_control_recovery()",
+	stdout => \$psql_out);
+is($psql_out, 't',
+   "Check online that table data is consistent with minRecoveryPoint");

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Michael Paquier (#8)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Mon, Nov 12, 2018 at 01:08:38PM +0900, Michael Paquier wrote:

Thanks! I am stealing that stuff, and I have added an offline check by
comparing the value of minRecoveryPoint in pg_controldata. Again, if
you revert c186ba13 and run the tests, both online and offline failures
are showing up.

What do you think?

I think that this test has some value, and it did not get any reviews,
so I am moving it to the next CF.
--
Michael

#10

Georgios Kokolatos

gkokolatos@pm.me

almost 7 years ago

In reply to: Michael Paquier (#9)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

Hi,

I applied the patch on current master and run the tests, but I am afraid that the newly introduced test failed on installcheck-world:

```t/016_min_consistency.pl ............. # Looks like your test exited with 29 before it could output anything.
t/016_min_consistency.pl ............. Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 2/2 subtests

Test Summary Report
-------------------
t/016_min_consistency.pl (Wstat: 7424 Tests: 0 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 2 tests but ran 0.
Files=16, Tests=143, 65 wallclock secs ( 0.04 usr 0.04 sys + 6.53 cusr 5.08 csys = 11.69 CPU)
Result: FAIL```

To be honest, I have not checked closely on the failure, still it is the only test failing which by itself should be worthwhile mentioning.

#11

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Georgios Kokolatos (#10)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Mon, Mar 11, 2019 at 02:35:49PM +0000, Georgios Kokolatos wrote:

To be honest, I have not checked closely on the failure, still it is
the only test failing which by itself should be worthwhile
mentioning.

This works for me, as well as a plain installcheck and check from the
subpath. Are you sure that contrib/pageinspect is installed in the
set of binaries your tests are using? What do the logs of the server
tell you? These are located in src/test/recovery/tmp_check/. Please
note that when running installcheck we assume that all the needed
components are installed.
--
Michael

#12

Arthur Zakirov

a.zakirov@postgrespro.ru

almost 7 years ago

In reply to: Michael Paquier (#11)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On 12.03.2019 08:19, Michael Paquier wrote:

On Mon, Mar 11, 2019 at 02:35:49PM +0000, Georgios Kokolatos wrote:

To be honest, I have not checked closely on the failure, still it is
the only test failing which by itself should be worthwhile
mentioning.

This works for me, as well as a plain installcheck and check from the
subpath. Are you sure that contrib/pageinspect is installed in the
set of binaries your tests are using? What do the logs of the server
tell you? These are located in src/test/recovery/tmp_check/. Please
note that when running installcheck we assume that all the needed
components are installed.

On my environment installcheck passes as well as check. Also I ran tests
with reverted c186ba13, 016_min_consistency.pl doesn't pass as expected.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#13

Georgios Kokolatos

gkokolatos@pm.me

almost 7 years ago

In reply to: Michael Paquier (#11)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

Hi,

you are right in saying that my comment didn't offer much of a constructive explanation.
Apologies for that.

To the issue at hand. Tests were run in the same manner as in all other cases and the test
in question was the only one to fail in the whole tree.

By looking a bit closer to the error, the culprit seems to be the missing installation of
`pageinspect`, as it was correctly pointed out. I did notice before sending the first message in
the thread that `contrib/pageinspect` was added to the `EXTRA_INSTALL` variable and I
*assumed* that it should be properly installed to the system. However that was not the case.

Without being a recursive Makefile guru, I noticed one has to explicitly
$(call recurse,checkprep, $(recurse_alldirs_targets))
in `src/test/Makefile` in order for the call to `checkprep` to be made for running
make check or installcheck or any other of the variants.

Having said all the above, and after spending more time reviewing the whole patch
I do think it is worthwhile committing, as long as it is not tripping any users up with unexpected
test errors due to incomplete installations.

I hope this comment helps a bit more.

#14

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Arthur Zakirov (#12)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Tue, Mar 12, 2019 at 01:16:14PM +0300, Arthur Zakirov wrote:

On my environment installcheck passes as well as check. Also I ran tests
with reverted c186ba13, 016_min_consistency.pl doesn't pass as expected.

Thanks all for the feedback! I have done an extra pass on the new,
tweaking some sentences in the comments, and committed it.
--
Michael

#15

Peter Geoghegan

pg@bowt.ie

almost 7 years ago

In reply to: Michael Paquier (#14)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Wed, Mar 13, 2019 at 12:22 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Mar 12, 2019 at 01:16:14PM +0300, Arthur Zakirov wrote:

On my environment installcheck passes as well as check. Also I ran tests
with reverted c186ba13, 016_min_consistency.pl doesn't pass as expected.

Seems like there might be a problem either caused by or detected by
016_min_consistency.pl on piculet:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59

--
Peter Geoghegan

#16

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Peter Geoghegan (#15)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Sat, Mar 23, 2019 at 04:08:42PM -0700, Peter Geoghegan wrote:

Seems like there might be a problem either caused by or detected by
016_min_consistency.pl on piculet:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59

Interesting. Based on what regress_log_016_min_consistency tells,
the test attempts to stop the standby in fast mode but it fails
because of a timeout:
### Stopping node "standby" using mode fast
[...]
pg_ctl: server does not shut down
Bail out! system pg_ctl failed

There is only one place in the tests where that happens, and before
attempting to stop the standby we issue a checkpoint on it with its
primary killed:
# Issue a restart point on the standby now, which makes the checkpointer
# update minRecoveryPoint.
$standby->safe_psql('postgres', 'CHECKPOINT;');
[...]
$primary->stop('immediate');
$standby->stop('fast');

The failure is a bit weird, as I would expect all those three actions
to be sequential. piculet is the only failure happening on the
buildfarm and it uses --disable-atomics, so I am wondering if that is
related and if 0dfe3d0 is part of that. With a primary/standby set,
it could be possible to test that scenario pretty easily. I'll give
it a shot.
--
Michael

#17

Michael Paquier

michael@paquier.xyz

over 6 years ago

In reply to: Michael Paquier (#16)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Sun, Mar 24, 2019 at 09:47:58PM +0900, Michael Paquier wrote:

The failure is a bit weird, as I would expect all those three actions
to be sequential. piculet is the only failure happening on the
buildfarm and it uses --disable-atomics, so I am wondering if that is
related and if 0dfe3d0 is part of that. With a primary/standby set,
it could be possible to test that scenario pretty easily. I'll give
it a shot.

The buildfarm has just failed with a similar failure, for another
test aka 009_twophase.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01

I think that we have race conditions with checkpointing and shutdown
on HEAD.
--
Michael

#18

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Michael Paquier (#17)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Tue, Apr 16, 2019 at 2:45 AM Michael Paquier <michael@paquier.xyz> wrote:

I think that we have race conditions with checkpointing and shutdown
on HEAD.

Any idea whether it's something newly-introduced or of long standing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19

Michael Paquier

michael@paquier.xyz

over 6 years ago

In reply to: Robert Haas (#18)

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

On Thu, Apr 18, 2019 at 03:16:17PM -0400, Robert Haas wrote:

On Tue, Apr 16, 2019 at 2:45 AM Michael Paquier <michael@paquier.xyz> wrote:

I think that we have race conditions with checkpointing and shutdown
on HEAD.

Any idea whether it's something newly-introduced or of long standing?

I would suggest to keep the discussion on the more general thread I
have created a couple of days ago:
/messages/by-id/20190416070119.GK2673@paquier.xyz

Thanks,
--
Michael