pg_rewind tap test unstable

Started by Christoph Bergover 10 years ago6 messages

myon@debian.org

over 10 years ago

Hi,

for something between 10% and 20% of the devel builds for apt.postgresql.org
(which happen every 6h if there's a git change, so it happens every few days),
I'm seeing this:

make[2]: Entering directory '/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/src/bin/pg_rewind'
rm -rf /tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/../src/bin/pg_rewind/tmp_check/log
cd /tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/../src/bin/pg_rewind && TESTDIR='/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/src/bin/pg_rewind' PATH="/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/tmp_install/usr/lib/postgresql/9.6/bin:$PATH" LD_LIBRARY_PATH="/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/tmp_install/usr/lib/x86_64-linux-gnu" PGPORT='65432' top_builddir='/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/src/bin/pg_rewind/../../..' prove -I /tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/../src/test/perl/ --verbose t/*.pl

# Failed test 'tail-copy: query result matches'
# at RewindTest.pm line 131.
# got: '1
# '
# expected: '10001
# '
# Looks like you failed 1 test of 8.
t/001_basic.pl .......
1..8
ok 1 - pg_rewind local
ok 2 - table content: query result matches
ok 3 - truncation: query result matches
not ok 4 - tail-copy: query result matches
ok 5 - pg_rewind remote
ok 6 - table content: query result matches
ok 7 - truncation: query result matches
ok 8 - tail-copy: query result matches
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/8 subtests

I don't have the older logs available, but from memory, the subtest
failing and the two numbers mentioned are always the same.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Michael Paquier

michael.paquier@gmail.com

over 10 years ago

In reply to: Christoph Berg (#1)

Re: pg_rewind tap test unstable

On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote:

for something between 10% and 20% of the devel builds for apt.postgresql.org
(which happen every 6h if there's a git change, so it happens every few days),
I'm seeing this:
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/8 subtests

I don't have the older logs available, but from memory, the subtest
failing and the two numbers mentioned are always the same.

There should be some output logs in src/bin/pg_rewind/tmp_check/log/*?
Could you attach them here if you have them? That would be helpful to
understand what is happening.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Christoph Berg

myon@debian.org

over 10 years ago

In reply to: Michael Paquier (#2)

5 attachment(s)

Re: pg_rewind tap test unstable

Re: Michael Paquier 2015-07-28 <CAB7nPqQCpGy3u7CMfo8sQQUoBSFmEieOhuEsLxwyCC64j3GWxQ@mail.gmail.com>

On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote:

for something between 10% and 20% of the devel builds for apt.postgresql.org
(which happen every 6h if there's a git change, so it happens every few days),
I'm seeing this:
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/8 subtests

I don't have the older logs available, but from memory, the subtest
failing and the two numbers mentioned are always the same.

There should be some output logs in src/bin/pg_rewind/tmp_check/log/*?
Could you attach them here if you have them? That would be helpful to
understand what is happening.

It took me a few attempts to tell the build environment to save a copy
on failure and not shred everything right away. So here we go:

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

Michael Paquier

michael.paquier@gmail.com

over 10 years ago

In reply to: Christoph Berg (#3)

1 attachment(s)

Re: pg_rewind tap test unstable

On Mon, Aug 3, 2015 at 5:35 PM, Christoph Berg <myon@debian.org> wrote:

Re: Michael Paquier 2015-07-28 <CAB7nPqQCpGy3u7CMfo8sQQUoBSFmEieOhuEsLxwyCC64j3GWxQ@mail.gmail.com>

On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote:

for something between 10% and 20% of the devel builds for apt.postgresql.org
(which happen every 6h if there's a git change, so it happens every few days),
I'm seeing this:
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/8 subtests

I don't have the older logs available, but from memory, the subtest
failing and the two numbers mentioned are always the same.

There should be some output logs in src/bin/pg_rewind/tmp_check/log/*?
Could you attach them here if you have them? That would be helpful to
understand what is happening.

It took me a few attempts to tell the build environment to save a copy
on failure and not shred everything right away. So here we go:

In test case 2, the failure happens to be that the standby did not
have the time to replicate the database beforepromotion that has been
created on the master. One possible explanation for this failure is
that the standby has been promoted before all the WAL needed for the
tests has been replayed, hence we had better be sure that the
replay_location of the standby matches pg_current_xlog_location()
before promotion. On the buildfarm, hamster, the legendary slowest
machine of the whole set, does not complain about that. Is your
environment that heavy loaded when you run the tests?

Perhaps the attached patch helps?
--
Michael

Attachments:

20150804_pgrewind_tapfix.patchapplication/x-patch; name=20150804_pgrewind_tapfix.patchDownload

diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index b66ff0d..fce725f 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -232,6 +232,13 @@ sub promote_standby
 	#### Now run the test-specific parts to run after standby has been started
 	# up standby
 
+	# Wait until the standby has caught up with the primary, by polling
+	# pg_stat_replication.
+	my $caughtup_query =
+"SELECT pg_current_xlog_location() = replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+	poll_query_until($caughtup_query, $connstr_master)
+	  or die "Timed out while waiting for standby to catch up";
+
 	# Now promote slave and insert some new data on master, this will put
 	# the master out-of-sync with the standby. Wait until the standby is
 	# out of recovery mode, and is ready to accept read-write connections.

Noah Misch

noah@leadboat.com

over 10 years ago

In reply to: Michael Paquier (#4)

1 attachment(s)

Re: pg_rewind tap test unstable

On Tue, Aug 04, 2015 at 02:21:16PM +0900, Michael Paquier wrote:

On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote:

for something between 10% and 20% of the devel builds for apt.postgresql.org
(which happen every 6h if there's a git change, so it happens every few days),
I'm seeing this:

In test case 2, the failure happens to be that the standby did not
have the time to replicate the database beforepromotion that has been
created on the master. One possible explanation for this failure is
that the standby has been promoted before all the WAL needed for the
tests has been replayed, hence we had better be sure that the
replay_location of the standby matches pg_current_xlog_location()
before promotion.

Perhaps the attached patch helps?

Thanks. In light of your diagnosis, I can reliably reproduce the failure by
injecting a sleep into XLogSendPhysical(). Your patch fixes the problem, but
it adds wal_receiver_status_interval (= 10s) stalls, doubling
src/bin/pg_rewind/t/001_basic.pl runtime on a fast system. (The standby
applies the final WAL quickly, then sleeps for wal_receiver_status_interval
before notifying the master.) The standby will apply any written, unapplied
WAL during promotion. Therefore, I plan to commit the attached
performance-neutral variant of your patch.

Attachments:

pgrewind_tapfix-v2.patchtext/plain; charset=us-asciiDownload

diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 22e5cae..a4c1737 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -222,12 +222,8 @@ recovery_target_timeline='latest'
 				   '-l', "$log_path/standby.log",
 				   '-o', "-p $port_standby", 'start');
 
-	# Wait until the standby has caught up with the primary, by polling
-	# pg_stat_replication.
-	my $caughtup_query =
-"SELECT pg_current_xlog_location() = replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
-	poll_query_until($caughtup_query, $connstr_master)
-	  or die "Timed out while waiting for standby to catch up";
+	# The standby may have WAL to apply before it matches the primary.  That
+	# is fine, because no test examines the standby before promotion.
 }
 
 sub promote_standby
@@ -235,6 +231,12 @@ sub promote_standby
 	#### Now run the test-specific parts to run after standby has been started
 	# up standby
 
+	# Wait for the standby to receive and write all WAL.
+	my $wal_received_query =
+"SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+	poll_query_until($wal_received_query, $connstr_master)
+	  or die "Timed out while waiting for standby to receive and write WAL";
+
 	# Now promote slave and insert some new data on master, this will put
 	# the master out-of-sync with the standby. Wait until the standby is
 	# out of recovery mode, and is ready to accept read-write connections.

Michael Paquier

michael.paquier@gmail.com

over 10 years ago

In reply to: Noah Misch (#5)

Re: pg_rewind tap test unstable

On Mon, Sep 7, 2015 at 1:16 PM, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 04, 2015 at 02:21:16PM +0900, Michael Paquier wrote:

On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote:

for something between 10% and 20% of the devel builds for apt.postgresql.org
(which happen every 6h if there's a git change, so it happens every few days),
I'm seeing this:

In test case 2, the failure happens to be that the standby did not
have the time to replicate the database beforepromotion that has been
created on the master. One possible explanation for this failure is
that the standby has been promoted before all the WAL needed for the
tests has been replayed, hence we had better be sure that the
replay_location of the standby matches pg_current_xlog_location()
before promotion.

Perhaps the attached patch helps?

Thanks. In light of your diagnosis, I can reliably reproduce the failure by
injecting a sleep into XLogSendPhysical(). Your patch fixes the problem, but
it adds wal_receiver_status_interval (= 10s) stalls, doubling
src/bin/pg_rewind/t/001_basic.pl runtime on a fast system. (The standby
applies the final WAL quickly, then sleeps for wal_receiver_status_interval
before notifying the master.)

Indeed, thanks for double-checking.

The standby will apply any written, unapplied
WAL during promotion. Therefore, I plan to commit the attached
performance-neutral variant of your patch.

Explaining the use of write_location. This looks fine to me. Thanks again.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers