Fixing WAL instability in various TAP tests
Hackers,
A few TAP tests in the project appear to be sensitive to reductions of the PostgresNode's max_wal_size setting, resulting in tests failing due to wal files having been removed too soon. The failures in the logs typically are of the "requested WAL segment %s has already been removed" variety. I would expect tests which fail under legal alternate GUC settings to be hardened to explicitly set the GUCs as they need, rather than implicitly relying on the defaults. As far as missing WAL files go, I would expect the TAP test to prevent this with the use of replication slots or some other mechanism, and not simply to rely on checkpoints not happening too soon. I'm curious if others on this list disagree with that point of view.
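For the first kind of hardening, it is enough for a test to pin the settings it depends on instead of inheriting whatever the harness supplies. A minimal sketch using the existing PostgresNode API (the node name and values here are arbitrary, not taken from any particular test):

use PostgresNode;

my $node = PostgresNode->new('primary');
$node->init(allows_streaming => 1);

# Pin the GUCs the test depends on rather than trusting the defaults.
$node->append_conf('postgresql.conf', q{
max_wal_size = 256MB
wal_keep_size = 256MB
});
$node->start;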
Failures in src/test/recovery/t/015_promotion_pages.pl can be fixed by creating a physical replication slot on node "alpha" and using it from node "beta", a technique already used in other TAP tests and apparently merely overlooked in this one.
The first two tests in src/bin/pg_basebackup/t fail, and it's not clear that physical replication slots are the appropriate solution, since no replication is happening. It's not immediately obvious that the tests are at fault anyway. On casual inspection, it seems they might be detecting a live bug which simply doesn't manifest under larger values of max_wal_size. Test 010 appears to show a bug with `pg_basebackup -X`, and test 020 with `pg_receivewal`.
The test in contrib/bloom/t/ is deliberately disabled in contrib/bloom/Makefile with a comment that it is unstable in the buildfarm, but when I chased down the email thread that gave rise to the related commit, I didn't find anything explaining what exactly those buildfarm failures were. That test happens to be stable on my laptop until I change the GUC settings to both reduce max_wal_size to 32MB and set wal_consistency_checking=all.
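For anyone who wants to reproduce that locally, the destabilizing settings amount to nothing more than the following, applied to the test's primary node before the workload runs (a reproduction sketch, not part of any proposed patch):

# Settings that destabilize contrib/bloom/t/001_wal.pl for me.
$node_primary->append_conf('postgresql.conf', q{
max_wal_size = 32MB
wal_consistency_checking = all
});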
Thoughts?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 24, 2021 at 05:33:13PM -0700, Mark Dilger wrote:
A few TAP tests in the project appear to be sensitive to reductions of the
PostgresNode's max_wal_size setting, resulting in tests failing due to wal
files having been removed too soon. The failures in the logs typically are
of the "requested WAL segment %s has already been removed" variety. I would
expect tests which fail under legal alternate GUC settings to be hardened to
explicitly set the GUCs as they need, rather than implicitly relying on the
defaults.
That is not the general practice in PostgreSQL tests today. The buildfarm
exercises some settings, so we keep the tests clean for those. Coping with
max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
hardening tests against all possible settings. On the plus side, it would let
us run one buildfarm member that sets every setting to its min_val or
enumvals[1] and another member that elects enumvals[cardinality(enumvals)] or
max_val. We'd catch some real bugs that way. On the minus side, every
nontrivial test suite addition would need to try those two cases before commit
or risk buildfarm wrath. I don't know whether the bugs found would pay for
that trouble. (There's also a less-important loss around the ability to
exercise a setting and manually inspect the results. For example, I sometimes
test parallel_tuple_cost=0 parallel_setup_cost=0 and confirm a lack of
crashes. After hardening, that would require temporary source code edits to
remove the hardening. That's fine, though.)
On Sep 24, 2021, at 10:21 PM, Noah Misch <noah@leadboat.com> wrote:
I would
expect tests which fail under legal alternate GUC settings to be hardened to
explicitly set the GUCs as they need, rather than implicitly relying on the
defaults.
That is not the general practice in PostgreSQL tests today. The buildfarm
exercises some settings, so we keep the tests clean for those. Coping with
max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
hardening tests against all possible settings.
Leaving the tests brittle wastes developer time.
I ran into this problem when I changed the storage underlying bloom indexes and ran the contrib/bloom/t/001_wal.pl test with wal_consistency_checking=all. That caused the test to fail with errors about missing wal files, and it took time to backtrack and see that the test fails under this setting even before applying my storage layer changes. Ordinarily, failures about missing wal files would have led me to suspect the TAP test sooner, but since I had mucked around with storage and wal it initially seemed plausible that my code changes were the problem. The real problem is that a replication slot is not used in the test.
The failure in src/test/recovery/t/015_promotion_pages.pl is also that a replication slot should be used but is not.
The failure in src/bin/pg_basebackup/t/010_pg_basebackup.pl stems from not heeding the documented requirement for pg_basebackup -X fetch that the wal_keep_size "be set high enough that the required log data is not removed before the end of the backup". It's just assuming that it will be, because that tends to be true under default GUC settings. I think this can be fixed by setting wal_keep_size=<SOMETHING_BIG_ENOUGH>, but (a) you say this is not the general practice in PostgreSQL tests today, and (b) there doesn't seem to be any principled way to decide what value would be big enough. Sure, we can use something that is big enough in practice, and we'll probably have to go with that, but it feels like we're just papering over the problem.
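Concretely, that would mean something like the following in the test's setup, with the value chosen by guesswork rather than by any principle:

# Guessed-at headroom so pg_basebackup -X fetch can still find its WAL.
$node->append_conf('postgresql.conf', 'wal_keep_size = 1GB');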
I'm inclined to guess that the problem in src/bin/pg_basebackup/t/020_pg_receivewal.pl is similar.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
On Sep 24, 2021, at 10:21 PM, Noah Misch <noah@leadboat.com> wrote:
I would
expect tests which fail under legal alternate GUC settings to be hardened to
explicitly set the GUCs as they need, rather than implicitly relying on the
defaults.
That is not the general practice in PostgreSQL tests today. The buildfarm
exercises some settings, so we keep the tests clean for those. Coping with
max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
hardening tests against all possible settings.
Leaving the tests brittle wastes developer time.
Trying to make them proof against all possible settings would waste
a lot more time, though.
regards, tom lane
On Sep 25, 2021, at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Leaving the tests brittle wastes developer time.
Trying to make them proof against all possible settings would waste
a lot more time, though.
You may be right, but the conversation about "all possible settings" was started by Noah. I was really just talking about tests that depend on wal files not being removed, but taking no action to guarantee that, merely trusting that under default settings they won't be. I can't square that design against other TAP tests that do take measures to prevent wal files being removed. Why is the precaution taken in some tests but not others? If this is intentional, shouldn't some comment in the tests without such precautions explain that choice? Are they intentionally testing that the default GUC wal size settings and wal verbosity won't break the test?
This isn't a rhetorical question:
In src/test/recovery/t/015_promotion_pages.pl, the comments talk about how checkpoints impact what happens on the standby. The test issues an explicit checkpoint on the primary, and again later on the standby, so it is unclear whether that is what the comments refer to, or whether they also refer to implicit expectations about when/if other checkpoints will happen. The test breaks when I change the GUC settings, but I can fix that breakage by adding a replication slot to the test. Have I broken the purpose of the test by doing so, though? Does using a replication slot to force the wal to not be removed early break what the test is designed to check?
The other tests raise similar questions. Is the brittleness intentional?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
You may be right, but the conversation about "all possible settings" was started by Noah. I was really just talking about tests that depend on wal files not being removed, but taking no action to guarantee that, merely trusting that under default settings they won't be. I can't square that design against other TAP tests that do take measures to prevent wal files being removed. Why is the precaution taken in some tests but not others?
If we are doing something about that in some test cases, I'd agree with
doing the same thing in others that need it. It seems more likely to
be an oversight than an intentional test condition.
regards, tom lane
On Sat, Sep 25, 2021 at 08:20:06AM -0700, Mark Dilger wrote:
On Sep 25, 2021, at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Leaving the tests brittle wastes developer time.
Trying to make them proof against all possible settings would waste
a lot more time, though.
You may be right, but the conversation about "all possible settings" was
started by Noah.
You wrote, "I would expect tests which fail under legal alternate GUC settings
to be hardened to explicitly set the GUCs as they need, rather than implicitly
relying on the defaults." I read that as raising the general principle, not
just a narrow argument about max_wal_size. We can discontinue talking about
the general principle and focus on max_wal_size.
I was really just talking about tests that depend on wal
files not being removed, but taking no action to guarantee that, merely
trusting that under default settings they won't be.
As I said, +1 for making tests pass under the min_val of max_wal_size. If you
want to introduce a max_wal_size=2 buildfarm member so it stays that way, +1
for that as well.
Noah Misch <noah@leadboat.com> writes:
On Sat, Sep 25, 2021 at 08:20:06AM -0700, Mark Dilger wrote:
You may be right, but the conversation about "all possible settings" was
started by Noah.
You wrote, "I would expect tests which fail under legal alternate GUC settings
to be hardened to explicitly set the GUCs as they need, rather than implicitly
relying on the defaults." I read that as raising the general principle, not
just a narrow argument about max_wal_size.
As did I.
We can discontinue talking about
the general principle and focus on max_wal_size.
It is worth stopping to think about whether there are adjacent settings
that need similar treatment.
In general, it seems like "premature discarding of WAL segments" is
something akin to "premature timeout" errors, and we've got a pretty
aggressive policy about preventing those. There are a lot of settings
that I'd *not* be in favor of trying to be bulletproof about, because
it doesn't seem worth the trouble; but perhaps this one is.
regards, tom lane
On Sep 25, 2021, at 9:00 AM, Noah Misch <noah@leadboat.com> wrote:
You may be right, but the conversation about "all possible settings" was
started by Noah.
You wrote, "I would expect tests which fail under legal alternate GUC settings
to be hardened to explicitly set the GUCs as they need, rather than implicitly
relying on the defaults." I read that as raising the general principle, not
just a narrow argument about max_wal_size.
In the first draft of my email to Tom, I had language about how the inartful crafting of my original post led Noah to respond as he did, but I couldn't quite figure out how to phrase that without distracting from the main point. I don't think you were (much) offended, but my apologies for any perceived finger-pointing.
I also don't have a problem with your idea of testing in the build farm with some animals having the GUCs set to minimum values and some to maximum and so forth. I like that idea generally, though I don't feel competent to predict how much work that would be to maintain, so I'm just deferring to your and Tom's judgement about that.
My inartful first post was really meant to say, "here is a problem that I perceive about tap tests vis-a-vis wal files, do people disagree with me that this is a problem, and would patches to address the problem be welcome?" I took Tom's response to be, "yeah, go ahead", and am mostly waiting for the weekend to be over to see if anybody else has anything to say about it.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sep 25, 2021, at 11:04 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
I took Tom's response to be, "yeah, go ahead", and am mostly waiting for the weekend to be over to see if anybody else has anything to say about it.
Here is a patch set, one patch per test. The third patch enables its test in the Makefile, which is commented as having been disabled due to the test being unstable in the build farm. Re-enabling the test might be wrong, since the instability might not have been due to WAL being recycled early. I didn't find enough history about why that test was disabled, so trying it in the build farm again is the best I can suggest. Maybe somebody else on this list knows more?
Attachments:
v1-0001-Harden-test-againt-premature-discarding-of-WAL.patch
From 3ce8ff704de4468bba564aa3b3185b0bfdc12670 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 27 Sep 2021 11:19:41 -0700
Subject: [PATCH v1 1/3] Harden test againt premature discarding of WAL
Not using a physical replication slot between the primary and
standby makes the test brittle should default GUC settings someday
be reduced, or WAL volume someday be increased, or some combination
of those two. We happen to get lucky now that the WAL generated by
the test won't be discarded before it is used. Harden the test.
---
src/test/recovery/t/015_promotion_pages.pl | 3 +++
1 file changed, 3 insertions(+)
diff --git a/src/test/recovery/t/015_promotion_pages.pl b/src/test/recovery/t/015_promotion_pages.pl
index 562c4cd3e4..b73622c7df 100644
--- a/src/test/recovery/t/015_promotion_pages.pl
+++ b/src/test/recovery/t/015_promotion_pages.pl
@@ -22,6 +22,8 @@ EOF
# Start the primary
$alpha->start;
+$alpha->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('bravo')");
# setup/start a standby
$alpha->backup('bkp');
@@ -29,6 +31,7 @@ my $bravo = PostgresNode->new('bravo');
$bravo->init_from_backup($alpha, 'bkp', has_streaming => 1);
$bravo->append_conf('postgresql.conf', <<EOF);
checkpoint_timeout=1h
+primary_slot_name = 'bravo'
EOF
$bravo->start;
--
2.21.1 (Apple Git-122.3)
v1-0002-Harden-test-againt-premature-discarding-of-WAL.patch
From 8be19b4e7f48f40a82e8f308d443f1e1540e33d2 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 27 Sep 2021 11:39:29 -0700
Subject: [PATCH v1 2/3] Harden test againt premature discarding of WAL
Not setting wal_keep_size when testing pg_basebackup makes the test
brittle. Set it large enough that in practice it should not need to
be increased even if future commits change how much data gets WAL
logged.
---
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index a2cb2a7679..64668a8b28 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -64,6 +64,7 @@ open my $conf, '>>', "$pgdata/postgresql.conf";
print $conf "max_replication_slots = 10\n";
print $conf "max_wal_senders = 10\n";
print $conf "wal_level = replica\n";
+print $conf "wal_keep_size = 10GB\n";
close $conf;
$node->restart;
--
2.21.1 (Apple Git-122.3)
v1-0003-Harden-and-enable-bloom-TAP-test.patch
From 6a6fdd8e215b579ed4404e85b3d43ce9fb4f526d Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 27 Sep 2021 12:25:47 -0700
Subject: [PATCH v1 3/3] Harden and enable bloom TAP test
Create a physical replication slot between primary and standby that
was missing in contrib/bloom's wal TAP test. This test was disabled
with a comment about some buildfarm animals failing. Enable the
test to see if the buildfarm will now be stable.
---
contrib/bloom/Makefile | 4 +---
contrib/bloom/t/001_wal.pl | 3 +++
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/contrib/bloom/Makefile b/contrib/bloom/Makefile
index 6d7a612fe4..8a781e4388 100644
--- a/contrib/bloom/Makefile
+++ b/contrib/bloom/Makefile
@@ -16,9 +16,7 @@ PGFILEDESC = "bloom access method - signature file based index"
REGRESS = bloom
-# Disable TAP tests for this module for now, as these are unstable on several
-# buildfarm environments.
-# TAP_TESTS = 1
+TAP_TESTS = 1
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/bloom/t/001_wal.pl b/contrib/bloom/t/001_wal.pl
index 55ad35926f..ef5f8b5283 100644
--- a/contrib/bloom/t/001_wal.pl
+++ b/contrib/bloom/t/001_wal.pl
@@ -46,6 +46,8 @@ SELECT * FROM tst WHERE i = 7 AND t = 'e';
$node_primary = PostgresNode->new('primary');
$node_primary->init(allows_streaming => 1);
$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby')");
my $backup_name = 'my_backup';
# Take backup
@@ -55,6 +57,7 @@ $node_primary->backup($backup_name);
$node_standby = PostgresNode->new('standby');
$node_standby->init_from_backup($node_primary, $backup_name,
has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "primary_slot_name = 'standby'");
$node_standby->start;
# Create some bloom index on primary
--
2.21.1 (Apple Git-122.3)
Mark Dilger <mark.dilger@enterprisedb.com> writes:
Here is a patch set, one patch per test. The third patch enables its test in the Makefile, which is commented as having been disabled due to the test being unstable in the build farm. Re-enabling the test might be wrong, since the instability might not have been due to WAL being recycled early. I didn't find enough history about why that test was disabled, so trying it in the build farm again is the best I can suggest. Maybe somebody else on this list knows more?
Digging in the archives where the commit points, I find
/messages/by-id/20181126025125.GH1776@paquier.xyz
which says there was an unexpected result on my animal longfin.
I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
just now, and it passed fine. The laptop is not exactly the same
as longfin was in 2018, but it ought to be close enough. Not sure
what to make of that --- maybe the failure is only intermittent,
or else we fixed the underlying issue since then.
I'm a little inclined to re-enable the test without your other
changes, just to see what happens.
regards, tom lane
On Sep 27, 2021, at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm a little inclined to re-enable the test without your other
changes, just to see what happens.
That sounds like a good idea. Even if it passes at first, I'd prefer to leave it for a week or more to have a better sense of how stable it is. Applying my patches too soon would just add more variables to a not-so-well understood situation.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
On Sep 27, 2021, at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm a little inclined to re-enable the test without your other
changes, just to see what happens.
That sounds like a good idea. Even if it passes at first, I'd prefer to leave it for a week or more to have a better sense of how stable it is. Applying my patches too soon would just add more variables to a not-so-well understood situation.
Done. I shall retire to a safe viewing distance and observe the
buildfarm.
regards, tom lane
On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
just now, and it passed fine. The laptop is not exactly the same
as longfin was in 2018, but it ought to be close enough. Not sure
what to make of that --- maybe the failure is only intermittent,
or else we fixed the underlying issue since then.
Honestly, I have no idea what change in the backend matters here. And
it is not like bloom has changed in any significant way since d3c09b9.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
just now, and it passed fine. The laptop is not exactly the same
as longfin was in 2018, but it ought to be close enough. Not sure
what to make of that --- maybe the failure is only intermittent,
or else we fixed the underlying issue since then.
Honestly, I have no idea what change in the backend matters here. And
it is not like bloom has changed in any significant way since d3c09b9.
I went so far as to check out 03faa4a8dd on longfin's host, and I find
that I cannot reproduce the failure shown at
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25+23%3A59%3A03
So that's the same hardware, and identical PG source tree, and different
results. This seems to leave only two theories standing:
1. It was a since-fixed macOS bug. (Unlikely, especially if we also saw
it on other platforms.)
2. The failure manifested only in the buildfarm, not under manual "make
check". This is somewhat more plausible, especially since subsequent
buildfarm script changes might then explain why it went away. But I have
no idea what the "subsequent script changes" might've been.
regards, tom lane
On 9/27/21 10:20 PM, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:
I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
just now, and it passed fine. The laptop is not exactly the same
as longfin was in 2018, but it ought to be close enough. Not sure
what to make of that --- maybe the failure is only intermittent,
or else we fixed the underlying issue since then.
Honestly, I have no idea what change in the backend matters here. And
it is not like bloom has changed in any significant way since d3c09b9.
I went so far as to check out 03faa4a8dd on longfin's host, and I find
that I cannot reproduce the failure shown at
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25+23%3A59%3A03
So that's the same hardware, and identical PG source tree, and different
results. This seems to leave only two theories standing:
1. It was a since-fixed macOS bug. (Unlikely, especially if we also saw
it on other platforms.)
2. The failure manifested only in the buildfarm, not under manual "make
check". This is somewhat more plausible, especially since subsequent
buildfarm script changes might then explain why it went away. But I have
no idea what the "subsequent script changes" might've been.
Nothing I can think of.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
I wrote:
So that's the same hardware, and identical PG source tree, and different
results. This seems to leave only two theories standing:
I forgot theory 3: it's intermittent. Apparently the probability has
dropped a lot since 2018, but behold:
(with successful runs just before and after this one, on the same
animal)
Note that the delta is not exactly like the previous result, either.
So there's more than one symptom, but in any case it seems like
we have an issue in WAL replay. I wonder whether it's bloom's fault
or a core bug.
regards, tom lane
On Sep 28, 2021, at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder whether it's bloom's fault
or a core bug.
Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.
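If one wanted the comparison to be order-insensitive without touching the queries, sorting the returned lines on both sides before the is() would do it. A sketch of what I mean, not necessarily a proposal:

# Hypothetical: compare row sets rather than raw result strings.
my $sorted_primary = join("\n", sort split(/\n/, $primary_result));
my $sorted_standby = join("\n", sort split(/\n/, $standby_result));
is($sorted_primary, $sorted_standby, "$test_name: query result matches");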
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I wrote:
So there's more than one symptom, but in any case it seems like
we have an issue in WAL replay. I wonder whether it's bloom's fault
or a core bug.
Actually ... I bet it's just the test script's fault. It waits for the
standby to catch up like this:
my $caughtup_query =
"SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
$node_primary->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby 1 to catch up";
which seems like completely the wrong condition. Don't we need the
standby to have *replayed* the WAL, not merely written it to disk?
I'm also wondering why this doesn't use wait_for_catchup, instead
of reinventing the query to use.
regards, tom lane
Mark Dilger <mark.dilger@enterprisedb.com> writes:
Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.
Well, it's forcing a bitmap scan, so what we're getting is the native
ordering of a bitmapscan result. That should match, given that what
we're doing is physical replication. I think adding ORDER BY would
be more likely to obscure real issues than to hide test instability.
regards, tom lane
On Sep 28, 2021, at 11:07 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.
Taking the output from the buildfarm page, parsing out the first test's results and comparing got vs. expected for this test:
is($primary_result, $standby_result, "$test_name: query result matches");
the primary result contained every row the standby returned, plus additional rows; the only differences are rows present on the primary but missing from the standby. That seems consistent with the view that the query on the standby ran before all the data had replicated across.
However, the missing rows all have column i equal to either 0 or 3, even though the test round-robins i=0..9 as it performs the inserts, so I would not expect the WAL for the inserts to cluster around any particular value of i. The DELETE and VACUUM commands do operate on a single value of i, which would make sense if data deleted successfully on the primary failed to be deleted on the standby, but then I'd expect the standby to have more rows, not fewer.
Perhaps having the bloom index messed up answers that, though. I think it should be easy enough to get the path to the heap main table fork and the bloom main index fork for both the primary and standby and do a filesystem comparison as part of the wal test. That would tell us if they differ, and also if the differences are limited to just one or the other.
I'll go write that up....
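Roughly along these lines, perhaps (an untested sketch; the helper name and the explicit checkpoints are my own additions, not anything the test does today):

use File::Compare qw(compare);

# Untested sketch: compare the main fork of a relation on both nodes.
sub relation_files_match
{
    my ($primary, $standby, $relname) = @_;

    # Flush everything to disk so a byte-wise comparison is meaningful.
    $primary->safe_psql('postgres', 'CHECKPOINT');
    $standby->safe_psql('postgres', 'CHECKPOINT');

    # Physical replication keeps the relfilenode identical on both sides,
    # so the relative path reported by the primary works for the standby too.
    my $relpath = $primary->safe_psql('postgres',
        "SELECT pg_relation_filepath('$relname')");

    return compare($primary->data_dir . "/$relpath",
        $standby->data_dir . "/$relpath") == 0;
}

The same helper could be pointed at both the heap and the bloom index to see which of the two diverges.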
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
Perhaps having the bloom index messed up answers that, though. I think it should be easy enough to get the path to the heap main table fork and the bloom main index fork for both the primary and standby and do a filesystem comparison as part of the wal test. That would tell us if they differ, and also if the differences are limited to just one or the other.
I think that's probably overkill, and definitely out-of-scope for
contrib/bloom. If we fear that WAL replay is not reproducing the data
accurately, we should be testing for that in some more centralized place.
Anyway, I confirmed my diagnosis by adding a delay in WAL apply
(0001 below); that makes this test fall over spectacularly.
And 0002 fixes it. So I propose to push 0002 as soon as the
v14 release freeze ends.
Should we back-patch 0002? I'm inclined to think so. Should
we then also back-patch enablement of the bloom test? Less
sure about that, but I'd lean to doing so. A test that appears
to be there but isn't actually invoked is pretty misleading.
regards, tom lane
Attachments:
0001-break-it.patch
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..eecbe57aee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7370,6 +7370,9 @@ StartupXLOG(void)
{
bool switchedTLI = false;
+ if (random() < INT_MAX/100)
+ pg_usleep(100000);
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG ||
(rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
0002-fix-it.patch
diff --git a/contrib/bloom/t/001_wal.pl b/contrib/bloom/t/001_wal.pl
index 55ad35926f..be8916a8eb 100644
--- a/contrib/bloom/t/001_wal.pl
+++ b/contrib/bloom/t/001_wal.pl
@@ -16,12 +16,10 @@ sub test_index_replay
{
my ($test_name) = @_;
+ local $Test::Builder::Level = $Test::Builder::Level + 1;
+
# Wait for standby to catch up
- my $applname = $node_standby->name;
- my $caughtup_query =
- "SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
- $node_primary->poll_query_until('postgres', $caughtup_query)
- or die "Timed out while waiting for standby 1 to catch up";
+ $node_primary->wait_for_catchup($node_standby);
my $queries = qq(SET enable_seqscan=off;
SET enable_bitmapscan=on;
On Tue, Sep 28, 2021 at 03:00:13PM -0400, Tom Lane wrote:
Should we back-patch 0002? I'm inclined to think so. Should
we then also back-patch enablement of the bloom test? Less
sure about that, but I'd lean to doing so. A test that appears
to be there but isn't actually invoked is pretty misleading.
A backpatch sounds adapted to me for both patches. The only risk that
I could see here is somebody implementing a new test by copy-pasting
this one if we were to keep things as they are on stable branches.
--
Michael
On 9/28/21, 8:17 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
On Tue, Sep 28, 2021 at 03:00:13PM -0400, Tom Lane wrote:
Should we back-patch 0002? I'm inclined to think so. Should
we then also back-patch enablement of the bloom test? Less
sure about that, but I'd lean to doing so. A test that appears
to be there but isn't actually invoked is pretty misleading.A backpatch sounds adapted to me for both patches. The only risk that
I could see here is somebody implementing a new test by copy-pasting
this one if we were to keep things as they are on stable branches.
I found this thread via the Commitfest entry
(https://commitfest.postgresql.org/35/3333/), and I also see that the
following patches have been committed:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7d1aa6b
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6bc6bd4
However, it looks like there are a couple of other patches upthread [0]
for hardening the tests against reduced settings of
max_wal_size. Do we intend to proceed with those, or should we just
close out the Commitfest entry?
Nathan
[0]: /messages/by-id/C1D227C2-C271-4310-8C85-C5368C298622@enterprisedb.com
On Oct 21, 2021, at 3:23 PM, Bossart, Nathan <bossartn@amazon.com> wrote:
Do we intend to proceed with those, or should we just
close out the Commmitfest entry?
I have withdrawn the patch. The issues were intermittent on the buildfarm, and committing other changes along with what Tom already committed would seem to confuse matters if any new issues were to arise. We can come back to this sometime in the future, if need be.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company