Fixing WAL instability in various TAP tests

Started by Mark Dilgerover 4 years ago25 messageshackers
Jump to latest
#1Mark Dilger
mark.dilger@enterprisedb.com

Hackers,

A few TAP tests in the project appear to be sensitive to reductions of the PostgresNode's max_wal_size setting, resulting in tests failing due to wal files having been removed too soon. The failures in the logs typically are of the "requested WAL segment %s has already been removed" variety. I would expect tests which fail under legal alternate GUC settings to be hardened to explicitly set the GUCs as they need, rather than implicitly relying on the defaults. As far as missing WAL files go, I would expect the TAP test to prevent this with the use of replication slots or some other mechanism, and not simply to rely on checkpoints not happening too soon. I'm curious if others on this list disagree with that point of view.

Failures in src/test/recovery/t/015_promotion_pages.pl can be fixed by creating a physical replication slot on node "alpha" and using it from node "beta", a technique already used in other TAP tests and apparently merely overlooked in this one.

The first two tests in src/bin/pg_basebackup/t fail, and it's not clear that physical replication slots are the appropriate solution, since no replication is happening. It's not immediately obvious that the tests are at fault anyway. On casual inspection, it seems they might be detecting a live bug which simply doesn't manifest under larger values of max_wal_size. Test 010 appears to show a bug with `pg_basebackup -X`, and test 020 with `pg_receivewal`.

The test in contrib/bloom/t/ is deliberately disabled in contrib/bloom/Makefile with a comment that the test is unstable in the buildfarm, but I didn't find anything to explain what exactly those buildfarm failures might have been when I chased down the email thread that gave rise to the related commit. That test happens to be stable on my laptop until I change GUC settings to both reduce max_wal_size=32MB and to set wal_consistency_checking=all.

Thoughts?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#2Noah Misch
noah@leadboat.com
In reply to: Mark Dilger (#1)
Re: Fixing WAL instability in various TAP tests

On Fri, Sep 24, 2021 at 05:33:13PM -0700, Mark Dilger wrote:

A few TAP tests in the project appear to be sensitive to reductions of the
PostgresNode's max_wal_size setting, resulting in tests failing due to wal
files having been removed too soon. The failures in the logs typically are
of the "requested WAL segment %s has already been removed" variety. I would
expect tests which fail under legal alternate GUC settings to be hardened to
explicitly set the GUCs as they need, rather than implicitly relying on the
defaults.

That is not the general practice in PostgreSQL tests today. The buildfarm
exercises some settings, so we keep the tests clean for those. Coping with
max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
hardening tests against all possible settings. On the plus side, it would let
us run one buildfarm member that sets every setting to its min_val or
enumvals[1] and another member that elects enumvals[cardinality(enumvals)] or
max_val. We'd catch some real bugs that way. On the minus side, every
nontrivial test suite addition would need to try those two cases before commit
or risk buildfarm wrath. I don't know whether the bugs found would pay for
that trouble. (There's also a less-important loss around the ability to
exercise a setting and manually inspect the results. For example, I sometimes
test parallel_tuple_cost=0 parallel_setup_cost=0 and confirm a lack of
crashes. After hardening, that would require temporary source code edits to
remove the hardening. That's fine, though.)

#3Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Noah Misch (#2)
Re: Fixing WAL instability in various TAP tests

On Sep 24, 2021, at 10:21 PM, Noah Misch <noah@leadboat.com> wrote:

I would
expect tests which fail under legal alternate GUC settings to be hardened to
explicitly set the GUCs as they need, rather than implicitly relying on the
defaults.

That is not the general practice in PostgreSQL tests today. The buildfarm
exercises some settings, so we keep the tests clean for those. Coping with
max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
hardening tests against all possible settings.

Leaving the tests brittle wastes developer time.

I ran into this problem when I changed the storage underlying bloom indexes and ran the contrib/bloom/t/001_wal.pl test with wal_consistency_checking=all. That caused the test to fail with errors about missing wal files, and it took time to backtrack and see that the test fails under this setting even before applying my storage layer changes. Ordinarily, failures about missing wal files would have led me to suspect the TAP test sooner, but since I had mucked around with storage and wal it initially seemed plausible that my code changes were the problem. The real problem is that a replication slot is not used in the test.

The failure in src/test/recovery/t/015_promotion_pages.pl is also that a replication slot should be used but is not.

The failure in src/bin/pg_basebackup/t/010_pg_basebackup.pl stems from not heeding the documented requirement for pg_basebackup -X fetch that the wal_keep_size "be set high enough that the required log data is not removed before the end of the backup". It's just assuming that it will be, because that tends to be true under default GUC settings. I think this can be fixed by setting wal_keep_size=<SOMETHING_BIG_ENOUGH>, but (a) you say this is not the general practice in PostgreSQL tests today, and (b) there doesn't seem to be any principled way to decide what value would be big enough. Sure, we can use something that is big enough in practice, and we'll probably have to go with that, but it feels like we're just papering over the problem.

I'm inclined to guess that the problem in src/bin/pg_basebackup/t/020_pg_receivewal.pl is similar.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#3)
Re: Fixing WAL instability in various TAP tests

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Sep 24, 2021, at 10:21 PM, Noah Misch <noah@leadboat.com> wrote:

I would
expect tests which fail under legal alternate GUC settings to be hardened to
explicitly set the GUCs as they need, rather than implicitly relying on the
defaults.

That is not the general practice in PostgreSQL tests today. The buildfarm
exercises some settings, so we keep the tests clean for those. Coping with
max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
hardening tests against all possible settings.

Leaving the tests brittle wastes developer time.

Trying to make them proof against all possible settings would waste
a lot more time, though.

regards, tom lane

#5Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#4)
Re: Fixing WAL instability in various TAP tests

On Sep 25, 2021, at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Leaving the tests brittle wastes developer time.

Trying to make them proof against all possible settings would waste
a lot more time, though.

You may be right, but the conversation about "all possible settings" was started by Noah. I was really just talking about tests that depend on wal files not being removed, but taking no action to guarantee that, merely trusting that under default settings they won't be. I can't square that design against other TAP tests that do take measures to prevent wal files being removed. Why is the precaution taken in some tests but not others? If this is intentional, shouldn't some comment in the tests without such precautions explain that choice? Are they intentionally testing that the default GUC wal size settings and wal verbosity won't break the test?

This isn't a rhetorical question:

In src/test/recovery/t/015_promotion_pages.pl, the comments talk about the how checkpoints impact what happens on the standby. The test issues an explicit checkpoint on the primary, and again later on the standby, so it is unclear if that's what the comments refer to, or if they also refer to implicit expectations about when/if other checkpoints will happen. The test breaks when I change the GUC settings, but I can fix that breakage by adding a replication slot to the test. Have I broken the purpose of the test by doing so, though? Does using a replication slot to force the wal to not be removed early break what the test is designed to check?

The other tests raise similar questions. Is the brittleness intentional?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#5)
Re: Fixing WAL instability in various TAP tests

Mark Dilger <mark.dilger@enterprisedb.com> writes:

You may be right, but the conversation about "all possible settings" was started by Noah. I was really just talking about tests that depend on wal files not being removed, but taking no action to guarantee that, merely trusting that under default settings they won't be. I can't square that design against other TAP tests that do take measures to prevent wal files being removed. Why is the precaution taken in some tests but not others?

If we are doing something about that in some test cases, I'd agree with
doing the same thing in others that need it. It seems more likely to
be an oversight than an intentional test condition.

regards, tom lane

#7Noah Misch
noah@leadboat.com
In reply to: Mark Dilger (#5)
Re: Fixing WAL instability in various TAP tests

On Sat, Sep 25, 2021 at 08:20:06AM -0700, Mark Dilger wrote:

On Sep 25, 2021, at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Leaving the tests brittle wastes developer time.

Trying to make them proof against all possible settings would waste
a lot more time, though.

You may be right, but the conversation about "all possible settings" was
started by Noah.

You wrote, "I would expect tests which fail under legal alternate GUC settings
to be hardened to explicitly set the GUCs as they need, rather than implicitly
relying on the defaults." I read that as raising the general principle, not
just a narrow argument about max_wal_size. We can discontinue talking about
the general principle and focus on max_wal_size.

I was really just talking about tests that depend on wal
files not being removed, but taking no action to guarantee that, merely
trusting that under default settings they won't be.

As I said, +1 for making tests pass under the min_val of max_wal_size. If you
want to introduce a max_wal_size=2 buildfarm member so it stays that way, +1
for that as well.

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#7)
Re: Fixing WAL instability in various TAP tests

Noah Misch <noah@leadboat.com> writes:

On Sat, Sep 25, 2021 at 08:20:06AM -0700, Mark Dilger wrote:

You may be right, but the conversation about "all possible settings" was
started by Noah.

You wrote, "I would expect tests which fail under legal alternate GUC settings
to be hardened to explicitly set the GUCs as they need, rather than implicitly
relying on the defaults." I read that as raising the general principle, not
just a narrow argument about max_wal_size.

As did I.

We can discontinue talking about
the general principle and focus on max_wal_size.

It is worth stopping to think about whether there are adjacent settings
that need similar treatment.

In general, it seems like "premature discarding of WAL segments" is
something akin to "premature timeout" errors, and we've got a pretty
aggressive policy about preventing those. There are a lot of settings
that I'd *not* be in favor of trying to be bulletproof about, because
it doesn't seem worth the trouble; but perhaps this one is.

regards, tom lane

#9Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Noah Misch (#7)
Re: Fixing WAL instability in various TAP tests

On Sep 25, 2021, at 9:00 AM, Noah Misch <noah@leadboat.com> wrote:

You may be right, but the conversation about "all possible settings" was
started by Noah.

You wrote, "I would expect tests which fail under legal alternate GUC settings
to be hardened to explicitly set the GUCs as they need, rather than implicitly
relying on the defaults." I read that as raising the general principle, not
just a narrow argument about max_wal_size.

In the first draft of my email to Tom, I had language about my inartful crafting of my original post that led Noah to respond as he did.... I couldn't quite figure out how to phrase that without distracting from the main point. I don't think you were (much) offended, but my apologies for any perceived fingerpointing.

I also don't have a problem with your idea of testing in the build farm with some animals having the gucs set to minimum values and some to maximum and so forth. I like that idea generally, though don't feel competent to predict how much work that would be to maintain, so I'm just deferring to Tom's and your judgement about that.

My inartful first post was really meant to say, "here is a problem that I perceive about tap tests vis-a-vis wal files, do people disagree with me that this is a problem, and would patches to address the problem be welcome?" I took Tom's response to be, "yeah, go ahead", and am mostly waiting for the weekend to be over to see if anybody else has anything to say about it.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#9)
Re: Fixing WAL instability in various TAP tests

On Sep 25, 2021, at 11:04 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I took Tom's response to be, "yeah, go ahead", and am mostly waiting for the weekend to be over to see if anybody else has anything to say about it.

Here is a patch set, one patch per test. The third patch enables its test in the Makefile, which is commented as having been disabled due to the test being unstable in the build farm. Re-enabling the test might be wrong, since the instability might not have been due to WAL being recycled early. I didn't find enough history about why that test was disabled, so trying it in the build farm again is the best I can suggest. Maybe somebody else on this list knows more?

Attachments:

v1-0001-Harden-test-againt-premature-discarding-of-WAL.patchapplication/octet-stream; name=v1-0001-Harden-test-againt-premature-discarding-of-WAL.patch; x-unix-mode=0644Download+3-1
v1-0002-Harden-test-againt-premature-discarding-of-WAL.patchapplication/octet-stream; name=v1-0002-Harden-test-againt-premature-discarding-of-WAL.patch; x-unix-mode=0644Download+1-1
v1-0003-Harden-and-enable-bloom-TAP-test.patchapplication/octet-stream; name=v1-0003-Harden-and-enable-bloom-TAP-test.patch; x-unix-mode=0644Download+4-4
#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#10)
Re: Fixing WAL instability in various TAP tests

Mark Dilger <mark.dilger@enterprisedb.com> writes:

Here is a patch set, one patch per test. The third patch enables its test in the Makefile, which is commented as having been disabled due to the test being unstable in the build farm. Re-enabling the test might be wrong, since the instability might not have been due to WAL being recycled early. I didn't find enough history about why that test was disabled, so trying it in the build farm again is the best I can suggest. Maybe somebody else on this list knows more?

Digging in the archives where the commit points, I find

/messages/by-id/20181126025125.GH1776@paquier.xyz

which says there was an unexpected result on my animal longfin.
I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
just now, and it passed fine. The laptop is not exactly the same
as longfin was in 2018, but it ought to be close enough. Not sure
what to make of that --- maybe the failure is only intermittent,
or else we fixed the underlying issue since then.

I'm a little inclined to re-enable the test without your other
changes, just to see what happens.

regards, tom lane

#12Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#11)
Re: Fixing WAL instability in various TAP tests

On Sep 27, 2021, at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm a little inclined to re-enable the test without your other
changes, just to see what happens.

That sounds like a good idea. Even if it passes at first, I'd prefer to leave it for a week or more to have a better sense of how stable it is. Applying my patches too soon would just add more variables to a not-so-well understood situation.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#12)
Re: Fixing WAL instability in various TAP tests

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Sep 27, 2021, at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm a little inclined to re-enable the test without your other
changes, just to see what happens.

That sounds like a good idea. Even if it passes at first, I'd prefer to leave it for a week or more to have a better sense of how stable it is. Applying my patches too soon would just add more variables to a not-so-well understood situation.

Done. I shall retire to a safe viewing distance and observe the
buildfarm.

regards, tom lane

#14Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#11)
Re: Fixing WAL instability in various TAP tests

On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:

I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
just now, and it passed fine. The laptop is not exactly the same
as longfin was in 2018, but it ought to be close enough. Not sure
what to make of that --- maybe the failure is only intermittent,
or else we fixed the underlying issue since then.

Honestly, I have no idea what change in the backend matters here. And
it is not like bloom has changed in any significant way since d3c09b9.
--
Michael

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#14)
Re: Fixing WAL instability in various TAP tests

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:

I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
just now, and it passed fine. The laptop is not exactly the same
as longfin was in 2018, but it ought to be close enough. Not sure
what to make of that --- maybe the failure is only intermittent,
or else we fixed the underlying issue since then.

Honestly, I have no idea what change in the backend matters here. And
it is not like bloom has changed in any significant way since d3c09b9.

I went so far as to check out 03faa4a8dd on longfin's host, and I find
that I cannot reproduce the failure shown at

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&amp;dt=2018-11-25+23%3A59%3A03

So that's the same hardware, and identical PG source tree, and different
results. This seems to leave only two theories standing:

1. It was a since-fixed macOS bug. (Unlikely, especially if we also saw
it on other platforms.)

2. The failure manifested only in the buildfarm, not under manual "make
check". This is somewhat more plausible, especially since subsequent
buildfarm script changes might then explain why it went away. But I have
no idea what the "subsequent script changes" might've been.

regards, tom lane

#16Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#15)
Re: Fixing WAL instability in various TAP tests

On 9/27/21 10:20 PM, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Sep 27, 2021 at 04:19:27PM -0400, Tom Lane wrote:

I tried the same thing (i.e., re-enable bloom's TAP test) on my laptop
just now, and it passed fine. The laptop is not exactly the same
as longfin was in 2018, but it ought to be close enough. Not sure
what to make of that --- maybe the failure is only intermittent,
or else we fixed the underlying issue since then.

Honestly, I have no idea what change in the backend matters here. And
it is not like bloom has changed in any significant way since d3c09b9.

I went so far as to check out 03faa4a8dd on longfin's host, and I find
that I cannot reproduce the failure shown at

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&amp;dt=2018-11-25+23%3A59%3A03

So that's the same hardware, and identical PG source tree, and different
results. This seems to leave only two theories standing:

1. It was a since-fixed macOS bug. (Unlikely, especially if we also saw
it on other platforms.)

2. The failure manifested only in the buildfarm, not under manual "make
check". This is somewhat more plausible, especially since subsequent
buildfarm script changes might then explain why it went away. But I have
no idea what the "subsequent script changes" might've been.

Nothing I can think of.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#15)
Re: Fixing WAL instability in various TAP tests

I wrote:

So that's the same hardware, and identical PG source tree, and different
results. This seems to leave only two theories standing:

I forgot theory 3: it's intermittent. Apparently the probability has
dropped a lot since 2018, but behold:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&amp;dt=2021-09-28%2014%3A20%3A41

(with successful runs just before and after this one, on the same
animal)

Note that the delta is not exactly like the previous result, either.
So there's more than one symptom, but in any case it seems like
we have an issue in WAL replay. I wonder whether it's bloom's fault
or a core bug.

regards, tom lane

#18Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#17)
Re: Fixing WAL instability in various TAP tests

On Sep 28, 2021, at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it's bloom's fault
or a core bug.

Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: Fixing WAL instability in various TAP tests

I wrote:

So there's more than one symptom, but in any case it seems like
we have an issue in WAL replay. I wonder whether it's bloom's fault
or a core bug.

Actually ... I bet it's just the test script's fault. It waits for the
standby to catch up like this:

my $caughtup_query =
"SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
$node_primary->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby 1 to catch up";

which seems like completely the wrong condition. Don't we need the
standby to have *replayed* the WAL, not merely written it to disk?

I'm also wondering why this doesn't use wait_for_catchup, instead
of reinventing the query to use.

regards, tom lane

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#18)
Re: Fixing WAL instability in various TAP tests

Mark Dilger <mark.dilger@enterprisedb.com> writes:

Looking closer at the TAP test, it's not ORDERing the result set from the SELECTs on either node, but it is comparing the sets for stringwise equality, which is certainly order dependent.

Well, it's forcing a bitmap scan, so what we're getting is the native
ordering of a bitmapscan result. That should match, given that what
we're doing is physical replication. I think adding ORDER BY would
be more likely to obscure real issues than hide test instability.

regards, tom lane

#21Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#18)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#21)
#23Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#22)
#24Nathan Bossart
nathandbossart@gmail.com
In reply to: Michael Paquier (#23)
#25Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Nathan Bossart (#24)