SKIP_LOCKED test causes random buildfarm failures

Started by Tom Laneover 6 years ago11 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

Every once in awhile we get failures like this one:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2019-11-05%2008%3A27%3A27

diff -U3 /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/../pgsql/src/test/regress/expected/vacuum.out /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/src/test/regress/results/vacuum.out
--- /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/../pgsql/src/test/regress/expected/vacuum.out	2019-08-11 03:02:18.921535948 -0700
+++ /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/src/test/regress/results/vacuum.out	2019-11-05 00:50:42.381244885 -0800
@@ -204,6 +204,7 @@
 -- SKIP_LOCKED option
 VACUUM (SKIP_LOCKED) vactst;
 VACUUM (SKIP_LOCKED, FULL) vactst;
+WARNING:  skipping vacuum of "vactst" --- lock not available
 ANALYZE (SKIP_LOCKED) vactst;
 -- ensure VACUUM and ANALYZE don't have a problem with serializable
 SET default_transaction_isolation = serializable;

No doubt this is a conflict with autovacuum. There are two reasonable
ways to remove the test instability:

* Crank up client_min_messages to more than WARNING for this test
stanza.

* Downgrade the "skipping" messages to DEBUG1 or less.

I kind of wonder why we are issuing a "WARNING" when the statement
does exactly what you asked it to, anyway. At most I'd expect
that to be a NOTICE condition.

regards, tom lane

#2Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#1)
Re: SKIP_LOCKED test causes random buildfarm failures

Hi,

On 2019-11-06 16:54:38 -0500, Tom Lane wrote:

Every once in awhile we get failures like this one:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2019-11-05%2008%3A27%3A27

diff -U3 /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/../pgsql/src/test/regress/expected/vacuum.out /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/src/test/regress/results/vacuum.out
--- /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/../pgsql/src/test/regress/expected/vacuum.out	2019-08-11 03:02:18.921535948 -0700
+++ /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/src/test/regress/results/vacuum.out	2019-11-05 00:50:42.381244885 -0800
@@ -204,6 +204,7 @@
-- SKIP_LOCKED option
VACUUM (SKIP_LOCKED) vactst;
VACUUM (SKIP_LOCKED, FULL) vactst;
+WARNING:  skipping vacuum of "vactst" --- lock not available
ANALYZE (SKIP_LOCKED) vactst;
-- ensure VACUUM and ANALYZE don't have a problem with serializable
SET default_transaction_isolation = serializable;

No doubt this is a conflict with autovacuum. There are two reasonable
ways to remove the test instability:

I assume you consider disabling autovacuum for that table not a
reasonable approach? Due to the danger that it could end up still
running, e.g. due to anti-wraparound or such?

* Crank up client_min_messages to more than WARNING for this test
stanza.

Hm.

* Downgrade the "skipping" messages to DEBUG1 or less.

I kind of wonder why we are issuing a "WARNING" when the statement
does exactly what you asked it to, anyway. At most I'd expect
that to be a NOTICE condition.

I don't know what lead us to doing so, but it doesn't seem reasonable to
allow the user to see whether the table has actually been vacuumed. I
would assume that one uses SKIP_LOCKED partially to avoid unnecessary
impacts in production due to other tasks starting to block on e.g. a
VACUUM FULL, even though without the "ordered queueing" everything could
just go on working fine. I'm not sure that indicates whether WARNING or
NOTICE is the best choice.

So I'd be inclined to go with the client_min_messages approach?

Greetings,

Andres Freund

#3Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#2)
Re: SKIP_LOCKED test causes random buildfarm failures

On Wed, Nov 06, 2019 at 03:01:11PM -0800, Andres Freund wrote:

I don't know what lead us to doing so, but it doesn't seem reasonable to
allow the user to see whether the table has actually been vacuumed. I
would assume that one uses SKIP_LOCKED partially to avoid unnecessary
impacts in production due to other tasks starting to block on e.g. a
VACUUM FULL, even though without the "ordered queueing" everything could
just go on working fine. I'm not sure that indicates whether WARNING or
NOTICE is the best choice.

Good question. That's a historical choice, still I have seen cases
where those warnings are helpful while not making the logs too
verbose to see some congestion in the jobs.

So I'd be inclined to go with the client_min_messages approach?

The main purpose of the tests in regress/ is to check after the
grammar, so using client_min_messages sounds like a plan. We have
a second set of tests in isolation/ where I would actually like to
disable autovacuum by default on a subset of tables. Thoughts about
the attached?
--
Michael

Attachments:

vacuum-lock-tests.patchtext/x-diff; charset=us-asciiDownload+10-0
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#3)
Re: SKIP_LOCKED test causes random buildfarm failures

Michael Paquier <michael@paquier.xyz> writes:

On Wed, Nov 06, 2019 at 03:01:11PM -0800, Andres Freund wrote:

I don't know what lead us to doing so, but it doesn't seem reasonable to
allow the user to see whether the table has actually been vacuumed. I
would assume that one uses SKIP_LOCKED partially to avoid unnecessary
impacts in production due to other tasks starting to block on e.g. a
VACUUM FULL, even though without the "ordered queueing" everything could
just go on working fine. I'm not sure that indicates whether WARNING or
NOTICE is the best choice.

Good question. That's a historical choice, still I have seen cases
where those warnings are helpful while not making the logs too
verbose to see some congestion in the jobs.

I kind of feel that NOTICE is more semantically appropriate, but
perhaps there's an argument for keeping it at WARNING.

So I'd be inclined to go with the client_min_messages approach?

The main purpose of the tests in regress/ is to check after the
grammar, so using client_min_messages sounds like a plan. We have
a second set of tests in isolation/ where I would actually like to
disable autovacuum by default on a subset of tables. Thoughts about
the attached?

I do not want to fix this in the main tests by disabling autovacuum,
because that'd actually reduce the tests' cross-section. The fact
that this happens occasionally is a Good Thing for verifying that the
code paths actually work. So it seems that there's a consensus on
adjusting client_min_messages to fix the test output instability ---
but we need to agree on whether to do s/WARNING/NOTICE/ first, so we
can know what to set client_min_messages to.

As for the case in the isolation test, shouldn't we also use
client_min_messages there, rather than prevent the conflict
from arising? Or would that test fail in some larger way if
autovacuum gets in the way?

regards, tom lane

#5Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#4)
Re: SKIP_LOCKED test causes random buildfarm failures

On Thu, Nov 07, 2019 at 11:50:25AM -0500, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

Good question. That's a historical choice, still I have seen cases
where those warnings are helpful while not making the logs too
verbose to see some congestion in the jobs.

I kind of feel that NOTICE is more semantically appropriate, but
perhaps there's an argument for keeping it at WARNING.

Perhaps. Well, that's the same level as the one used after the
permission checks on the relation vacuumed.

I do not want to fix this in the main tests by disabling autovacuum,
because that'd actually reduce the tests' cross-section. The fact
that this happens occasionally is a Good Thing for verifying that the
code paths actually work. So it seems that there's a consensus on
adjusting client_min_messages to fix the test output instability ---
but we need to agree on whether to do s/WARNING/NOTICE/ first, so we
can know what to set client_min_messages to.

Makes sense.

As for the case in the isolation test, shouldn't we also use
client_min_messages there, rather than prevent the conflict
from arising? Or would that test fail in some larger way if
autovacuum gets in the way?

I think that there is no risk regarding the stability of the output
because we use LOCK before from a first session on the relation to
vacuum in a second session. So if autovacuum runs in parallel, the
consequence would be a small slow down while waiting on the lock to be
taken. And per the way the test is ordered, it seems to me that it
makes the most sense to disable autovacuum as it would just get in the
way. In this case I think that it is actually better to show the
messages as that makes the tests more verbose and we make sure to test
their format, even if we could just rely on the fact that VACUUM
should just be blocking or non-blocking.
--
Michael

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: SKIP_LOCKED test causes random buildfarm failures

I wrote:

Michael Paquier <michael@paquier.xyz> writes:

Good question. That's a historical choice, still I have seen cases
where those warnings are helpful while not making the logs too
verbose to see some congestion in the jobs.

I kind of feel that NOTICE is more semantically appropriate, but
perhaps there's an argument for keeping it at WARNING.

Well, I'm not hearing any groundswell of support for changing the
message level, so let's leave that as-is and just see about
stabilizing the tests.

The main purpose of the tests in regress/ is to check after the
grammar, so using client_min_messages sounds like a plan. We have
a second set of tests in isolation/ where I would actually like to
disable autovacuum by default on a subset of tables. Thoughts about
the attached?

After looking more closely at the isolation test, I agree that adding
the "ALTER TABLE ... SET (autovacuum_enabled = false)" bits to it is
a good idea. The LOCK operations should make that irrelevant for
part1, but there's at least a theoretical hazard for part2.
(Actually, is "autovacuum_enabled = false" really sufficient to
keep autovacuum away? It'd probably lock the table for long enough
to examine its reloptions, so it seems like all we're doing here is
reducing the window for trouble a little bit. Still, maybe that's
worthwhile.)

As for the SKIP_LOCKED tests in vacuum.sql, what I now propose is that
we just remove them. I do not see that they're offering any coverage
that's not completely redundant with this isolation test. We don't
need to spend cycles every day on that.

regards, tom lane

#7Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: SKIP_LOCKED test causes random buildfarm failures

Hi,

On 2019-11-14 13:37:44 -0500, Tom Lane wrote:

I wrote:

Michael Paquier <michael@paquier.xyz> writes:

Good question. That's a historical choice, still I have seen cases
where those warnings are helpful while not making the logs too
verbose to see some congestion in the jobs.

I kind of feel that NOTICE is more semantically appropriate, but
perhaps there's an argument for keeping it at WARNING.

Well, I'm not hearing any groundswell of support for changing the
message level, so let's leave that as-is and just see about
stabilizing the tests.

Ok.

The main purpose of the tests in regress/ is to check after the
grammar, so using client_min_messages sounds like a plan. We have
a second set of tests in isolation/ where I would actually like to
disable autovacuum by default on a subset of tables. Thoughts about
the attached?

After looking more closely at the isolation test, I agree that adding
the "ALTER TABLE ... SET (autovacuum_enabled = false)" bits to it is
a good idea. The LOCK operations should make that irrelevant for
part1, but there's at least a theoretical hazard for part2.
(Actually, is "autovacuum_enabled = false" really sufficient to
keep autovacuum away? It'd probably lock the table for long enough
to examine its reloptions, so it seems like all we're doing here is
reducing the window for trouble a little bit. Still, maybe that's
worthwhile.)

+1

As for the SKIP_LOCKED tests in vacuum.sql, what I now propose is that
we just remove them. I do not see that they're offering any coverage
that's not completely redundant with this isolation test. We don't
need to spend cycles every day on that.

-0 on that, I'd rather just put a autovacuum_enabled = false for
them. They're quick enough, and it's nice to have decent coverage of
various options within the plain regression tests when possible.

Greetings,

Andres Freund

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: SKIP_LOCKED test causes random buildfarm failures

Andres Freund <andres@anarazel.de> writes:

On 2019-11-14 13:37:44 -0500, Tom Lane wrote:

As for the SKIP_LOCKED tests in vacuum.sql, what I now propose is that
we just remove them. I do not see that they're offering any coverage
that's not completely redundant with this isolation test. We don't
need to spend cycles every day on that.

-0 on that, I'd rather just put a autovacuum_enabled = false for
them. They're quick enough, and it's nice to have decent coverage of
various options within the plain regression tests when possible.

If we're going to keep them in vacuum.sql, we should use the
client_min_messages fix there, as that's a full solution not just
reducing the window. But I don't agree that these tests are worth
the cycles, given the coverage elsewhere. The probability of breaking
this option is just not high enough to justify core-regression-test
coverage.

regards, tom lane

#9Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#8)
Re: SKIP_LOCKED test causes random buildfarm failures

On Thu, Nov 14, 2019 at 03:20:09PM -0500, Tom Lane wrote:

If we're going to keep them in vacuum.sql, we should use the
client_min_messages fix there, as that's a full solution not just
reducing the window. But I don't agree that these tests are worth
the cycles, given the coverage elsewhere. The probability of breaking
this option is just not high enough to justify core-regression-test
coverage.

I would rather keep the solution with client_min_messages, and the
tests in vacuum.sql to keep those checks for the grammar parsing. So
this basically brings us back to use the patch I proposed here:
/messages/by-id/20191107013942.GA1768@paquier.xyz

Any objections?
--
Michael

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#9)
Re: SKIP_LOCKED test causes random buildfarm failures

Michael Paquier <michael@paquier.xyz> writes:

I would rather keep the solution with client_min_messages, and the
tests in vacuum.sql to keep those checks for the grammar parsing. So
this basically brings us back to use the patch I proposed here:
/messages/by-id/20191107013942.GA1768@paquier.xyz

OK.

regards, tom lane

#11Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#10)
Re: SKIP_LOCKED test causes random buildfarm failures

On Fri, Nov 15, 2019 at 11:22:20AM -0500, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

I would rather keep the solution with client_min_messages, and the
tests in vacuum.sql to keep those checks for the grammar parsing. So
this basically brings us back to use the patch I proposed here:
/messages/by-id/20191107013942.GA1768@paquier.xyz

OK.

Thanks, applied and back-patched.
--
Michael