pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

Started by Robert Haasover 8 years ago9 messageshackers
Jump to latest
#1Robert Haas
robertmhaas@gmail.com

When VACUUM or ANALYZE skips a concurrently dropped table, log it.

Hopefully, the additional logging will help avoid confusion that
could otherwise result.

Nathan Bossart, reviewed by Michael Paquier, Fabrízio Mello, and me

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/ab6eaee88420db58a948849d5a735997728d73a9

Modified Files
--------------
doc/src/sgml/config.sgml | 4 +-
src/backend/commands/analyze.c | 46 +++++++++++--
src/backend/commands/vacuum.c | 49 ++++++++++++--
.../isolation/expected/vacuum-concurrent-drop.out | 76 ++++++++++++++++++++++
src/test/isolation/isolation_schedule | 1 +
.../isolation/specs/vacuum-concurrent-drop.spec | 45 +++++++++++++
6 files changed, 208 insertions(+), 13 deletions(-)

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)
Re: pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

Robert Haas <rhaas@postgresql.org> writes:

When VACUUM or ANALYZE skips a concurrently dropped table, log it.

When this went in, I was pretty skeptical of the value of an isolation
test for it, but said nothing. However, I now observe that the isolation
test is falling over on buildfarm machines with -DCLOBBER_CACHE_ALWAYS.
The buildfarm reports are a bit hard to interpret, but it's easy to
reproduce locally, and what I get is

$ more output_iso/regression.diffs
*** /home/postgres/pgsql/src/test/isolation/expected/vacuum-concurrent-drop.out
Mon Dec  4 17:02:55 2017
--- /home/postgres/pgsql/src/test/isolation/output_iso/results/vacuum-concurrent-drop.out       Wed Dec  6 12:07:37 2017
***************
*** 49,54 ****
--- 49,55 ----
        COMMIT;

step analyze_all: <... completed>
+ error in steps drop_and_commit analyze_all: ERROR: canceling statement due to user request

starting permutation: lock vac_analyze_specified drop_and_commit
step lock:

======================================================================

What appears to be happening is that a database-wide ANALYZE takes more
than a minute under CLOBBER_CACHE_ALWAYS, causing isolationtester.c's
hardwired one-minute timeout to trigger.

While you could imagine doing something to get around that, I do not
believe that this test is worth memorializing in perpetuity to begin
with. I'd recommend just taking it out again.

regards, tom lane

#3Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#2)
Re: pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

On 12/6/17, 11:57 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

Robert Haas <rhaas@postgresql.org> writes:

When VACUUM or ANALYZE skips a concurrently dropped table, log it.

When this went in, I was pretty skeptical of the value of an isolation
test for it, but said nothing. However, I now observe that the isolation
test is falling over on buildfarm machines with -DCLOBBER_CACHE_ALWAYS.
The buildfarm reports are a bit hard to interpret, but it's easy to
reproduce locally, and what I get is

$ more output_iso/regression.diffs
*** /home/postgres/pgsql/src/test/isolation/expected/vacuum-concurrent-drop.out
Mon Dec  4 17:02:55 2017
--- /home/postgres/pgsql/src/test/isolation/output_iso/results/vacuum-concurrent-drop.out       Wed Dec  6 12:07:37 2017
***************
*** 49,54 ****
--- 49,55 ----
COMMIT;

step analyze_all: <... completed>
+ error in steps drop_and_commit analyze_all: ERROR: canceling statement due to user request

starting permutation: lock vac_analyze_specified drop_and_commit
step lock:

======================================================================

What appears to be happening is that a database-wide ANALYZE takes more
than a minute under CLOBBER_CACHE_ALWAYS, causing isolationtester.c's
hardwired one-minute timeout to trigger.

Thanks for digging into this.

While you could imagine doing something to get around that, I do not
believe that this test is worth memorializing in perpetuity to begin
with. I'd recommend just taking it out again.

While the current version of the test is clearly broken, I thought
Robert made a pretty strong argument regarding the value of the test
[0]: /messages/by-id/CA+TgmobH17W=WdduhXJhxdwHAeTazNp7MDP=k0p=2w1nuSSruw@mail.gmail.com
lines of code is not worth the extra work needed to maintain the
isolation test. I’m not strongly opinionated either way, but I lean
towards wanting to keep the test around.

Perhaps this could be fixed by modifying the database-wide cases to
use partitioned tables instead. The individual partitions will not
have RangeVars specified, so it would cover the case when logging
should be skipped.

Nathan

[0]: /messages/by-id/CA+TgmobH17W=WdduhXJhxdwHAeTazNp7MDP=k0p=2w1nuSSruw@mail.gmail.com

#4Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

On Wed, Dec 6, 2017 at 12:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

What appears to be happening is that a database-wide ANALYZE takes more
than a minute under CLOBBER_CACHE_ALWAYS, causing isolationtester.c's
hardwired one-minute timeout to trigger.

While you could imagine doing something to get around that, I do not
believe that this test is worth memorializing in perpetuity to begin
with. I'd recommend just taking it out again.

Mumble. I don't really mind that, but I'll bet $0.05 that this will
get broken at some point and we won't notice right away without the
isolation test.

Is it really our policy that no isolation test can take more than a
minute on the slowest buildfarm critter? If somebody decides to start
running CLOBBER_CACHE_ALWAYS on an even-slower critter, will we just
nuke isolation tests from orbit until the tests pass there? I have
difficulty seeing that as a sound approach.

Another thought is that it might not be necessary to have a
database-wide ANALYZE to trigger this. I managed to reproduce it
locally by doing VACUUM a, b while alternately locking a and b, so
that I let the name lookups complete, but then blocked trying to
vacuum a, and then at that point dropped b, then released the VACUUM.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#3)
Re: pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

On 12/6/17, 1:23 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:

Perhaps this could be fixed by modifying the database-wide cases to
use partitioned tables instead. The individual partitions will not
have RangeVars specified, so it would cover the case when logging
should be skipped.

This fixed the problem for me. I've attached a patch.

Nathan

Attachments:

fix_vacuum_concurrent_drop_test_v1.patchapplication/octet-stream; name=fix_vacuum_concurrent_drop_test_v1.patchDownload+42-42
#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)
Re: pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Dec 6, 2017 at 12:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

What appears to be happening is that a database-wide ANALYZE takes more
than a minute under CLOBBER_CACHE_ALWAYS, causing isolationtester.c's
hardwired one-minute timeout to trigger.

Is it really our policy that no isolation test can take more than a
minute on the slowest buildfarm critter?

Well, I think it's a minute per query not per whole test script. But in
any case, if it's taking a longer time than any other isolation test on
the CLOBBER_CACHE_ALWAYS critters, then it's also taking a proportionately
longer time than any other test on every other platform, and is therefore
costing every developer precious time today and indefinitely far into the
future. I continue to say that this test ain't worth it.

It's possible that we could compromise on dropping the steps that test
whole-database VACUUM/ANALYZE; the incremental gain from testing those
scenarios is certainly even less worth its cost than the basic cases.

regards, tom lane

#7Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

On Wed, Dec 6, 2017 at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, I think it's a minute per query not per whole test script. But in
any case, if it's taking a longer time than any other isolation test on
the CLOBBER_CACHE_ALWAYS critters, then it's also taking a proportionately
longer time than any other test on every other platform, and is therefore
costing every developer precious time today and indefinitely far into the
future. I continue to say that this test ain't worth it.

Sure. But, you also continue to not respond to my arguments about why
it IS worth it. I don't want to spend a lot of time fighting about
this, but it looks to me like your preferences here are purely
arbitrary. Yesterday, you added - without discussion - a test that I
had "obviously" left out by accident. Today, you want a test removed
that I added on purpose but which you assert has insufficient value.
So, sometimes you think it's worth adding tests that make the test
suite longer, and other times you think it isn't. That's fair enough
-- everyone comes down in different places on this at different times
-- but the only actual reason you've offered is that the script
contains a command that runs for over a minute on very slow machines
that have been artificially slowed down 100x. That's a silly reason:
it means that on real machines we're talking less than a second of
runtime even without modifying the test case, and if we do modify the
test, it can probably be made much less.

Please give me a little time to see if I can speed up this test enough
to fix this problem. If that doesn't work out, then we can rip this
out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#7)
Re: pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

On 12/6/17, 8:25 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

Please give me a little time to see if I can speed up this test enough
to fix this problem. If that doesn't work out, then we can rip this
out.

Just in case it got missed earlier, here’s a patch that speeds it
up enough to pass with CLOBBER_CACHE_ALWAYS enabled. Instead of
doing database-wide operations, it just uses a partitioned table.

Nathan

Attachments:

fix_vacuum_concurrent_drop_test_v1.patchapplication/octet-stream; name=fix_vacuum_concurrent_drop_test_v1.patchDownload+42-42
#9Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#8)
Re: pgsql: When VACUUM or ANALYZE skips a concurrently dropped table, log i

On Wed, Dec 6, 2017 at 9:42 PM, Bossart, Nathan <bossartn@amazon.com> wrote:

On 12/6/17, 8:25 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

Please give me a little time to see if I can speed up this test enough
to fix this problem. If that doesn't work out, then we can rip this
out.

Just in case it got missed earlier, here’s a patch that speeds it
up enough to pass with CLOBBER_CACHE_ALWAYS enabled. Instead of
doing database-wide operations, it just uses a partitioned table.

Yeah, that looks like a reasonable approach to try. Committed, thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company