Instability in TRUNCATE regression test

Started by Tom Lane · 11 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

Buildfarm member platypus is showing a regression failure that I'm
surprised we have not seen before:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=platypus&dt=2006-06-28%2014:05:01

Basically what this is showing is that when there is more than one
referencing table, the order in which things get done is dependent
on chance locations of system catalog entries. That results in
cosmetic differences in which of multiple violations gets reported,
or in the order of "truncate cascades to" notices.

Given our push to have the buildfarm "all green all the time",
I don't think I want to just live with occasional failures.
Seems like the alternatives are

1. Find a way to make the processing order consistent (eg by driving it
off OID ordering). Doesn't seem easy, but maybe I'm missing an idea.

2. Install multiple expected files for the truncate test.

3. Dumb down the test cases so that they don't test multiple-cascade
situations.

Don't much care for any of these :-(.

Also, it seems possible that not-so-cosmetic problems could occur, for
instance deadlock between two backends trying to truncate the same
tables in different orders. That suggests that answer #1 would be the
best way to fix it, but that would mean ordering the tables consistently
before we even get any locks on them, which seems hard.

Thoughts?

regards, tom lane

#2 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#1)
Re: Instability in TRUNCATE regression test

Tom Lane wrote:

1. Find a way to make the processing order consistent (eg by driving it
off OID ordering). Doesn't seem easy, but maybe I'm missing an idea.

Hmm, what about

1. get the complete list of tables to truncate, AccessShareLock'ed, get
their names
2. release locks
3. sort the list lexicographically (or by Oid, whatever)
4. acquire the stronger locks, in list order, taking care of not
aborting if a table is no longer there
5. truncate

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: Instability in TRUNCATE regression test

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

1. Find a way to make the processing order consistent (eg by driving it
off OID ordering). Doesn't seem easy, but maybe I'm missing an idea.

Hmm, what about

1. get the complete list of tables to truncate, AccessShareLock'ed, get
their names
2. release locks
3. sort the list lexicographically (or by Oid, whatever)
4. acquire the stronger locks, in list order, taking care of not
aborting if a table is no longer there
5. truncate

Releasing locks is no good ... what if someone adds/drops FK constraints
while you've not got any lock?

One thing I was toying with was to add an index to pg_constraint on,
say, (confrelid, conrelid), and to replace the existing seqscans for FK
constraints with scans using this index. The second-column ordering
would guarantee everybody visits the entries in the same order. Not
sure about overall performance implications ... in a small database,
several indexscans might take more time than one seqscan.

regards, tom lane

#4 Jim C. Nasby
jnasby@pervasive.com
In reply to: Tom Lane (#3)
Re: Instability in TRUNCATE regression test

On Wed, Jun 28, 2006 at 01:13:42PM -0400, Tom Lane wrote:

One thing I was toying with was to add an index to pg_constraint on,
say, (confrelid, conrelid), and to replace the existing seqscans for FK
constraints with scans using this index. The second-column ordering
would guarantee everybody visits the entries in the same order. Not
sure about overall performance implications ... in a small database,
several indexscans might take more time than one seqscan.

In a small database, both operations are likely to be plenty fast for
TRUNCATE, though. Surely the performance impact of getting the requisite
locks would far exceed any catalog scan times, no? And if you were doing
TRUNCATE's very often, I'd expect the right pages to be in cache
anyway...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#5 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: Instability in TRUNCATE regression test

Tom Lane wrote:

Buildfarm member platypus is showing a regression failure that I'm
surprised we have not seen before:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=platypus&dt=2006-06-28%2014:05:01

Basically what this is showing is that when there is more than one
referencing table, the order in which things get done is dependent
on chance locations of system catalog entries. That results in
cosmetic differences in which of multiple violations gets reported,
or in the order of "truncate cascades to" notices.

Given our push to have the buildfarm "all green all the time",
I don't think I want to just live with occasional failures.
Seems like the alternatives are

1. Find a way to make the processing order consistent (eg by driving it
off OID ordering). Doesn't seem easy, but maybe I'm missing an idea.

2. Install multiple expected files for the truncate test.

3. Dumb down the test cases so that they don't test multiple-cascade
situations.

Don't much care for any of these :-(.

Also, it seems possible that not-so-cosmetic problems could occur, for
instance deadlock between two backends trying to truncate the same
tables in different orders. That suggests that answer #1 would be the
best way to fix it, but that would mean ordering the tables consistently
before we even get any locks on them, which seems hard.

Thoughts?

If this were a significant risk wouldn't we have seen many such failures
before now? I guess we don't expect to see concurrent truncates being
run. Probably worth protecting against, but also probably something of a
corner case.

In the absence of a fix I'd go for the extra regression result file.

cheers

andrew

#6 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#3)
Re: Instability in TRUNCATE regression test

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

1. Find a way to make the processing order consistent (eg by driving it
off OID ordering). Doesn't seem easy, but maybe I'm missing an idea.

Hmm, what about

1. get the complete list of tables to truncate, AccessShareLock'ed, get
their names
2. release locks
3. sort the list lexicographically (or by Oid, whatever)
4. acquire the stronger locks, in list order, taking care of not
aborting if a table is no longer there
5. truncate

Releasing locks is no good ... what if someone adds/drops FK constraints
while you've not got any lock?

Recheck after acquiring the stronger locks, unlock and drop from list.

One thing I was toying with was to add an index to pg_constraint on,
say, (confrelid, conrelid), and to replace the existing seqscans for FK
constraints with scans using this index. The second-column ordering
would guarantee everybody visits the entries in the same order. Not
sure about overall performance implications ... in a small database,
several indexscans might take more time than one seqscan.

I think there is more than one place that would benefit from such an
index. Probably turn into a syscache as well?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#7 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Alvaro Herrera (#6)
Re: Instability in TRUNCATE regression test

Alvaro Herrera wrote:

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

1. Find a way to make the processing order consistent (eg by driving it
off OID ordering). Doesn't seem easy, but maybe I'm missing an idea.

Hmm, what about

1. get the complete list of tables to truncate, AccessShareLock'ed, get
their names
2. release locks
3. sort the list lexicographically (or by Oid, whatever)
4. acquire the stronger locks, in list order, taking care of not
aborting if a table is no longer there
5. truncate

Releasing locks is no good ... what if someone adds/drops FK constraints
while you've not got any lock?

Recheck after acquiring the stronger locks, unlock and drop from list.

Oops, this doesn't cover the "add FK constraints" case, only drop. I
think it would work to keep the locks on the tables initially mentioned
in the command (i.e. those not followed by CASCADE). Hmm, but it fails
if it cascades more than once, so scratch that.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: Instability in TRUNCATE regression test

Andrew Dunstan <andrew@dunslane.net> writes:

If this were a significant risk wouldn't we have seen many such failures
before now?

Hard to tell. It's possibly architecture-dependent, for one thing
(MAXALIGN will affect space availability). Since this happened in a
parallel regression run, it could also be a matter of timing relative to
the concurrent tests. I've often thought that we are not getting as much
mileage out of the parallel-testing facility as we could, because it's
really not exercising variations in timing all that much. It'd be
interesting to throw in a small random delay at the start of each member
of a concurrent set of tests.

regards, tom lane

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#6)
Re: Instability in TRUNCATE regression test

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

One thing I was toying with was to add an index to pg_constraint on,
say, (confrelid, conrelid), and to replace the existing seqscans for FK
constraints with scans using this index.

I think there is more than one place that would benefit from such an
index. Probably turn into a syscache as well?

Yeah, that was in the back of my mind too, but I haven't looked through
the code to see. A syscache wouldn't work because it's not a unique key.

regards, tom lane

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: Instability in TRUNCATE regression test

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Buildfarm member platypus is showing a regression failure that I'm
surprised we have not seen before:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=platypus&dt=2006-06-28%2014:05:01

If this were a significant risk wouldn't we have seen many such failures
before now?

mongoose just failed with almost the exact same symptoms:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mongoose&dt=2006-06-28%2021:30:02

It's not quite the same diffs, which is unsurprising given the presumed
mechanism behind the failure, but that probably shoots down the "add
another expected file" response.

I imagine some recent change has made the probability of this behavior
much higher than it was before; perhaps there's more pg_constraint
update traffic in concurrent tests? Anyway, it's now up to "must fix"
in my estimation. I'll look into the new-pg_constraint-index idea.

I think someone should also take a hard look at that idea of introducing
more timing variability into the parallel tests. Any volunteers?

regards, tom lane

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: Instability in TRUNCATE regression test

I wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

One thing I was toying with was to add an index to pg_constraint on,
say, (confrelid, conrelid), and to replace the existing seqscans for FK
constraints with scans using this index.

I think there is more than one place that would benefit from such an
index. Probably turn into a syscache as well?

Yeah, that was in the back of my mind too, but I haven't looked through
the code to see. A syscache wouldn't work because it's not a unique key.

Having looked through the code, the only two places that currently seem
to have any need for an index on confrelid are the two paths in TRUNCATE
that find/check for FK relationships. So I'm hesitant to add an index
just for that; seems like too much overhead to put onto all other
updates of pg_constraint.

What we can perhaps do instead is pull out the related OIDs (ie, a
function that given a rel OID returns a list of rels that have FK
dependencies on that rel) and then sort that list into OID order before
acting on it.

Note: the OID-sort-order concept is not perfect; if the OID counter were
to wrap around while the regression tests are running, you could get a
bogus failure of this type. That seems low enough probability to live
with, though. Anyway it'll never happen in the buildfarm's usage, since
buildfarm only runs the tests in freshly-initdb'd databases.

regards, tom lane