Isolation tests still falling over routinely

Started by Tom Lane over 14 years ago | 6 messages | hackers
#1Tom Lane
tgl@sss.pgh.pa.us

The buildfarm is still showing isolation test failures more days than
not, eg
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-09-17%2012%3A43%3A11
and I've personally seen such failures when testing with
CLOBBER_CACHE_ALWAYS. Could we please fix those tests to not have such
fragile timing assumptions?

regards, tom lane

#2Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#1)
Re: Isolation tests still falling over routinely

Tom Lane wrote:

The buildfarm is still showing isolation test failures more days
than not, eg

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-09-17%2012%3A43%3A11

and I've personally seen such failures when testing with
CLOBBER_CACHE_ALWAYS. Could we please fix those tests to not have
such fragile timing assumptions?

I went back over two months, and only found one failure related to an
SSI test, and that was because the machine ran out of disk space.
There should never be any timing-related failures on the SSI tests,
as there is no blocking or deadlocking.

If you have seen any failures on isolation tests other than the fk-*
tests, I'd be very interested in details.

The rest are not related to SSI but test deadlock conditions related
to foreign keys. My only involvement with these was to provide
alternate result files for REPEATABLE READ and SERIALIZABLE
isolation levels. (I test the installcheck-world target and the
isolation tests in those modes frequently, and the fk-deadlock tests
were failing every time at those levels.)

If I remember right, Alvaro chose these timings to balance run time
against chance of failure. Unless we want to remove these deadlock
handling tests or ignore failures (which both seem like bad ideas to
me), I think we need to bump the long timings by an order of
magnitude and just concede that those tests run for a while.

-Kevin
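
The trade-off described above (a longer fixed delay lowers the chance of a spurious failure but always costs run time) can be illustrated with a minimal Python sketch. This is not code from the PostgreSQL isolation tester; the function name `ready_after_fixed_wait` and all timing values are hypothetical, chosen only to show how a delay tuned for a fast machine can fail on a slow one, and how a roughly tenfold longer delay passes at the price of a longer run.

```python
import threading
import time


def ready_after_fixed_wait(startup_s: float, wait_s: float) -> bool:
    """Start a simulated session, wait a fixed time, report whether it was ready.

    `startup_s` is how long the (hypothetical) session needs to reach the
    state the test expects; `wait_s` is the hard-coded delay in the test.
    """
    ready = threading.Event()

    def session() -> None:
        time.sleep(startup_s)   # slow machines need longer here
        ready.set()

    worker = threading.Thread(target=session)
    worker.start()
    time.sleep(wait_s)          # the test's timing assumption
    observed = ready.is_set()   # what the test would see at this instant
    worker.join()               # clean up before returning
    return observed


if __name__ == "__main__":
    for startup_s in (0.0, 0.2):        # fast vs. slow machine (assumed values)
        for wait_s in (0.05, 0.5):      # short delay vs. 10x longer delay
            ok = ready_after_fixed_wait(startup_s, wait_s)
            print(f"startup={startup_s}s wait={wait_s}s -> {'pass' if ok else 'FAIL'}")
```

With these assumed numbers, only the combination of a slow session and the short delay is expected to fail, while every run with the longer delay pays the full wait even when the session was ready almost immediately.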

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kevin Grittner (#2)
Re: Isolation tests still falling over routinely

Excerpts from Kevin Grittner's message of mar sep 20 22:51:39 -0300 2011:

If I remember right, Alvaro chose these timings to balance run time
against chance of failure. Unless we want to remove these deadlock
handling tests or ignore failures (which both seem like bad ideas to
me), I think we need to bump the long timings by an order of
magnitude and just concede that those tests run for a while.

The main problem I have is that I haven't found a way to reproduce the
problems on my machine. I was playing with modifying the way the error
messages are reported, but that ended up unfinished in a local branch.

I'll give it a go once more and see if I can commit something so that
the buildfarm tells us whether it works or not.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#3)
Re: Isolation tests still falling over routinely

Alvaro Herrera <alvherre@commandprompt.com> writes:

The main problem I have is that I haven't found a way to reproduce the
problems on my machine.

Try -DCLOBBER_CACHE_ALWAYS.

regards, tom lane

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: Isolation tests still falling over routinely

Excerpts from Tom Lane's message of mar sep 20 21:30:42 -0300 2011:

The buildfarm is still showing isolation test failures more days than
not, eg
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-09-17%2012%3A43%3A11
and I've personally seen such failures when testing with
CLOBBER_CACHE_ALWAYS. Could we please fix those tests to not have such
fragile timing assumptions?

The fix has now been installed for two weeks and no new failure has
occurred. The only failure in the IsolationCheck phase since then was
caused by a disk filling up (and it wasn't in the fk-* tests anyway).
I think we can consider this issue fixed.


#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: Isolation tests still falling over routinely

Alvaro Herrera <alvherre@commandprompt.com> writes:

Excerpts from Tom Lane's message of mar sep 20 21:30:42 -0300 2011:

Could we please fix those tests to not have such
fragile timing assumptions?

The fix has now been installed for two weeks and no new failure has
occurred. The only failure in the IsolationCheck phase since then was
caused by a disk filling up (and it wasn't in the fk-* tests anyway).
I think we can consider this issue fixed.

Yeah, it looks good. Thanks!

regards, tom lane