Improving the isolationtester: fewer failures, less delay
This is a followup to the conversation at [1]/messages/by-id/2527507.1598237598@sss.pgh.pa.us, in which we speculated
about constraining the isolationtester's behavior by annotating the
specfiles, in order to eliminate common buildfarm failures such as [2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-06-13%2016%3A31%3A57
and reduce the need to use long delays to stabilize the test results.
I've spent a couple days hacking on this idea, and I think it has worked
out really well. On my machine, the time needed for "make installcheck"
in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result
of removing all the multiple-second delays we used before. Also,
while I'm not fool enough to claim that this will reduce the rate of
bogus failures to zero, I do think it addresses all the repeating
failures we've seen lately.
In the credit-where-credit-is-due department, this owes some inspiration
to the patch Asim Praveen offered at [3]/messages/by-id/F8DC434A-9141-451C-857F-148CCA1D42AD@vmware.com, though this takes the idea a
good bit further.
This is still WIP to some extent, as I've not spent much time looking at
specfiles other than the ones with big delays; there may be additional
improvements possible in some places. Also, I've not worried about
whether the tests pass in serializable mode, since we have problems there
already [4]/messages/by-id/324309.1623722988@sss.pgh.pa.us. But this seemed like a good point at which to solicit
feedback and see what the cfbot thinks of it.
regards, tom lane
[1]: /messages/by-id/2527507.1598237598@sss.pgh.pa.us
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-06-13%2016%3A31%3A57
[3]: /messages/by-id/F8DC434A-9141-451C-857F-148CCA1D42AD@vmware.com
[4]: /messages/by-id/324309.1623722988@sss.pgh.pa.us
Attachments:
improve-isolationtester-1.patchtext/x-diff; charset=us-ascii; name=improve-isolationtester-1.patchDownload+1012-668
Hi,
On 2021-06-14 22:57:08 -0400, Tom Lane wrote:
This is a followup to the conversation at [1], in which we speculated
about constraining the isolationtester's behavior by annotating the
specfiles, in order to eliminate common buildfarm failures such as [2]
and reduce the need to use long delays to stabilize the test results.I've spent a couple days hacking on this idea, and I think it has worked
out really well. On my machine, the time needed for "make installcheck"
in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result
of removing all the multiple-second delays we used before.
Very cool stuff. All the reliability things aside, isolationtester
frequently is the slowest test in a parallel check world...
Also, while I'm not fool enough to claim that this will reduce the
rate of bogus failures to zero, I do think it addresses all the
repeating failures we've seen lately.
And it should make it easier to fix some others and also to make it
easier to write some tests that were too hard to get to reliable today.
This is still WIP to some extent, as I've not spent much time looking at
specfiles other than the ones with big delays; there may be additional
improvements possible in some places. Also, I've not worried about
whether the tests pass in serializable mode, since we have problems there
already [4]. But this seemed like a good point at which to solicit
feedback and see what the cfbot thinks of it.
Are there spec output changes / new failures, if you apply the patch,
but do not apply the changes to the spec files?
Will look at the patch itself in a bit.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2021-06-14 22:57:08 -0400, Tom Lane wrote:
This is still WIP to some extent, as I've not spent much time looking at
specfiles other than the ones with big delays; there may be additional
improvements possible in some places. Also, I've not worried about
whether the tests pass in serializable mode, since we have problems there
already [4]. But this seemed like a good point at which to solicit
feedback and see what the cfbot thinks of it.
Are there spec output changes / new failures, if you apply the patch,
but do not apply the changes to the spec files?
If you make only the code changes, there are a bunch of diffs stemming
from the removal of the 'error in steps' message prefix. If you just
mechanically remove those from the .out files without touching the .spec
files, most tests pass, but I don't recall whether that's 100% the case.
Will look at the patch itself in a bit.
I'll have a v2 in a little bit --- the cfbot pointed out that there
were some contrib tests I'd missed fixing, and I found a couple of
other improvements.
regards, tom lane
On 6/14/21 10:57 PM, Tom Lane wrote:
This is a followup to the conversation at [1], in which we speculated
about constraining the isolationtester's behavior by annotating the
specfiles, in order to eliminate common buildfarm failures such as [2]
and reduce the need to use long delays to stabilize the test results.I've spent a couple days hacking on this idea, and I think it has worked
out really well. On my machine, the time needed for "make installcheck"
in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result
of removing all the multiple-second delays we used before. Also,
while I'm not fool enough to claim that this will reduce the rate of
bogus failures to zero, I do think it addresses all the repeating
failures we've seen lately.In the credit-where-credit-is-due department, this owes some inspiration
to the patch Asim Praveen offered at [3], though this takes the idea a
good bit further.This is still WIP to some extent, as I've not spent much time looking at
specfiles other than the ones with big delays; there may be additional
improvements possible in some places. Also, I've not worried about
whether the tests pass in serializable mode, since we have problems there
already [4]. But this seemed like a good point at which to solicit
feedback and see what the cfbot thinks of it.
Cool stuff. Minor gripe while we're on $subject - I wish we'd rename it.
It's long outgrown the original purpose that gave it its name, and
keeping the name makes it unnecessarily obscure. Yes, I know Lisp still
has car and cdr, but we don't need to follow that example.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
I wrote:
I'll have a v2 in a little bit --- the cfbot pointed out that there
were some contrib tests I'd missed fixing, and I found a couple of
other improvements.
Here 'tis. This passes check-world, unlike v1 (mea culpa for not
checking that). I also cleaned up the variant expected-files,
so it's now no worse than HEAD as far as failures in serializable
mode go.
I played a bit more with insert-conflict-specconflict.spec, too.
It now seems proof against delays inserted anywhere in the
lock-acquiring subroutines.
regards, tom lane
Attachments:
improve-isolationtester-2.patchtext/x-diff; charset=us-ascii; name=improve-isolationtester-2.patchDownload+1049-700
Hi,
Only halfway related: I wonder if we should remove the automatic
permutation stuff - it's practically never useful. Probably not worth
changing...
On 2021-06-15 17:09:00 -0400, Tom Lane wrote:
+The general form of a permutation entry is + + "step_name" [ ( marker [ , marker ... ] ) ] + +where each marker defines a "blocking condition". The step will not be +reported as completed before all the blocking conditions are satisfied.
Minor suggestion: I think the folliwing would be a bit easier to read if
there first were a list of markers, and then separately the longer
descriptions. Right now it's a bit hard to see which paragraph
introduces a new type of marker, and which just adds further commentary.
+ /* + * Check for other steps that have finished. We must do this + * if oldstep completed; while if it did not, we need to poll + * all the active steps in hopes of unblocking oldstep. + */
Somehow I found the second sentence a bit hard to parse - I think it's
the "while ..." sounding a bit odd to me.
+ /* + * If the target session is still busy, apply a timeout to + * keep from hanging indefinitely, which could happen with + * incorrect blocker annotations. Use the same 2 * + * max_step_wait limit as try_complete_step does for deciding + * to die. (We don't bother with trying to cancel anything, + * since it's unclear what to cancel in this case.) + */ + if (iconn->active_step != NULL) + { + struct timeval current_time; + int64 td; + + gettimeofday(¤t_time, NULL); + td = (int64) current_time.tv_sec - (int64) start_time.tv_sec; + td *= USECS_PER_SEC; + td += (int64) current_time.tv_usec - (int64) start_time.tv_usec; + if (td > 2 * max_step_wait) + { + fprintf(stderr, "step %s timed out after %d seconds\n", + iconn->active_step->name, + (int) (td / USECS_PER_SEC)); + exit(1); + } + } + } }
It might be worth printing out the list of steps the being waited for
when reaching the timeout - it seems like it'd now be easier to end up
with a bit hard to debug specs. And one cannot necessarily look at
pg_locks or such anymore to debug em.
@@ -833,6 +946,19 @@ try_complete_step(TestSpec *testspec, Step *step, int flags)
}
}+ /* + * The step is done, but we won't report it as complete so long as there + * are blockers. + */ + if (step_has_blocker(pstep)) + { + if (!(flags & STEP_RETRY)) + printf("step %s: %s <waiting ...>\n", + step->name, step->sql); + return true; + }
Might be a bug in my mental state machine: Will this work correctly for
PSB_ONCE, where we'll already returned before?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
Only halfway related: I wonder if we should remove the automatic
permutation stuff - it's practically never useful. Probably not worth
changing...
Where it is useful, it saves a lot of error-prone typing ...
Minor suggestion: I think the folliwing would be a bit easier to read if
there first were a list of markers, and then separately the longer
descriptions. Right now it's a bit hard to see which paragraph
introduces a new type of marker, and which just adds further commentary.
OK, will do. Will act on your other cosmetic points too, tomorrow or so.
+ if (step_has_blocker(pstep)) + { + if (!(flags & STEP_RETRY)) + printf("step %s: %s <waiting ...>\n", + step->name, step->sql); + return true; + }
Might be a bug in my mental state machine: Will this work correctly for
PSB_ONCE, where we'll already returned before?
This bit ignores PSB_ONCE. Once we've dropped out of try_complete_step
the first time, PSB_ONCE is done affecting things. (I'm not in love
with that symbol name, if you have a better idea.)
regards, tom lane
Andres Freund <andres@anarazel.de> writes:
[ assorted review comments ]
Here's a v3 responding to your comments, plus some other cleanup:
* don't use C99-style declarations-in-for, to ease planned backpatch
* make use of (*) annotation in multiple-cic.spec, to get rid of
the need for a variant expected-file for it
regards, tom lane