Recording test runtimes with the buildfarm
Hi,
I've just had some thoughts about the possible usefulness of having
the buildfarm record the run-time of each regression test to allow us
to track the run-time history of each
test.
I thought the usefulness might be two-fold:
1. We could quickly identify when someone adds some overly complex
test and slows down the regression tests too much.
2. We might get some faster insight into performance regressions.
I can think of three reasons that a test might slow down.
a) Someone adds some new tests within the test file.
b) Actual performance regression in Postgres
c) Animal busy with other work.
We likely could do a semi-decent job of telling a) and b) apart by
just recording the latest commit that changed the .sql file for the
test. We could also likely see when c) is at play by the results
returning to normal a few runs after some spike. We'd only
want to pay attention to consistent slowdowns. Perhaps there would be
too much variability with the parallel tests, but maybe we could just
record it for the serial tests in make check-world.
I only thought of this after reading [1]. If we went ahead with that,
as of now, it feels like someone could quite easily break that
optimisation and nobody would notice for a long time.
I admit to not having looked at the buildfarm code to determine how
practical such a change would be. I've assumed there is a central
database that stores all the results.
David
[1]: /messages/by-id/CAJ3gD9eEXJ2CHMSiOehvpTZu3Ap2GMi5jaXhoZuW=3XJLmZWpw@mail.gmail.com
Greetings,
* David Rowley (dgrowleyml@gmail.com) wrote:
1. We could quickly identify when someone adds some overly complex
test and slows down the regression tests too much.
Sure, makes sense to me. We do track the individual 'stage_duration'
but we don't track things down to a per-regression-test basis. To do
that, I think we'd need the regression system to spit that kind of
detailed information out somewhere (in a structured format) that the
buildfarm client could then pick up and send to the server
to write into an appropriate table.
2. We might get some faster insight into performance regressions.
There's some effort going into continuing to build out a "performance"
farm, whose specific goal is to try to help with exactly this issue. Trying
to do that with the buildfarm has the challenge that many of those
systems aren't dedicated and therefore the timing could vary wildly
between runs due to things entirely independent of our code.
Would certainly be great to have more people working on that. Currently
it's primarily Ilaria (GSoC student), Ads and I.
Current repo is https://github.com/PGPerfFarm/pgperffarm if folks want
to look at it, but we're in the process of making some pretty serious
changes, so now might not be the best time to look at the code. We're
coordinating on the 'Postgresteam' slack in #perffarm for anyone
interested tho.
I only thought of this after reading [1]. If we went ahead with that,
as of now, it feels like someone could quite easily break that
optimisation and nobody would notice for a long time.
One of the goals with the perffarm is to be able to support different
types of benchmarks, beyond just pgbench, so that we'd be able to add a
benchmark for "numeric", perhaps, or maybe create a script with pgbench
that ends up being heavy on numerics or such.
I admit to not having looked at the buildfarm code to determine how
practical such a change would be. I've assumed there is a central
database that stores all the results.
Yes, there's a central database where results are pushed, and that's what
you see when you go to buildfarm.postgresql.org; there's also an archive
server which has the logs going much farther back (and is quite a bit
larger, of course...).
Thanks,
Stephen
On 6/10/20 8:58 AM, David Rowley wrote:
I admit to not having looked at the buildfarm code to determine how
practical such a change would be. I've assumed there is a central
database that stores all the results.
David,
Yes we have a central database. But we don't have anything that digs
into each step. In fact, on the server side the code isn't even really
aware that it's Postgres that's being tested.
The basic structures are:
buildsystems - one row per animal
build_status - one row per run
build_status_log - one row per step within each run
The last table has a blob containing the output of the step (e.g. "make
check") plus any relevant files (like the postmaster.log).
But we don't have any structure corresponding to an individual
regression test.
Architecturally, we could add a table for named times, and have the
client extract and report them.
Alternatively, people with access to the database could extract the logs
and post-process them using perl or python. That would involve no work
on my part :-) But it would not be automated.
What we do record (in build_status_log) is the time each step took. So
any regression test that suddenly blew out should likewise cause a
blowout in the time the whole "make check" took. It might be possible to
create a page that puts stats like that up. I think that's probably a
better way to go.
Tying results to an individual commit is harder. There could be dozens
of commits that happened between the current run and the previous run on
a given animal. But usually what has caused any event is fairly clear
when you look at it.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
Alternatively, people with access to the database could extract the logs
and post-process them using perl or python. That would involve no work
on my part :-) But it would not be automated.
Yeah, we could easily extract per-test-script runtimes, since pg_regress
started to print those. But ...
What we do record (in build_status_log) is the time each step took. So
any regression test that suddenly blew out should likewise cause a
blowout in the time the whole "make check" took.
I have in the past scraped the latter results and tried to make sense of
them. They are *mighty* noisy, even when considering just one animal
that I know to be running on a machine with little else to do. Maybe
averaging across the whole buildfarm could reduce the noise level, but
I'm not very hopeful. Per-test-script times would likely be even
noisier (ISTM anyway, maybe I'm wrong).
The entire reason we've been discussing a separate performance farm
is the expectation that buildfarm timings will be too noisy to be
useful to detect any but the most obvious performance effects.
regards, tom lane
On 6/10/20 10:13 AM, Tom Lane wrote:
The entire reason we've been discussing a separate performance farm
is the expectation that buildfarm timings will be too noisy to be
useful to detect any but the most obvious performance effects.
Yes, but will the performance farm be testing regression timings?
Maybe we're going to need several test suites, one of which could be
regression tests. But the regression tests themselves are not really
intended for performance testing.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,
* Andrew Dunstan (andrew.dunstan@2ndquadrant.com) wrote:
Yes, but will the performance farm be testing regression timings?
We are not currently envisioning that, no.
Maybe we're going to need several test suites, one of which could be
regression tests. But the regression tests themselves are not really
intended for performance testing.
Agree with this; better would be tests which are specifically written to
test performance instead.
Thanks,
Stephen
On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I have in the past scraped the latter results and tried to make sense of
them. They are *mighty* noisy, even when considering just one animal
that I know to be running on a machine with little else to do.
Do you recall if you looked at the parallel results or the serially
executed ones?
I imagine that the parallel ones will have much more noise since we
run the tests up to 20 at a time. Probably none, or at most not
many, of the animals have enough CPU cores to avoid a lot of context
switching during the parallel runs. I thought the serial
ones would be better, but I had no idea whether they'd be good enough
to be useful.
David
On Thu, Jun 11, 2020 at 2:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I have in the past scraped the latter results and tried to make sense of
them. They are *mighty* noisy, even when considering just one animal
that I know to be running on a machine with little else to do. Maybe
averaging across the whole buildfarm could reduce the noise level, but
I'm not very hopeful. Per-test-script times would likely be even
noisier (ISTM anyway, maybe I'm wrong).
I've been doing that in a little database that pulls down the results
and analyses them with primitive regexes. First I wanted to know the
pass/fail history for each individual regression, isolation and TAP
script, then I wanted to build something that could identify tests
that are 'flapping', and work out when they started and stopped
flapping etc. I soon realised it was all too noisy, but then I
figured that I could fix that by detecting crashes. So I classify
every top level build farm run as SUCCESS, FAILURE or CRASH. If the
top level run was CRASH, then I can disregard the individual per
script results, because they're all BS.
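Roughly, the classification is just a regex or two over the captured logs.
As a sketch (the table, column names and crash signatures here are only
illustrative, not anything the buildfarm actually provides):
-- one row per scraped buildfarm run, with the raw step output attached
-- create table scraped_run (animal text, snapshot timestamptz, branch text, log_text text);
select animal, snapshot, branch,
       case when log_text ~ 'PANIC:|TRAP:|server closed the connection unexpectedly'
            then 'CRASH'
            when log_text ~ '\.\.\. FAILED|not ok'
            then 'FAILURE'
            else 'SUCCESS'
       end as run_status
from scraped_run;
Anything classified as CRASH then has its per-script results thrown away.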
David Rowley <dgrowleyml@gmail.com> writes:
On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I have in the past scraped the latter results and tried to make sense of
them. They are *mighty* noisy, even when considering just one animal
that I know to be running on a machine with little else to do.
Do you recall if you looked at the parallel results or the serially
executed ones?
I imagine that the parallel ones will have much more noise since we
run the tests up to 20 at a time. Probably none, or at most not
many, of the animals have enough CPU cores to avoid a lot of context
switching during the parallel runs. I thought the serial
ones would be better, but I had no idea whether they'd be good enough
to be useful.
I can't claim to recall specifically, but I agree with your theory
about that, so I probably looked at the serial-schedule case.
Note that this is moot for animals using use_installcheck_parallel
... but it looks like that's still just a minority of them.
regards, tom lane
On Thu, Jun 11, 2020 at 9:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
So I classify
every top level build farm run as SUCCESS, FAILURE or CRASH. If the
top level run was CRASH, then I can disregard the individual per
script results, because they're all BS.
With more coffee I realise that you were talking about noise times,
not noisy pass/fail results. But I still want to throw that idea out
there, if we're considering analysing the logs.
Hi,
On June 10, 2020 2:13:51 PM PDT, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I have in the past scraped the latter results and tried to make sense
of
them. They are *mighty* noisy, even when considering just one animal
that I know to be running on a machine with little else to do.Do you recall if you looked at the parallel results or the serially
executed ones?I imagine that the parallel ones will have much more noise since we
run the tests up to 20 at a time. I imagine probably none, or at most
not many of the animals have enough CPU cores not to be context
switching a lot during those the parallel runs. I thought the serial
ones would be better but didn't have an idea of they'd be good enough
to be useful.
I'd assume that a rolling average (maybe 10 runs or so) would hide noise enough to see at least some trends even for parallel runs.
We should be able to prototype this with a few queries over the bf database, right?
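For example, assuming per-test timings were scraped into a table somewhere
(say animal, testname, run timestamp and milliseconds; not something the bf
database has today), it's just a window function:
-- hypothetical scraped table, names for illustration only
-- create table test_timing (animal text, testname text, run_time timestamptz, ms int);
select animal, testname, run_time, ms,
       avg(ms) over (partition by animal, testname
                     order by run_time
                     rows between 9 preceding and current row) as rolling_avg_10
from test_timing
order by animal, testname, run_time;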
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Thomas Munro <thomas.munro@gmail.com> writes:
I've been doing that in a little database that pulls down the results
and analyses them with primitive regexes. First I wanted to know the
pass/fail history for each individual regression, isolation and TAP
script, then I wanted to build something that could identify tests
that are 'flapping', and work out when they started and stopped
flapping etc. I soon realised it was all too noisy, but then I
figured that I could fix that by detecting crashes. So I classify
every top level build farm run as SUCCESS, FAILURE or CRASH. If the
top level run was CRASH, then I can disregard the individual per
script results, because they're all BS.
If you can pin the crash on a particular test script, it'd be useful
to track that as a kind of failure. In general, though, both crashes
and non-crash failures tend to cause collateral damage to later test
scripts --- if you can't filter that out then the later scripts will
have high false-positive rates.
regards, tom lane
On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If you can pin the crash on a particular test script, it'd be useful
to track that as a kind of failure. In general, though, both crashes
and non-crash failures tend to cause collateral damage to later test
scripts --- if you can't filter that out then the later scripts will
have high false-positive rates.
I guess the fact that you've both needed to do analysis on individual
tests shows that there might be a call for this beyond just recording
the test's runtime.
If we had a table that stored the individual test details (pass/fail)
and the timing information along with that, then, even if
the timing was unstable, it could still be useful for some analysis.
I'd be happy enough even if that was only available as a csv file
download. I imagine the buildfarm does not need to provide us with
any tools for doing analysis on this. Ideally, there would be some
run_id that would let us link it back to the test run, which would give us
the commit SHA and the animal that it ran on. Joining to details
about the animal could be useful too, e.g. perhaps a certain test
always fails on 32-bit machines.
I suppose that maybe we could modify pg_regress to add a command line
option to have it write out a machine-readable file, e.g:
testname,result,runtime\n, then just have the buildfarm client ship
that off to the buildfarm server to record in the database.
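For illustration, the file could be as simple as one line per test, e.g.
"boolean,ok,123", and the server side could just COPY it into a table,
something like this (table and column names entirely made up):
-- sketch only; the real table would need to tie back to the run somehow
create table test_result (
    run_id   int,       -- links to the run, giving animal, commit SHA, etc.
    testname text,
    result   text,      -- e.g. 'ok' or 'FAILED'
    ms       int
);
-- copy test_result (testname, result, ms) from 'regress_times.csv' with (format csv);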
David
On 6/10/20 6:00 PM, Andres Freund wrote:
I'd assume that a rolling average (maybe 10 runs or so) would hide noise enough to see at least some trends even for parallel runs.
We should be able to prototype this with a few queries over the bf database, right?
This seems to me like a perfect use case for control charts:
https://en.wikipedia.org/wiki/Control_chart
They are designed specifically to detect systematic changes in an environment
with random noise.
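As a very rough sketch, assuming per-test timings were scraped into a table
somewhere (they aren't today; the names below are only illustrative), the
classic 3-sigma control limits are easy to compute:
-- flag runs falling outside mean +/- 3 * stddev for that animal/test
-- (a real control chart would derive the limits from a baseline window,
--  not from the whole history)
with limits as (
    select animal, testname,
           avg(ms) as center,
           stddev_samp(ms) as sigma
    from test_timing
    group by animal, testname
)
select t.animal, t.testname, t.run_time, t.ms
from test_timing t
join limits l using (animal, testname)
where l.sigma > 0
  and abs(t.ms - l.center) > 3 * l.sigma
order by t.animal, t.testname, t.run_time;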
Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
Greetings,
* David Rowley (dgrowleyml@gmail.com) wrote:
I suppose that maybe we could modify pg_regress to add a command line
option to have it write out a machine-readable file, e.g:
testname,result,runtime\n, then just have the buildfarm client ship
that off to the buildfarm server to record in the database.
That seems like it'd be the best approach to me, though I'd defer to
Andrew on it.
By the way, if you'd like access to the buildfarm archive server where
all this stuff is stored, that can certainly be arranged, just let me
know.
Thanks,
Stephen
On 6/11/20 10:21 AM, Stephen Frost wrote:
* David Rowley (dgrowleyml@gmail.com) wrote:
I suppose that maybe we could modify pg_regress to add a command line
option to have it write out a machine-readable file, e.g:
testname,result,runtime\n, then just have the buildfarm client ship
that off to the buildfarm server to record in the database.
That seems like it'd be the best approach to me, though I'd defer to
Andrew on it.
Yeah, we'll need to work out where to stash the file. The client will
pick up anything in src/regress/log for "make check", but would need
adjusting for other steps that invoke pg_regress. I'm getting close to
cutting a new client release, but I can delay it till we settle this.
On the server side, we could add a table with a key of <animal,
snapshot, branch, step, testname> but we'd need to make sure those test
names were unique. Maybe we need a way of telling pg_regress to prepend
a module name (e.g. btree_gist or plperl) to the test name.
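As a sketch only (column types are guesses, and it would need to tie back to
the existing build_status rows somehow):
create table test_times (
    animal   text,
    snapshot timestamptz,
    branch   text,
    step     text,      -- e.g. 'check' or 'install-check-C'
    testname text,      -- possibly module-qualified, e.g. 'btree_gist/int4'
    result   text,
    ms       int,
    primary key (animal, snapshot, branch, step, testname)
);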
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
Yeah, we'll need to work out where to stash the file. The client will
pick up anything in src/regress/log for "make check", but would need
adjusting for other steps that invoke pg_regress. I'm getting close to
cutting a new client release, but I can delay it till we settle this.
On the server side, we could add a table with a key of <animal,
snapshot, branch, step, testname> but we'd need to make sure those test
names were unique. Maybe we need a way of telling pg_regress to prepend
a module name (e.g. btree_gist or plperl) to the test name.
It seems pretty trivial to, for example, get all the steps out of check.log
and their timing with a regexp. I just used '^(?:test)?\s+(\S+)\s+\.\.\.
ok\s+(\d+) ms$' as the regexp. Running that against a few hundred build
runs in the db generally looks fine, though I didn't look into it in
detail.
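Roughly, that amounts to something like the following, assuming the step
output is sitting in a text column (the names here are just how I happen to
have the data locally, not the real buildfarm schema):
select r.m[1] as testname, r.m[2]::int as ms
from check_logs l
     cross join lateral regexp_matches(l.log_text,
                    '^(?:test)?\s+(\S+)\s+\.\.\. ok\s+(\d+) ms$',
                    'gn') as r(m);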
Of course, that only looked at check.log, and more logic would be needed if
we want to look into the other areas as well, but as long as it's
pg_regress output I think it should be easy?
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes:
It seems pretty trivial to, for example, get all the steps out of check.log
and their timing with a regexp.
Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests. New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.
regards, tom lane
On Thu, Jun 11, 2020 at 6:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests. New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.
So, spending a few minutes looking at my data, it is not quite that easy if
we want to look at contrib checks, for example. Both btree_gin and
btree_gist have checks called "int4", for example, and the aforementioned
regexp won't pick that up. But that's surely a fixable problem.
And perhaps we should at least start off with data for just "make check" to see
if it's useful, before spending too much effort?
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On 6/11/20 12:32 PM, Tom Lane wrote:
Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests. New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.
OK, that makes my life fairly easy. Are you wanting this to be
automated, or just left as an exercise for researchers for now?
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
On 6/11/20 12:32 PM, Tom Lane wrote:
Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests. New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.
OK, that makes my life fairly easy. Are you wanting this to be
automated, or just left as an exercise for researchers for now?
I'm envisioning the latter --- I think somebody should prove that
useful results can be obtained before we spend any effort on making it
easier to gather the input numbers. Mind you, I'm not saying that
I don't believe it's possible to get good results. But building
infrastructure in advance of a solid use-case is a recipe for
building the wrong thing.
regards, tom lane
Hello Tom,
I have in the past scraped the latter results and tried to make sense of
them. They are *mighty* noisy, even when considering just one animal
that I know to be running on a machine with little else to do. Maybe
averaging across the whole buildfarm could reduce the noise level, but
I'm not very hopeful.
I'd try with median instead of average, so that bad cases due to animal
overloading are ignored.
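A minimal sketch, assuming some table of scraped per-test timings (names
illustrative only):
-- median run time per animal/test; one overloaded run barely moves it
select animal, testname,
       percentile_cont(0.5) within group (order by ms) as median_ms
from test_timing
group by animal, testname;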
--
Fabien.
On Fri, 12 Jun 2020 at 04:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests. New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.
Looking at the data mentioned in the logs for install-check-C, it
seems some animals are quite stable with their timings but others are
quite unstable.
Taking the lowest half of the timings for each animal/test combination,
where the animal ran the test at least 50 times, the top 3 animals
with the most consistent timing are:
postgres=# select animal, avg(stddev)
  from (select animal, testname, stddev(ms)
          from (select animal, testname, unnest(ar[:array_length(ar,1)/2]) ms
                  from (select animal, testname, array_agg(ms order by ms) ar
                          from run_details group by animal, testname) a) b
         group by 1, 2 having count(*) > 50) c
 group by 1 order by avg(stddev) limit 3;
animal | avg
------------+-------------------
mule | 4.750935819647279
massasauga | 5.410419286413067
eider | 6.06834118301505
(3 rows)
And the least consistent 3 are:
postgres=# select animal, avg(stddev)
  from (select animal, testname, stddev(ms)
          from (select animal, testname, unnest(ar[:array_length(ar,1)/2]) ms
                  from (select animal, testname, array_agg(ms order by ms) ar
                          from run_details group by animal, testname) a) b
         group by 1, 2 having count(*) > 50) c
 group by 1 order by avg(stddev) desc limit 3;
animal | avg
----------+-------------------
lorikeet | 830.9292062818336
gharial | 725.9874198764217
dory | 683.5182140482121
(3 rows)
If I look at a random test from mule:
postgres=# select commit,time,result,ms from run_details where animal
= 'mule' and testname = 'join' order by time desc limit 10;
commit | time | result | ms
---------+---------------------+--------+-----
e532b1d | 2020-06-15 19:30:03 | ok | 563
7a3543c | 2020-06-15 15:30:03 | ok | 584
3baa7e3 | 2020-06-15 11:30:03 | ok | 596
47d4d0c | 2020-06-15 07:30:03 | ok | 533
decbe2b | 2020-06-14 15:30:04 | ok | 535
378badc | 2020-06-14 07:30:04 | ok | 563
23cbeda | 2020-06-13 19:30:04 | ok | 557
8f5b596 | 2020-06-13 07:30:04 | ok | 553
6472572 | 2020-06-13 03:30:04 | ok | 580
9a7fccd | 2020-06-12 23:30:04 | ok | 561
(10 rows)
and from lorikeet:
postgres=# select commit,time,result,ms from run_details where animal
= 'lorikeet' and testname = 'join' order by time desc limit 10;
commit | time | result | ms
---------+---------------------+--------+------
47d4d0c | 2020-06-15 08:28:35 | ok | 8890
decbe2b | 2020-06-14 20:28:33 | ok | 8878
378badc | 2020-06-14 08:28:35 | ok | 8854
cc07264 | 2020-06-14 05:22:59 | ok | 8883
8f5b596 | 2020-06-13 10:36:14 | ok | 8942
2f48ede | 2020-06-12 20:28:41 | ok | 8904
ffd2582 | 2020-06-12 08:29:52 | ok | 2016
7aa4fb5 | 2020-06-11 23:21:26 | ok | 1939
ad9291f | 2020-06-11 09:56:48 | ok | 1924
c2bd1fe | 2020-06-10 23:01:42 | ok | 1873
(10 rows)
I can supply the data I used for this, just send me an offlist email.
It's about 19MB using bzip2.
I didn't look at the make check data and I do see some animals use the
parallel_schedule for make installcheck, which my regex neglected to
account for, so that data is missing from the set.
David