Running tests under valgrind is getting slower at an alarming pace

Started by Andres Freund · over 4 years ago · 10 messages
#1Andres Freund
andres@anarazel.de

Hi,

After a recent migration of the skink and a few other animals (sorry for the
false reports on BF, I forgot to adjust a path), I looked at the time it takes
to complete a valgrind run:

9.6: Consumed 4h 53min 18.518s CPU time
10: Consumed 5h 32min 50.839s CPU time
11: Consumed 7h 7min 17.455s CPU time
14: still going at 11h 51min 57.951s
HEAD: 14h 32min 29.571s CPU time

I changed it so that HEAD will be built in parallel, separately from the other
branches, so that HEAD gets results within a useful timeframe. But even with
that, the test times are increasing at a rate we're not going to be able to
keep up with.

Part of this is caused by a lot of tests running serially rather than in
parallel. I was pondering setting PROVE_FLAGS=-j5 or something to reduce the
impact of the TAP tests a bit.

Greetings,

Andres Freund

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#1)
Re: Running tests under valgrind is getting slower at an alarming pace

Andres Freund <andres@anarazel.de> writes:

After a recent migration of the skink and a few other animals (sorry for the
false reports on BF, I forgot to adjust a path), I looked at the time it takes
to complete a valgrind run:

9.6: Consumed 4h 53min 18.518s CPU time
10: Consumed 5h 32min 50.839s CPU time
11: Consumed 7h 7min 17.455s CPU time
14: still going at 11h 51min 57.951s
HEAD: 14h 32min 29.571s CPU time

I have observed similar slowdowns across versions on just-plain-slow
animals, too. A while ago (last December, I think), I tried enabling
--enable-tap-tests across the board on prairiedog, and observed
these buildfarm cycle times:

9.5 01:50:24
9.6 02:06:32
10 02:26:34
11 02:54:44
12 03:41:11
13 04:46:31
HEAD 04:49:04

I went back to not running TAP tests in the back branches :-(

prairiedog's latest HEAD run consumed 5:30, so it's gotten way
worse since December.

In the same comparison, my other animal longfin had gone from 14 to 18
minutes (and it's now up to 22 minutes). It's not clear to me whether
greater available parallelism (12 CPUs vs 1) is alone enough to
explain why the more modern machine isn't suffering so badly. As you
say, the TAP tests are not well parallelized at all, so that doesn't
seem to fit the facts.

In any case, it seems like we do need to be paying more attention to
how long it takes to do the TAP tests. We could try to shave more
cycles off the overhead, and we should think a little harder about
the long-term value of every test case we add.

regards, tom lane

#3Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: Running tests under valgrind is getting slower at an alarming pace

On Wed, Oct 6, 2021 at 1:57 AM Andres Freund <andres@anarazel.de> wrote:

After a recent migration of the skink and a few other animals (sorry for the
false reports on BF, I forgot to adjust a path), I looked at the time it takes
to complete a valgrind run:

9.6: Consumed 4h 53min 18.518s CPU time
10: Consumed 5h 32min 50.839s CPU time
11: Consumed 7h 7min 17.455s CPU time
14: still going at 11h 51min 57.951s
HEAD: 14h 32min 29.571s CPU time

I changed it so that HEAD will be built in parallel, separately from the other
branches, so that HEAD gets results within a useful timeframe. But even with
that, the test times are increasing at a rate we're not going to be able to
keep up with.

Is the problem here that we're adding a lot of new test cases? Or
is the problem that valgrind runs are getting slower for the same
number of test cases?

If it's taking longer because we have more test cases, I'm honestly
not sure that's really something we should try to fix. I mean, I'm
sure we have some bad test cases here and there, but overall I think
we still have too little test coverage, not too much. The recent
discovery that recovery_end_command had zero test coverage is one fine
example of that.

But if we've done something that increases the relative cost of
valgrind, maybe we can fix that in a centralized way.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#3)
Re: Running tests under valgrind is getting slower at an alarming pace

Hi,

On 2021-10-06 12:09:36 -0400, Robert Haas wrote:

Is the problem here that we're adding a lot of new test cases? Or
is the problem that valgrind runs are getting slower for the same
number of test cases?

I don't know precisely. It's probably a combination of several factors. I do
think we regressed somewhere around valgrind specifically - the leak origin
tracking in older branches seems to work better than in newer branches. But I
don't know if that affects performance.

If it's taking longer because we have more test cases, I'm honestly
not sure that's really something we should try to fix.

I'm not arguing for having fewer tests. But clearly, executing them serially
is problematic when the times are going up like this. Skink is hosted on a
machine with a CPU clocking around ~3.9GHz for most of the test run, so getting
a faster machine won't help that much. But most of the time only a few cores
are active.

This isn't just a problem with valgrind; the reporting times for other animals
aren't getting shorter either...

It takes my workstation 2min 20s to execute check-world in parallel, but > 16min
sequentially. The BF executes TAP tests sequentially...

I mean, I'm sure we have some bad test cases here and there, but overall I
think we still have too little test coverage, not too much. The recent
discovery that recovery_end_command had zero test coverage is one fine
example of that.

But if we've done something that increases the relative cost of
valgrind, maybe we can fix that in a centralized way.

There's probably some of that.

The fact that the TAP test infrastructure does all communication with the
server via psql invocations that each execute only a single query is a
problem - connection startup is expensive. I think this is particularly a
problem for things like PostgresNode::poll_query_until(), which is also used by
::wait_for_catchup(), because a) process creation is more expensive under
valgrind b) things take longer under valgrind, so we pay that price many more
times.

At the same time, increasing the poll loop's interval also makes the tests
slower, because all the waits for things that have already finished add up.

I'd guess that at the very least driving individual poll_query_until() calls
via a psql that stays running across queries would reduce this substantially,
and perhaps allow us to reduce the polling interval. But there are probably
some nontrivial challenges around recognizing query boundaries :/

Briefly looking at a valgrind profile, it looks like a lot of the CPU time
is spent doing syscalls related to logging. So far skink had
log_statement=all, log_connections=on, log_disconnections=on - I've turned
them off for the next runs. We'll see if that helps.

I'll also try to print a bit more detail about timing for each TAP test;
looks like I need to figure out how to pass PROVE_FLAGS='--timer' through
the buildfarm. Shouldn't be too hard.

One thing I think would really help is having the total time for each run
visible in an animal's run history. That way we could pinpoint regressions
reasonably efficiently; right now that's not easily possible without writing
nontrivial queries against the buildfarm database...

Greetings,

Andres Freund

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#4)
Re: Running tests under valgrind is getting slower at an alarming pace

Andres Freund <andres@anarazel.de> writes:

One thing I think would really help is having the total time for each run
visible in an animal's run history. That way we could pinpoint regressions
reasonably efficiently; right now that's not easily possible without writing
nontrivial queries against the buildfarm database...

+1. I've lost count of how often I've had to drill down to an individual
run just because I wanted to see how long it took. If we could fit that
into the branch history pages like

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=jay&br=HEAD

it'd be really useful IMO.

Perhaps we could replace "OK" with the total time, so as to avoid making
these tables bigger? (This presumes that the time for a failed run isn't
so interesting.)

regards, tom lane

#6Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#4)
Re: Running tests under valgrind is getting slower at an alarming pace

On Wed, Oct 6, 2021 at 12:47 PM Andres Freund <andres@anarazel.de> wrote:

There's probably some of that.

The fact that the tap test infrastructure does all communication with the
server via psql each only execute only a single query is a problem -
connection startup is expensive.

Agreed. safe_psql() is a poor-quality interface. I've been surprised in
the past that we were relying on something so primitive.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: Running tests under valgrind is getting slower at an alarming pace

Hi,

On 2021-10-06 12:58:34 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

One thing I think would really help is having the total time for each run
visible in an animal's run history. That way we could pinpoint regressions
reasonably efficiently; right now that's not easily possible without writing
nontrivial queries against the buildfarm database...

+1. I've lost count of how often I've had to drill down to an individual
run just because I wanted to see how long it took. If we could fit that
into the branch history pages like

I queried this in the DB for skink using

select snapshot::date,
       substring(git_head_ref, 1, 12) as git_rev,
       (select sum(stage_duration)
        from build_status_log_raw bslr
        where bslr.sysname = bsr.sysname
          and bslr.snapshot = bsr.snapshot)
from build_status_raw bsr
where branch = 'HEAD'
  and sysname = 'skink'
  and stage = 'OK'
  and snapshot > '2021-01-01'
order by snapshot desc;

snapshot | git_rev | sum
------------+--------------+----------
2021-10-06 | ec2133a44731 | 12:09:17
2021-10-05 | 0266e98c6b86 | 10:55:10
2021-10-03 | 2903f1404df3 | 10:24:11
2021-09-30 | 20f8671ef69b | 10:31:43
...
2021-06-14 | 2d689babe3cb | 10:29:07
2021-06-12 | f452aaf7d4a9 | 10:26:12
2021-06-11 | d08237b5b494 | 10:50:53
2021-06-09 | 845cad4d51cb | 10:58:31
2021-06-08 | eab81953682d | 09:06:35
2021-06-06 | a2dee328bbe5 | 09:02:36
2021-06-05 | e6159885b78e | 08:59:14
2021-06-03 | 187682c32173 | 09:39:07
2021-06-02 | df466d30c6ca | 09:03:05
2021-05-31 | 7c544ecdad81 | 09:09:42
2021-05-30 | ba356a397de5 | 08:54:29
2021-05-28 | d69fcb9caef1 | 09:00:36
2021-05-27 | 388e75ad3348 | 09:39:14
2021-05-25 | e30e3fdea873 | 08:51:04
2021-05-24 | 99c5852e20a0 | 08:57:08
...
2021-03-23 | 1e3e8b51bda8 | 09:19:40
2021-03-21 | 96ae658e6238 | 08:29:05
2021-03-20 | 61752afb2640 | 08:15:47
2021-03-18 | da18d829c281 | 08:34:02
2021-03-17 | 6b67d72b604c | 09:11:46
2021-03-15 | 146cb3889c3c | 08:20:21
2021-03-14 | 58f57490facd | 08:06:07
2021-03-12 | d60e61de4fb4 | 08:41:12
2021-03-11 | 3f0daeb02f8d | 08:04:44
2021-03-08 | 8a812e5106c5 | 08:46:01
2021-03-07 | f9a0392e1cf3 | 08:01:47
2021-03-05 | 0ce4cd04da55 | 08:01:32
2021-03-04 | 040af779382e | 07:56:31
2021-03-02 | 5b2f2af3d9d5 | 08:20:50
2021-03-01 | f5a5773a9dc4 | 07:59:14
...
2021-01-02 | 4d3f03f42227 | 08:14:41
2021-01-01 | 32d6287d2eef | 07:31:56

It's not too surprising that 2021-10-06 is slower; yesterday I changed things
so that more valgrind runs are done in parallel (increasing individual test
times, but still getting results faster than testing branches one by one).

I don't see anything immediately suspicious for the slowdown around
eab81953682d. Perhaps there was a system update at that time causing
changes; unfortunately I no longer have logs from back then. OTOH, I don't
see a clear slowdown in 13 or 12 around that time.

Greetings,

Andres Freund

#8Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: Running tests under valgrind is getting slower at an alarming pace

Hi,

On 2021-10-06 09:47:54 -0700, Andres Freund wrote:

I'll also try to print a bit more detail about timing for each TAP test;
looks like I need to figure out how to pass PROVE_FLAGS='--timer' through
the buildfarm. Shouldn't be too hard.

Turns out that the buildfarm already adds --timer. I added -j4 to allow for
some concurrency in tap tests, but unfortunately my animals fell over after
that (thanks Michael for noticing).

Looks like the buildfarm client code isn't careful enough about quoting PROVE_FLAGS?

    my $pflags = "PROVE_FLAGS=--timer";
    if (exists $ENV{PROVE_FLAGS})
    {
        $pflags =
          $ENV{PROVE_FLAGS}
          ? "PROVE_FLAGS=$ENV{PROVE_FLAGS}"
          : "";
    }

    @makeout =
      run_log("cd $dir && $make NO_LOCALE=1 $pflags $instflags $taptarget");

Which doesn't work if pflags ends up as '-j4 --timer' or such...
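The word-splitting hazard can be sketched in a few lines (illustrative only;
shlex.quote here stands in for whatever quoting the buildfarm client ends up
using when interpolating PROVE_FLAGS into the shell command):

```python
# With PROVE_FLAGS='-j4 --timer', unquoted interpolation hands the shell
# two words, and make sees a stray "--timer" argument instead of one
# variable assignment. Quoting the value keeps it a single word.
import shlex

pflags = "-j4 --timer"
unquoted = "make PROVE_FLAGS=%s" % pflags             # broken: two words
quoted = "make PROVE_FLAGS=%s" % shlex.quote(pflags)  # one assignment
```

So the fix is to wrap the interpolated value in quotes before it reaches
run_log(), rather than to restrict what PROVE_FLAGS may contain.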

Greetings,

Andres Freund

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#8)
Re: Running tests under valgrind is getting slower at an alarming pace

On 10/6/21 10:03 PM, Andres Freund wrote:

Hi,

On 2021-10-06 09:47:54 -0700, Andres Freund wrote:

I'll also try to print a bit more detail about timing for each TAP test;
looks like I need to figure out how to pass PROVE_FLAGS='--timer' through
the buildfarm. Shouldn't be too hard.

Turns out that the buildfarm already adds --timer. I added -j4 to allow for
some concurrency in tap tests, but unfortunately my animals fell over after
that (thanks Michael for noticing).

Looks like the buildfarm client code isn't careful enough about quoting PROVE_FLAGS?

    my $pflags = "PROVE_FLAGS=--timer";
    if (exists $ENV{PROVE_FLAGS})
    {
        $pflags =
          $ENV{PROVE_FLAGS}
          ? "PROVE_FLAGS=$ENV{PROVE_FLAGS}"
          : "";
    }

    @makeout =
      run_log("cd $dir && $make NO_LOCALE=1 $pflags $instflags $taptarget");

Which doesn't work if pflags ends up as '-j4 --timer' or such...

see
<https://github.com/PGBuildFarm/client-code/commit/85ba5866c334f16c8683b524743f4d714be28d77>

cheers

andrew

--

Andrew Dunstan
EDB: https://www.enterprisedb.com

#10Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#9)
Re: Running tests under valgrind is getting slower at an alarming pace

On 2021-10-08 15:41:09 -0400, Andrew Dunstan wrote:

see
<https://github.com/PGBuildFarm/client-code/commit/85ba5866c334f16c8683b524743f4d714be28d77>

Thanks!