how to speed up 002_pg_upgrade.pl and 025_stream_regress.pl under valgrind

Started by Tomas Vondra, over 1 year ago, 6 messages, pgsql-hackers
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

I've been doing a lot of tests under valgrind lately, and it made me
acutely aware of how long check-world takes. I realize valgrind is
inherently expensive and slow, and maybe the reasonable reply is to just
run a couple tests that are "interesting" for a patch ...

Anyway, I did a simple experiment - I ran check-world with timing info
for TAP tests, both with and without valgrind, and if I plot the results
I get the attached charts (same data, second one has log-scale axes).

The basic rule is that valgrind means a very consistent 100x slowdown. I
guess it might vary a bit depending on compile flags, but not much.
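
(For reference, one way to get every backend under valgrind is to move
the real postgres binary aside and install a wrapper in its place - a
sketch, with illustrative install paths; valgrind.supp is the
suppression file shipped in src/tools:)

```shell
#!/bin/sh
# Installed as "postgres", with the real binary renamed to
# postgres.orig, so every server the tests start runs under valgrind.
exec valgrind --quiet --trace-children=yes --leak-check=no \
    --suppressions=/path/to/src/tools/valgrind.supp \
    /usr/local/pgsql/bin/postgres.orig "$@"
```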

But two tests very clearly stand out - not by slowdown, which is
perfectly in line with the 100x figure, but by total duration. I've
labeled them on the linear-scale chart.

It's 002_pg_upgrade and 027_stream_regress. I guess the reasons for the
slowness are pretty clear - those are massive tests. pg_upgrade creates,
dumps and restores many objects (thousands?), and stream_regress runs
the whole regress test suite on a primary and cross-checks what gets
replicated to the standby. So the slowness is somewhat expected.

Still, I wonder if there might be a faster way to run these tests,
because these two alone add close to 3h to the valgrind run. Of course,
it's not just about valgrind - these tests are slow even in regular
runs, taking almost a minute each, just on a different scale (minutes
instead of hours). It would be nice to speed that up too.

I don't have a great idea how to speed up these tests, unfortunately.
But one of the problems is that all the TAP tests run serially - one
after another. Could we instead run them in parallel? The tests set up
their own "private" clusters anyway, right?
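
(For what it's worth, the make build already passes PROVE_FLAGS through
to prove, and the meson runner schedules suites in parallel - a sketch,
I haven't verified how well our tests behave with either:)

```shell
# make build: run prove with four parallel jobs
make check-world PROVE_FLAGS='--jobs=4'

# meson build: the runner parallelizes suites; set the degree explicitly
meson test --num-processes 4
```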

regards

--
Tomas Vondra

Attachments:

valgrind-log-scale.png (image/png)
valgrind.png (image/png)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#1)
Re: how to speed up 002_pg_upgrade.pl and 025_stream_regress.pl under valgrind

Tomas Vondra <tomas@vondra.me> writes:

[ 002_pg_upgrade and 027_stream_regress are slow ]

I don't have a great idea how to speed up these tests, unfortunately.
But one of the problems is that all the TAP tests run serially - one
after another. Could we instead run them in parallel? The tests set up
their own "private" clusters anyway, right?

But there's parallelism within those two tests already, or I would
hope so at least. If you run them in parallel then you are probably
causing 40 backends instead of 20 to be running at once (plus 40
valgrind instances). Maybe you have a machine beefy enough to make
that useful, but I don't.

Really the way to fix those two tests would be to rewrite them to not
depend on the core regression tests. The core tests do a lot of work
that's not especially useful for the purposes of those tests, and it's
not even clear that they are exercising all that we'd like to have
exercised for those purposes. In the case of 002_pg_upgrade, all
we really need to do is create objects that will stress all of
pg_dump. It's a little harder to scope out what we want to test for
027_stream_regress, but it's still clear that the core tests do a lot
of work that's not helpful.

regards, tom lane

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: how to speed up 002_pg_upgrade.pl and 025_stream_regress.pl under valgrind

On 9/15/24 20:31, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

[ 002_pg_upgrade and 027_stream_regress are slow ]

I don't have a great idea how to speed up these tests, unfortunately.
But one of the problems is that all the TAP tests run serially - one
after another. Could we instead run them in parallel? The tests set up
their own "private" clusters anyway, right?

But there's parallelism within those two tests already, or I would
hope so at least. If you run them in parallel then you are probably
causing 40 backends instead of 20 to be running at once (plus 40
valgrind instances). Maybe you have a machine beefy enough to make
that useful, but I don't.

I did look into that for both tests, albeit not very thoroughly, and
most of the time there were only 1-2 valgrind processes using CPU. The
stream_regress test seems more aggressive, but even there the CPU spikes
are short, and the machine could easily do something else in parallel.

I'll try to do better analysis and some charts to visualize this ...

Really the way to fix those two tests would be to rewrite them to not
depend on the core regression tests. The core tests do a lot of work
that's not especially useful for the purposes of those tests, and it's
not even clear that they are exercising all that we'd like to have
exercised for those purposes. In the case of 002_pg_upgrade, all
we really need to do is create objects that will stress all of
pg_dump. It's a little harder to scope out what we want to test for
027_stream_regress, but it's still clear that the core tests do a lot
of work that's not helpful.

Perhaps, but that's a lot of work and time, and it's tricky - it seems
we might easily lose some useful test coverage, even if it's not the
original purpose of that particular script.

regards

--
Tomas Vondra

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#2)
Re: how to speed up 002_pg_upgrade.pl and 025_stream_regress.pl under valgrind

On Mon, Sep 16, 2024 at 6:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Really the way to fix those two tests would be to rewrite them to not
depend on the core regression tests. The core tests do a lot of work
that's not especially useful for the purposes of those tests, and it's
not even clear that they are exercising all that we'd like to have
exercised for those purposes. In the case of 002_pg_upgrade, all
we really need to do is create objects that will stress all of
pg_dump. It's a little harder to scope out what we want to test for
027_stream_regress, but it's still clear that the core tests do a lot
of work that's not helpful.

027_stream_regress wants to get test coverage for the _redo routines
and replay subsystem, so I've wondered about defining a
src/test/regress/redo_schedule that removes what can be removed
without reducing _redo coverage. For example, join_hash.sql must eat
a *lot* of valgrind CPU cycles, and contributes nothing to redo
testing.
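
A hypothetical redo_schedule could start as a copy of parallel_schedule
with the redo-irrelevant groups dropped; pg_regress already takes an
alternative schedule file via --schedule (the test names below are just
illustrative):

```shell
# Hypothetical src/test/regress/redo_schedule, same format as
# parallel_schedule: each "test:" line is one parallel group.
cat > redo_schedule <<'EOF'
test: create_table create_index
test: insert update delete
test: vacuum
EOF

# Point pg_regress at it instead of the stock schedule:
pg_regress --schedule=redo_schedule
```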

Thinking along the same lines, 002_pg_upgrade wants to create database
objects to dump, so I was thinking you could have a dump_schedule that
removes anything that doesn't leave objects behind. But you might be
right that it'd be better to start from scratch for that with that
goal in mind, and arguably also for the other.

(An interesting archeological detail about the regression tests is
that they seem to derive from the Wisconsin benchmark, famous for
benchmark wars and Oracle lawyers[1]. It seems quaint now that 'tenk'
was a lot of tuples, but I guess that Ingres on a PDP-11, which caused
offence by running that benchmark 5x faster, ran in something like
128kB of memory[2], so I can only guess the buffer pool must have been
something like 8 buffers or not much more in total?)

[1]: https://jimgray.azurewebsites.net/BenchmarkHandbook/chapter4.pdf
[2]: https://www.seas.upenn.edu/~zives/cis650/papers/INGRES.PDF

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#4)
Re: how to speed up 002_pg_upgrade.pl and 025_stream_regress.pl under valgrind

Thomas Munro <thomas.munro@gmail.com> writes:

(An interesting archeological detail about the regression tests is
that they seem to derive from the Wisconsin benchmark, famous for
benchmark wars and Oracle lawyers[1].

This is quite off-topic for the thread, but ... we actually had an
implementation of the Wisconsin benchmark in src/test/bench, which
we eventually removed (a05a4b478). It does look like the modern
regression tests borrowed the definitions of "tenk1" and some related
tables from there, but I think it'd be a stretch to say the tests
descended from it.

regards, tom lane

#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#3)
Re: how to speed up 002_pg_upgrade.pl and 025_stream_regress.pl under valgrind

On 9/15/24 21:47, Tomas Vondra wrote:

On 9/15/24 20:31, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

[ 002_pg_upgrade and 027_stream_regress are slow ]

I don't have a great idea how to speed up these tests, unfortunately.
But one of the problems is that all the TAP tests run serially - one
after another. Could we instead run them in parallel? The tests set up
their own "private" clusters anyway, right?

But there's parallelism within those two tests already, or I would
hope so at least. If you run them in parallel then you are probably
causing 40 backends instead of 20 to be running at once (plus 40
valgrind instances). Maybe you have a machine beefy enough to make
that useful, but I don't.

I did look into that for both tests, albeit not very thoroughly, and
most of the time there were only 1-2 valgrind processes using CPU. The
stream_regress test seems more aggressive, but even there the CPU spikes
are short, and the machine could easily do something else in parallel.

I'll try to do better analysis and some charts to visualize this ...

I see there's already a discussion about how to make these tests cheaper
by running only a subset of the regression tests, but here are two
charts showing the number of processes and the CPU usage for the two
tests (under valgrind). In both cases there are occasional spikes with
>10 backends and high CPU usage, but most of the time it's only 1-2
processes, using 1-2 cores.
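
For the record, data like this can be collected with a trivial sampling
loop along these lines (a sketch - the process name pattern, interval,
and sample count are all adjustable):

```shell
# Append one "<epoch> <process count>" line per second to procs.log;
# three samples here, a real run would cover the whole test.
# pgrep -c prints the number of matching processes (0 if none).
samples=3
: > procs.log
for i in $(seq "$samples"); do
    printf '%s %s\n' "$(date +%s)" "$(pgrep -c valgrind || true)"
    sleep 1
done >> procs.log
cat procs.log
```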

In fact, the two charts are almost exactly the same - which is somewhat
expected, considering the expensive part is running regression tests,
and that's the same for both.

But doesn't this also mean we might speed up check-world by reordering
the tests a bit? The low-usage parts happen because one of the tests in
a group takes much longer than the rest, so what if we moved those slow
tests into a group of their own?

regards

--
Tomas Vondra

Attachments:

027_stream_regress.png (image/png)
002_pg_upgrade.pl.png (image/png)