TAP tests take a long time

Started by Andrew Dunstanalmost 9 years ago24 messages
#1Andrew Dunstan
andrew.dunstan@2ndquadrant.com

I've complained about this before. Below are some timings from buildfarm
member nightjar as I test out the new client code.

This buildfarm run as you can see takes 33m32s, and the Tap tests take a
combined 19m52s of that time. That seems quite an inordinate amount of
time, when checking out the code, configuring, building and running all
the other checks in two locales in many cases consumes 13m40s. If people
want to incorporate these into the check-world targets and then
encourage developers to run check-world, there will be a massive fail if
you can take a shower and make and drink a cup of coffee in the time it
takes to run it.

cheers

andrew

Tue Apr 11 10:49:18 2017: buildfarm run for nightjar:HEAD starting
[10:49:18] checking out source ...
[10:49:34] checking if build run needed ...
[10:49:34] creating vpath build dir pgsql.build ...
[10:49:34] running configure ...
[10:49:59] running make ...
[10:50:43] running make check ...
[10:51:47] running make contrib ...
[10:51:53] running make testmodules ...
[10:51:54] running make install ...
[10:52:01] running make contrib install ...
[10:52:03] running make testmodules install ...
[10:52:03] checking pg_upgrade
[10:54:17] running make bin installcheck ...
[11:04:04] running make misc checks ...
[11:14:09] setting up db cluster (C)...
[11:14:15] starting db (C)...
[11:14:16] running make installcheck (C)...
[11:15:31] restarting db (C)...
[11:15:36] running make isolation check ...
[11:17:24] restarting db (C)...
[11:17:28] running make PL installcheck (C)...
[11:17:36] restarting db (C)...
[11:17:38] running make contrib installcheck (C)...
[11:18:24] restarting db (C)...
[11:18:27] running make test-modules installcheck (C)...
[11:18:39] stopping db (C)...
[11:18:56] setting up db cluster (cs_CZ.UTF-8)...
[11:19:01] starting db (cs_CZ.UTF-8)...
[11:19:02] running make installcheck (cs_CZ.UTF-8)...
[11:20:03] restarting db (cs_CZ.UTF-8)...
[11:20:08] running make PL installcheck (cs_CZ.UTF-8)...
[11:20:20] restarting db (cs_CZ.UTF-8)...
[11:20:22] running make contrib installcheck (cs_CZ.UTF-8)...
[11:21:06] restarting db (cs_CZ.UTF-8)...
[11:21:08] running make test-modules installcheck (cs_CZ.UTF-8)...
[11:21:24] stopping db (cs_CZ.UTF-8)...
[11:21:42] running make ecpg check ...
[11:22:12] running find_typedefs ...
[11:22:50] OK

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#1)
Re: TAP tests take a long time

On Tue, Apr 11, 2017 at 11:32 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

I've complained about this before. Below are some timings from buildfarm
member nightjar as I test out the new client code.

This buildfarm run as you can see takes 33m32s, and the Tap tests take a
combined 19m52s of that time. That seems quite an inordinate amount of
time, when checking out the code, configuring, building and running all
the other checks in two locales in many cases consumes 13m40s. If people
want to incorporate these into the check-world targets and then
encourage developers to run check-world, there will be a massive fail if
you can take a shower and make and drink a cup of coffee in the time it
takes to run it.

I don't think it's quite fair to complain about the TAP tests taking a
long time to run as a general matter. Many people here have long
wanted a separate test-suite for long running tests that can be run to
really check everything possible, and apparently, TAP tests are it.

What I think would be more useful is to break down where the time is
getting spent. It may be that some of those tests are not adding
value proportionate to their runtime.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: TAP tests take a long time

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Apr 11, 2017 at 11:32 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

This buildfarm run as you can see takes 33m32s, and the Tap tests take a
combined 19m52s of that time.

I don't think it's quite fair to complain about the TAP tests taking a
long time to run as a general matter. Many people here have long
wanted a separate test-suite for long running tests that can be run to
really check everything possible, and apparently, TAP tests are it.

What I think would be more useful is to break down where the time is
getting spent. It may be that some of those tests are not adding
value proportionate to their runtime.

The other thing that might be useful here is to push on parallelizing
buildfarm runs. Admittedly that will do nothing for the oldest and
slowest buildfarm critters, but for reasonably modern hardware the
serialization of the tests really handicaps you. We seem to have
fixed that for manual application of "make check-world", at least
if you know the right magic incantations to parallelize it; but
AFAIK the buildfarm script is entirely serial.

BTW, speaking of "value proportionate to runtime", the hash_index
regression script is now visibly a bottleneck in the core regression
tests. Could we take another look at whether that needs to run
quite so long?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: TAP tests take a long time

On 4/11/17 12:08, Tom Lane wrote:

The other thing that might be useful here is to push on parallelizing
buildfarm runs. Admittedly that will do nothing for the oldest and
slowest buildfarm critters, but for reasonably modern hardware the
serialization of the tests really handicaps you.

Especially the recovery and subscription tests do a lot of waiting for
one of the nodes to catch up. You get by that by running several of
these tests in parallel, which is well supported now.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: TAP tests take a long time

On 04/11/2017 12:08 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Apr 11, 2017 at 11:32 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

This buildfarm run as you can see takes 33m32s, and the Tap tests take a
combined 19m52s of that time.

I don't think it's quite fair to complain about the TAP tests taking a
long time to run as a general matter. Many people here have long
wanted a separate test-suite for long running tests that can be run to
really check everything possible, and apparently, TAP tests are it.
What I think would be more useful is to break down where the time is
getting spent. It may be that some of those tests are not adding
value proportionate to their runtime.

Well, you can get a lot of information on timings in crake's latest
builds for example, or nightjar's.

Here's an interesting fact: almost all the time taken up in the scripts
test set seems to be consumed in running initdb over and over. The
actual tests take almost no time at all. Other test sets might have
similar stats, I am still digging.

+1 for doing some assessments on test value, and also for possibly
creating several test categories, based on both speed and criticality. I
could imaging having some very long running tests, using TAP or
pg_regress, and optionally enabling them in a buildfarm critter.

I'm not at all opposed to having TAP tests - I have just put in quite a
bit of time extending and improving buildfarm coverage of them. And I'm
probably about to do some more work to enable similar functionality for
MSVC.

The other thing that might be useful here is to push on parallelizing
buildfarm runs. Admittedly that will do nothing for the oldest and
slowest buildfarm critters, but for reasonably modern hardware the
serialization of the tests really handicaps you. We seem to have
fixed that for manual application of "make check-world", at least
if you know the right magic incantations to parallelize it; but
AFAIK the buildfarm script is entirely serial.

Yeah, it is, and changing that won't be simple. Say 3 steps run in
parallel fail? What do we do? I guess we can just pick the first (by
some definition) and discard the others. Parallelizing individual steps
might be simpler, I'd have to take a look.

Honestly, I'm not that concerned about how long it takes the buildfarm
client per se to run, although I am working on facilities for it where
timing will matter to some people.

And anyone who is concerned about how long it takes on their machine can
run without --enable-tap-tests.

Not sure what the magic incantation is other than "make -j nn"?

One reason I haven't supported "make -j nn" everywhere (it is supported
for making core, contrib and test_modules) is that the output tends to
get rather jumbled, and I didn't want to impose that extra burden in
people trying to decipher the results.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: TAP tests take a long time

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

The other thing that might be useful here is to push on parallelizing
buildfarm runs.

One reason I haven't supported "make -j nn" everywhere (it is supported
for making core, contrib and test_modules) is that the output tends to
get rather jumbled, and I didn't want to impose that extra burden in
people trying to decipher the results.

Agreed, that would be a mess. I was thinking in terms of running steps in
parallel if they have independent output log files, so that that problem
wouldn't arise. AFAIK, for example, we could run the per-subdirectory
"make check" tests in src/bin/ in parallel without incurring any
legibility issues.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: TAP tests take a long time

On 2017-04-11 15:52:34 -0400, Tom Lane wrote:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

The other thing that might be useful here is to push on parallelizing
buildfarm runs.

One reason I haven't supported "make -j nn" everywhere (it is supported
for making core, contrib and test_modules) is that the output tends to
get rather jumbled, and I didn't want to impose that extra burden in
people trying to decipher the results.

Agreed, that would be a mess.

Doesn't make's -Otarget largely solve this?

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: TAP tests take a long time

Andres Freund <andres@anarazel.de> writes:

On 2017-04-11 15:52:34 -0400, Tom Lane wrote:

Agreed, that would be a mess.

Doesn't make's -Otarget largely solve this?

Given a sufficiently new make (which would include exactly zero of
my own buildfarm critters), that would help. I think it does nothing
much for "prove -j" though, and certainly not for MSVC.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: TAP tests take a long time

On 04/11/2017 03:58 PM, Andres Freund wrote:

On 2017-04-11 15:52:34 -0400, Tom Lane wrote:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

The other thing that might be useful here is to push on parallelizing
buildfarm runs.

One reason I haven't supported "make -j nn" everywhere (it is supported
for making core, contrib and test_modules) is that the output tends to
get rather jumbled, and I didn't want to impose that extra burden in
people trying to decipher the results.

Agreed, that would be a mess.

Doesn't make's -Otarget largely solve this?

Possibly, but I have just gone to a not inconsiderable amount of trouble
to disaggregate the TAP tests so we get not only the output but all the
logs for each step separately. Say there's a problem in pg_rewind.
Compare this
<https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=nightjar&amp;dt=2017-04-11%2017%3A54%3A34&amp;stg=pg_rewind-installcheck&gt;
with having to pick apart this:
<https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&amp;dt=2017-04-08%2010%3A37%3A38&amp;stg=bin-check&gt;
So I don't really want the output from these run in parallel to be run
all together, even if they aren't jumbled.

Now it might make sense for other pieces to be run like that. I will
have a look. But the ones that are taking the most time are ones that
seem least suitable.

Afterthought: It might be possible to make it a runtime option.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Craig Ringer
craig.ringer@2ndquadrant.com
In reply to: Andrew Dunstan (#5)
Re: TAP tests take a long time

On 12 Apr. 2017 03:14, "Andrew Dunstan" <andrew.dunstan@2ndquadrant.com>
wrote:

On 04/11/2017 12:08 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Apr 11, 2017 at 11:32 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

This buildfarm run as you can see takes 33m32s, and the Tap tests take a
combined 19m52s of that time.

I don't think it's quite fair to complain about the TAP tests taking a
long time to run as a general matter. Many people here have long
wanted a separate test-suite for long running tests that can be run to
really check everything possible, and apparently, TAP tests are it.
What I think would be more useful is to break down where the time is
getting spent. It may be that some of those tests are not adding
value proportionate to their runtime.

Well, you can get a lot of information on timings in crake's latest
builds for example, or nightjar's.

Here's an interesting fact: almost all the time taken up in the scripts
test set seems to be consumed in running initdb over and over.

Is it worth adding an init(cache => 1) option to PostgresNode, where we
stash an initdb'd db in tmp_check/ and just do a simple fs copy of it ?

Even default to caching and allow an option to disable the cached copy.

We're going to need to initdb a lot in the TAP tests. We often need a clean
state. Tests also get harder to debug the more you pack into a single test
run and more difficult to run individually to test some specific failure.
So IMO the best thing is to try to make that repeat initdb as fast as
possible.

It reduces our coverage of initdb only incredibly slightly - all that
repeat runs will do is help find very uncommon intermittent failures. And
we rerun the buildfarm all the time so it's not like there's a shortage of
initdb runs anyway.

We should also have a debug --no-fsync option for initdb, or maybe allow it
to take -o options to pass to child postgres so we can pass fsync=off .

#11Michael Paquier
michael.paquier@gmail.com
In reply to: Craig Ringer (#10)
Re: TAP tests take a long time

On Wed, Apr 12, 2017 at 9:12 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

Well, you can get a lot of information on timings in crake's latest
builds for example, or nightjar's.

Here's an interesting fact: almost all the time taken up in the scripts
test set seems to be consumed in running initdb over and over.

Is it worth adding an init(cache => 1) option to PostgresNode, where we
stash an initdb'd db in tmp_check/ and just do a simple fs copy of it ?

This looks like a good idea to me, but I don't like much the idea of
an option in the init() routine as that's hard to maintain. It would
make sense to me to be able to override the initialization with an
environment variable instead, or just make it the default and get the
base image stored in tmp_install/. Base backups are out of scope
though.

It reduces our coverage of initdb only incredibly slightly - all that repeat
runs will do is help find very uncommon intermittent failures. And we rerun
the buildfarm all the time so it's not like there's a shortage of initdb
runs anyway.

initdb is always run with the same set of options when init() is
called, so this is not something to worry about. And tests inherent to
initdb should happen in src/bin/initdb.

We should also have a debug --no-fsync option for initdb, or maybe allow it
to take -o options to pass to child postgres so we can pass fsync=off .

That's an idea as well...
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Craig Ringer
craig.ringer@2ndquadrant.com
In reply to: Michael Paquier (#11)
Re: TAP tests take a long time

On 12 April 2017 at 08:22, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Apr 12, 2017 at 9:12 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

We should also have a debug --no-fsync option for initdb, or maybe allow it
to take -o options to pass to child postgres so we can pass fsync=off .

That's an idea as well...

OK, looking at this, initdb already runs postgres -F then does its own
fsync()s at the end.

It already has a -N / --no-sync option too. Which is already used by
PostgresNode.pm . So as much as can be done has already been done
here.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Craig Ringer (#10)
Re: TAP tests take a long time

On 04/11/2017 08:12 PM, Craig Ringer wrote:

On 12 Apr. 2017 03:14, "Andrew Dunstan"
<andrew.dunstan@2ndquadrant.com
<mailto:andrew.dunstan@2ndquadrant.com>> wrote:

On 04/11/2017 12:08 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com

<mailto:robertmhaas@gmail.com>> writes:

On Tue, Apr 11, 2017 at 11:32 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com

<mailto:andrew.dunstan@2ndquadrant.com>> wrote:

This buildfarm run as you can see takes 33m32s, and the Tap

tests take a

combined 19m52s of that time.

I don't think it's quite fair to complain about the TAP tests

taking a

long time to run as a general matter. Many people here have long
wanted a separate test-suite for long running tests that can be

run to

really check everything possible, and apparently, TAP tests are it.
What I think would be more useful is to break down where the

time is

getting spent. It may be that some of those tests are not adding
value proportionate to their runtime.

Well, you can get a lot of information on timings in crake's latest
builds for example, or nightjar's.

Here's an interesting fact: almost all the time taken up in the
scripts
test set seems to be consumed in running initdb over and over.

Is it worth adding an init(cache => 1) option to PostgresNode, where
we stash an initdb'd db in tmp_check/ and just do a simple fs copy of
it ?

Even default to caching and allow an option to disable the cached copy.

We're going to need to initdb a lot in the TAP tests. We often need a
clean state. Tests also get harder to debug the more you pack into a
single test run and more difficult to run individually to test some
specific failure. So IMO the best thing is to try to make that repeat
initdb as fast as possible.

It reduces our coverage of initdb only incredibly slightly - all that
repeat runs will do is help find very uncommon intermittent failures.
And we rerun the buildfarm all the time so it's not like there's a
shortage of initdb runs anyway.

We should also have a debug --no-fsync option for initdb, or maybe
allow it to take -o options to pass to child postgres so we can pass
fsync=off .

Absolutely worth it I should say.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#12)
Re: TAP tests take a long time

On 2017-04-12 10:39:22 +0800, Craig Ringer wrote:

On 12 April 2017 at 08:22, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Apr 12, 2017 at 9:12 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

We should also have a debug --no-fsync option for initdb, or maybe allow it
to take -o options to pass to child postgres so we can pass fsync=off .

That's an idea as well...

OK, looking at this, initdb already runs postgres -F then does its own
fsync()s at the end.

It already has a -N / --no-sync option too. Which is already used by
PostgresNode.pm . So as much as can be done has already been done
here.

The time in initdb is largely spent in regprocin and server start/stop,
so that doesn't surprise me.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: TAP tests take a long time

Andres Freund <andres@anarazel.de> writes:

The time in initdb is largely spent in regprocin and server start/stop,
so that doesn't surprise me.

Yeah. We had a plan for removing the regprocin overhead via preprocessing
of the .bki file, but I've not heard anything about that project for
months :-(

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Craig Ringer
craig.ringer@2ndquadrant.com
In reply to: Michael Paquier (#11)
1 attachment(s)
Re: TAP tests take a long time

On 12 April 2017 at 08:22, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Apr 12, 2017 at 9:12 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

Well, you can get a lot of information on timings in crake's latest
builds for example, or nightjar's.

Here's an interesting fact: almost all the time taken up in the scripts
test set seems to be consumed in running initdb over and over.

Is it worth adding an init(cache => 1) option to PostgresNode, where we
stash an initdb'd db in tmp_check/ and just do a simple fs copy of it ?

This looks like a good idea to me, but I don't like much the idea of
an option in the init() routine as that's hard to maintain.

How so?

It would
make sense to me to be able to override the initialization with an
environment variable instead

Yeah, that's reasonable.

Patch attached. It supports setting CACHE_INITDB=0 to disable the cache.

With cache: 3m55.063s (user 7.7s, sys 0m11.833s)

Without cache: 4m11.229s (user 0m16.963s, sys 0m12.384s)

so in a serial run it doesn't make a ton of difference.

Adding some timing on initdb runs shows that initdb takes about 1s, a
cached copy about 0.1s with our Perl based recursive copy code. So
it's hardly an overwhelming benefit, but might be more beneficial with
prove -j .

BTW, I suggest adding --timer to our default PROVE_FLAGS, so we can
collect more data from the buildfarm on what takes a while. Sample
output:

[13:20:47] t/001_stream_rep.pl .................. ok 50445 ms (
0.01 usr 0.00 sys + 2.16 cusr 3.57 csys = 5.74 CPU)
[13:21:38] t/002_archiving.pl ................... ok 5883 ms (
0.01 usr 0.00 sys + 0.19 cusr 0.45 csys = 0.65 CPU)
[13:21:44] t/003_recovery_targets.pl ............ ok 26197 ms (
0.00 usr 0.00 sys + 0.70 cusr 1.75 csys = 2.45 CPU)
[13:22:10] t/004_timeline_switch.pl ............. ok 10049 ms (
0.01 usr 0.00 sys + 0.24 cusr 0.53 csys = 0.78 CPU)
[13:22:20] t/005_replay_delay.pl ................ ok 10224 ms (
0.00 usr 0.01 sys + 0.18 cusr 0.43 csys = 0.62 CPU)
[13:22:30] t/006_logical_decoding.pl ............ ok 7570 ms (
0.00 usr 0.00 sys + 0.25 cusr 0.32 csys = 0.57 CPU)
[13:22:38] t/007_sync_rep.pl .................... ok 20693 ms (
0.00 usr 0.00 sys + 0.51 cusr 1.24 csys = 1.75 CPU)
[13:22:58] t/008_fsm_truncation.pl .............. ok 25399 ms (
0.01 usr 0.00 sys + 0.26 cusr 0.52 csys = 0.79 CPU)
[13:23:24] t/009_twophase.pl .................... ok 58531 ms (
0.01 usr 0.00 sys + 0.49 cusr 0.92 csys = 1.42 CPU)
[13:24:22] t/010_logical_decoding_timelines.pl .. ok 8135 ms (
0.00 usr 0.01 sys + 0.34 cusr 0.76 csys = 1.11 CPU)
[13:24:30] t/011_crash_recovery.pl .............. ok 4334 ms (
0.00 usr 0.00 sys + 0.12 cusr 0.16 csys = 0.28 CPU)

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

cache-initdb-tap-v1.patchtext/x-patch; charset=US-ASCII; name=cache-initdb-tap-v1.patchDownload
From d2b30831803eb3edb2e39b2e49b89bc74a65ee37 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 12 Apr 2017 13:25:19 +0800
Subject: [PATCH] Cache initdb runs in TAP tests

---
 src/test/perl/PostgresNode.pm  | 51 +++++++++++++++++++++++++++++++++++++++---
 src/test/perl/RecursiveCopy.pm |  5 +++++
 src/test/perl/TestLib.pm       |  5 +++++
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index cb84f1f..eef9f66 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -391,13 +391,28 @@ sub init
 
 	$params{allows_streaming} = 0 unless defined $params{allows_streaming};
 	$params{has_archiving}    = 0 unless defined $params{has_archiving};
+	$params{cache_initdb}	  = 1 unless defined $params{cache_initdb};
 
 	mkdir $self->backup_dir;
 	mkdir $self->archive_dir;
 
-	TestLib::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
-		@{ $params{extra} });
-	TestLib::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata);
+	# Respect CACHE_INITDB=0 in the environment if set
+	$params{cache_initdb} = 0
+		if (defined($ENV{'CACHE_INITDB'}) && !$ENV{'CACHE_INITDB'});
+
+	# We don't cache initdb results for non-default runs
+	# and they're too uncommon to care about anyway.
+	$params{cache_initdb} = 0
+		if defined($params{extra});
+
+	if ($params{cache_initdb})
+	{
+		$self->_initdb_cached(%params);
+	}
+	else
+	{
+		$self->_initdb($pgdata, %params);
+	}
 
 	open my $conf, '>>', "$pgdata/postgresql.conf";
 	print $conf "\n# Added by PostgresNode.pm\n";
@@ -446,6 +461,36 @@ sub init
 	$self->enable_archiving     if $params{has_archiving};
 }
 
+# Wrapper to actually run initdb its self
+sub _initdb
+{
+	my ($self, $pgdata, %params) = @_;
+
+	TestLib::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
+		@{ $params{extra} });
+	TestLib::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata);
+}
+
+sub _initdb_cached
+{
+	my ($self, %params) = @_;
+	my $pgdata = $self->data_dir;
+
+	my $cachedir = TestLib::tmp_check_dir() . "/initdb_cache";
+
+	if (! -d $cachedir)
+	{
+		print(STDERR "initializing initdb cache\n");
+		# initdb into a tempdir, then rename the result so we don't risk
+		# leaving failed initdb results around
+		my $temp_dbdir = TestLib::tempdir("initdb_cache_");
+		$self->_initdb($temp_dbdir, %params);
+		rename($temp_dbdir, $cachedir);
+	}
+
+	RecursiveCopy::copypath($cachedir, $pgdata);
+}
+
 =pod
 
 =item $node->append_conf(filename, str)
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index 28ecaf6..103b976 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,9 @@ use warnings;
 use File::Basename;
 use File::Copy;
 
+# see https://perldoc.perl.org/perlport.html#stat
+${^WIN32_SLOPPY_STAT} = 1;
+
 =pod
 
 =head1 DESCRIPTION
@@ -111,6 +114,8 @@ sub _copypath_recurse
 
 	# Otherwise this is directory: create it on dest and recurse onto it.
 	mkdir($destpath) or die "mkdir($destpath) failed: $!";
+	# It's OK to run this on win32, see perlport
+	chmod((stat $srcpath)[2] & 07777, $destpath) or die "chmod() of $destpath failed: $!";
 
 	opendir(my $directory, $srcpath) or die "could not opendir($srcpath): $!";
 	while (my $entry = readdir($directory))
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index ae8d178..5187f75 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -141,6 +141,11 @@ sub tempdir
 		CLEANUP => 1);
 }
 
+sub tmp_check_dir
+{
+	return $tmp_check;
+}
+
 sub tempdir_short
 {
 
-- 
2.5.5

#17Craig Ringer
craig.ringer@2ndquadrant.com
In reply to: Craig Ringer (#16)
1 attachment(s)
Re: TAP tests take a long time

On 12 April 2017 at 13:27, Craig Ringer <craig.ringer@2ndquadrant.com> wrote:

On 12 April 2017 at 08:22, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Apr 12, 2017 at 9:12 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

Well, you can get a lot of information on timings in crake's latest
builds for example, or nightjar's.

Here's an interesting fact: almost all the time taken up in the scripts
test set seems to be consumed in running initdb over and over.

Is it worth adding an init(cache => 1) option to PostgresNode, where we
stash an initdb'd db in tmp_check/ and just do a simple fs copy of it ?

This looks like a good idea to me, but I don't like much the idea of
an option in the init() routine as that's hard to maintain.

How so?

It would
make sense to me to be able to override the initialization with an
environment variable instead

Yeah, that's reasonable.

Patch attached. It supports setting CACHE_INITDB=0 to disable the cache.

With cache: 3m55.063s (user 7.7s, sys 0m11.833s)

Without cache: 4m11.229s (user 0m16.963s, sys 0m12.384s)

so in a serial run it doesn't make a ton of difference.

Adding some timing on initdb runs shows that initdb takes about 1s, a
cached copy about 0.1s with our Perl based recursive copy code. So
it's hardly an overwhelming benefit, but might be more beneficial with
prove -j .

Of course, it'll need locking for prove -j. Added. Patch attached,
applies on top of prior patch.

With

make PROVE_FLAGS="--timer -j 9" check

I don't see much difference with/without caching initdb results -
saves about 4 seconds, from 74 to 70 seconds, but hard to tell with
the error margins.

So if we're going to do anything, defaulting to parallel prove seems a
good start, and I'm not sure caching initdb results is worth it.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

cache-initdb-tap-locking-v1.patchtext/x-patch; charset=US-ASCII; name=cache-initdb-tap-locking-v1.patchDownload
From db1aadf450a15cbd4041c5f9bf5a0b16f66c2ad5 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 12 Apr 2017 14:03:25 +0800
Subject: [PATCH] Add locking for initdb caching to allow parallel prove

---
 src/test/perl/PostgresNode.pm | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index eef9f66..0eacd41 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -94,6 +94,7 @@ use Socket;
 use Test::More;
 use TestLib ();
 use Scalar::Util qw(blessed);
+use Fcntl qw(:flock);
 
 our @EXPORT = qw(
   get_new_node
@@ -477,15 +478,27 @@ sub _initdb_cached
 	my $pgdata = $self->data_dir;
 
 	my $cachedir = TestLib::tmp_check_dir() . "/initdb_cache";
+	my $lockfile = TestLib::tmp_check_dir() . "/.initdb_cache.lock";
 
 	if (! -d $cachedir)
 	{
 		print(STDERR "initializing initdb cache\n");
-		# initdb into a tempdir, then rename the result so we don't risk
-		# leaving failed initdb results around
-		my $temp_dbdir = TestLib::tempdir("initdb_cache_");
-		$self->_initdb($temp_dbdir, %params);
-		rename($temp_dbdir, $cachedir);
+		open my $lockfh, ">", $lockfile
+			or die "cannot open $lockfile for writing: $1\n";
+		flock($lockfh, LOCK_EX)
+			or die "cannot flock($lockfile) in exclusive mode: $!\n";
+		# If running in parallel with other tests, someone else might've got
+		# the lock first and have swapped an initdb'd directory in already.
+		if (! -d $cachedir)
+		{
+			# initdb into a tempdir, then rename the result so we don't risk
+			# leaving failed initdb results around
+			my $temp_dbdir = TestLib::tempdir("initdb_cache_");
+			$self->_initdb($temp_dbdir, %params);
+			rename($temp_dbdir, $cachedir);
+		}
+		flock($lockfh, LOCK_UN)
+			or die "cannot unlock $lockfile: $!";
 	}
 
 	RecursiveCopy::copypath($cachedir, $pgdata);
-- 
2.5.5

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Craig Ringer (#17)
Re: TAP tests take a long time

Craig Ringer <craig.ringer@2ndquadrant.com> writes:

With
make PROVE_FLAGS="--timer -j 9" check
I don't see much difference with/without caching initdb results -
saves about 4 seconds, from 74 to 70 seconds, but hard to tell with
the error margins.

So if we're going to do anything, defaulting to parallel prove seems a
good start, and I'm not sure caching initdb results is worth it.

Yeah, the results shown in this thread are not very good :-(. I'm
suspicious that the results are basically I/O-bound in your environment.
There might be some value in the cached-initdb-tree solution for slower,
more CPU-bound machines, but cutting out the regprocin overhead would
probably do just as much good there.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Craig Ringer (#16)
Re: TAP tests take a long time

On 04/12/2017 01:27 AM, Craig Ringer wrote:

BTW, I suggest adding --timer to our default PROVE_FLAGS, so we can
collect more data from the buildfarm on what takes a while. Sample
output:

I'll add that to the commandline the buildfarm uses in the upcoming release.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#3)
Re: TAP tests take a long time

On Tue, Apr 11, 2017 at 9:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Apr 11, 2017 at 11:32 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

This buildfarm run as you can see takes 33m32s, and the Tap tests take a
combined 19m52s of that time.

I don't think it's quite fair to complain about the TAP tests taking a
long time to run as a general matter. Many people here have long
wanted a separate test-suite for long running tests that can be run to
really check everything possible, and apparently, TAP tests are it.

What I think would be more useful is to break down where the time is
getting spent. It may be that some of those tests are not adding
value proportionate to their runtime.

The other thing that might be useful here is to push on parallelizing
buildfarm runs. Admittedly that will do nothing for the oldest and
slowest buildfarm critters, but for reasonably modern hardware the
serialization of the tests really handicaps you. We seem to have
fixed that for manual application of "make check-world", at least
if you know the right magic incantations to parallelize it; but
AFAIK the buildfarm script is entirely serial.

BTW, speaking of "value proportionate to runtime", the hash_index
regression script is now visibly a bottleneck in the core regression
tests. Could we take another look at whether that needs to run
quite so long?

I have looked into the tests and I think we can do some optimization
without losing much on code coverage. First is we are doing both
Vacuum Full and Vacuum on hash_split_heap in the same test after
executing few statements, it seems to me that we can avoid doing
Vacuum Full. Also, I think we can try by reducing the number of
inserts in hash_split_heap to see if we can achieve the code coverage
of split bucket code path. Finally, I am not sure if Reindex Index
hash_split_index is required for code coverage, but we might need it
for the sake of testing that operation.

Mithun, as you are the original author of these tests, can you please
try some of the above optimizations and any others you can think of
and see if we can reduce the time for hash index tests?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#20)
1 attachment(s)
Re: TAP tests take a long time

On Wed, Apr 12, 2017 at 6:24 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I have looked into the tests and I think we can do some optimization
without losing much on code coverage. First is we are doing both
Vacuum Full and Vacuum on hash_split_heap in the same test after
executing few statements, it seems to me that we can avoid doing
Vacuum Full. Also, I think we can try by reducing the number of
inserts in hash_split_heap to see if we can achieve the code coverage
of split bucket code path. Finally, I am not sure if Reindex Index
hash_split_index is required for code coverage, but we might need it
for the sake of testing that operation.

Mithun, as you are the original author of these tests, can you please
try some of the above optimizations and any others you can think of
and see if we can reduce the time for hash index tests?

I have tried to optimize the tests maintaining the same coverage we were
able to get with it.

Original coverage

Lines: 1466 2341 62.6 %
Functions: 79 101 78.2 %

After test changes

Lines: 1534 2341 65.5 %
Functions: 79 101 78.2 %

changes done
Earlier :

CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);

+ Time: 8.950 ms

INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
+ Time: 3135.098 ms (00:03.135)

VACUUM FULL hash_split_heap;

+ Time: 2748.146 ms (00:02.748)
REINDEX INDEX hash_split_index;

+ Time: 114.290 ms

Insertion and vacuum full were taking most of the time in tests. I have
removed the vacuum full as suggested by you, which is taken care by vacuum
statement at the end, And reduced the number of rows inserted to trigger a
split, by inserting some records before creating hash index. This way we
start with less number of buckets and thus splits will happen much early.

Now :

+ INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 500) a;

+ Time: 2.946 ms

CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);

! Time: 8.933 ms

! INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 5000) a;

! Time: 81.674 ms

Also removed REINDEX INDEX as suggested which was not adding for coverage.
With this, I think tests should run significantly faster now.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

hash_index_optimize_testruns.patchapplication/octet-stream; name=hash_index_optimize_testruns.patchDownload
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index 0a18efa..097609b 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -200,9 +200,9 @@ SELECT h.seqno AS f20000
 -- Cause some overflow insert and splits.
 --
 CREATE TABLE hash_split_heap (keycol INT);
+INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 500) a;
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
-VACUUM FULL hash_split_heap;
+INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 5000) a;
 -- Let's do a backward scan.
 BEGIN;
 SET enable_seqscan = OFF;
@@ -215,9 +215,8 @@ CLOSE c;
 END;
 -- DELETE, INSERT, REBUILD INDEX.
 DELETE FROM hash_split_heap WHERE keycol = 1;
-INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 50000) a;
+INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
 VACUUM hash_split_heap;
-REINDEX INDEX hash_split_index;
 -- Clean up.
 DROP TABLE hash_split_heap;
 -- Index on temp table.
diff --git a/src/test/regress/sql/hash_index.sql b/src/test/regress/sql/hash_index.sql
index 15a3b06..81b9b05 100644
--- a/src/test/regress/sql/hash_index.sql
+++ b/src/test/regress/sql/hash_index.sql
@@ -156,10 +156,9 @@ SELECT h.seqno AS f20000
 -- Cause some overflow insert and splits.
 --
 CREATE TABLE hash_split_heap (keycol INT);
+INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 500) a;
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
-
-VACUUM FULL hash_split_heap;
+INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 5000) a;
 
 -- Let's do a backward scan.
 BEGIN;
@@ -175,10 +174,9 @@ END;
 
 -- DELETE, INSERT, REBUILD INDEX.
 DELETE FROM hash_split_heap WHERE keycol = 1;
-INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 50000) a;
+INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
 
 VACUUM hash_split_heap;
-REINDEX INDEX hash_split_index;
 
 -- Clean up.
 DROP TABLE hash_split_heap;
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mithun Cy (#21)
Re: TAP tests take a long time

Mithun Cy <mithun.cy@enterprisedb.com> writes:

I have tried to optimize the tests maintaining the same coverage we were
able to get with it.

This patch looks good to me: on my workstation, it reduces the total
runtime of the parallel regression tests from ~20.5 to ~16.5 seconds.
I concur that it doesn't look like it would reduce test coverage
significantly. It gets rid of a VACUUM FULL and a REINDEX, both of
which should be equivalent to an ordinary index build so far as hash is
concerned, and it trims the volume of data being run through the test.

If no objections, I'll push this soon.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Andrew Dunstan (#19)
Re: TAP tests take a long time

On 04/12/2017 08:40 AM, Andrew Dunstan wrote:

On 04/12/2017 01:27 AM, Craig Ringer wrote:

BTW, I suggest adding --timer to our default PROVE_FLAGS, so we can
collect more data from the buildfarm on what takes a while. Sample
output:

I'll add that to the commandline the buildfarm uses in the upcoming release.

See
<https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&amp;dt=2017-04-12%2013%3A54%3A22&gt;

One thing I noticed was this

[10:02:40] t/001_basic.pl ......... ok 320 ms
[10:02:59] t/002_pg_dump.pl ....... ok 19441 ms
[10:03:47] t/010_dump_connstr.pl .. ok 47430 ms

Why does the last test set take 47s?

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Stephen Frost
sfrost@snowman.net
In reply to: Andrew Dunstan (#23)
Re: TAP tests take a long time

Andrew,

* Andrew Dunstan (andrew.dunstan@2ndquadrant.com) wrote:

On 04/12/2017 08:40 AM, Andrew Dunstan wrote:

On 04/12/2017 01:27 AM, Craig Ringer wrote:

BTW, I suggest adding --timer to our default PROVE_FLAGS, so we can
collect more data from the buildfarm on what takes a while. Sample
output:

I'll add that to the commandline the buildfarm uses in the upcoming release.

See
<https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&amp;dt=2017-04-12%2013%3A54%3A22&gt;

One thing I noticed was this

[10:02:40] t/001_basic.pl ......... ok 320 ms
[10:02:59] t/002_pg_dump.pl ....... ok 19441 ms
[10:03:47] t/010_dump_connstr.pl .. ok 47430 ms

Why does the last test set take 47s?

It's doing a bunch of dropdb/createdb stuff, and also creates multiple
clusters. Part of this is because it's generating random names for
databases to make sure they are handled correctly. I'd certainly love
to see it take less time but I've not really done very much with it.

Thanks!

Stephen