Finding cause of test fails on the cfbot site

Started by Peter Smith, almost 5 years ago (15 messages)
#1Peter Smith
smithpb2250@gmail.com

Hi -

I saw that one of our commitfest entries (32/2914) has recently started
reporting a failure on the cfbot site [1]. I thought this was all OK a
few days ago.

We can see the test log indicating which test failed [2]:
Test Summary Report
-------------------
t/002_twophase_streaming.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 2 tests but ran 1.
Files=2, Tests=3, 4 wallclock secs ( 0.03 usr 0.00 sys + 1.59 cusr
0.81 csys = 2.44 CPU)
Result: FAIL
gmake[2]: *** [../../src/makefiles/pgxs.mk:440: check] Error 1
gmake[1]: *** [Makefile:94: check-test_decoding-recurse] Error 2
gmake: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2
*** Error code 2

Is there any other detailed information available anywhere (e.g. logs)
which might help us work out the cause of the test failure?

Thank you.

---
[1]: http://cfbot.cputube.org/
[2]: https://api.cirrus-ci.com/v1/task/5352561114873856/logs/test.log

Kind Regards,
Peter Smith.
Fujitsu Australia

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Smith (#1)
Re: Finding cause of test fails on the cfbot site

Peter Smith <smithpb2250@gmail.com> writes:

I saw that one of our commitfest entries (32/2914) has recently started
reporting a failure on the cfbot site [1]. I thought this was all OK a
few days ago.
...
Is there any other detailed information available anywhere (e.g. logs)
which might help us work out the cause of the test failure?

AFAIK the cfbot doesn't capture anything beyond the session typescript.
However, this doesn't look that hard to reproduce locally ... have you
tried, using similar configure options to what that cfbot run did?
Once you did reproduce it, there'd be logs under
contrib/test_decoding/tmp_check/.

regards, tom lane

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: Finding cause of test fails on the cfbot site

On 2/17/21 11:06 AM, Tom Lane wrote:

Peter Smith <smithpb2250@gmail.com> writes:

I saw that one of our commitfest entries (32/2914) has recently started
reporting a failure on the cfbot site [1]. I thought this was all OK a
few days ago.
...
Is there any other detailed information available anywhere (e.g. logs)
which might help us work out the cause of the test failure?

AFAIK the cfbot doesn't capture anything beyond the session typescript.
However, this doesn't look that hard to reproduce locally ... have you
tried, using similar configure options to what that cfbot run did?
Once you did reproduce it, there'd be logs under
contrib/test_decoding/tmp_check/.

yeah. The cfbot runs check-world which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer grained set of steps.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Andrew Dunstan (#3)
Re: Finding cause of test fails on the cfbot site

On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2/17/21 11:06 AM, Tom Lane wrote:

Peter Smith <smithpb2250@gmail.com> writes:

I saw that one of our commitfest entries (32/2914) has recently started
reporting a failure on the cfbot site [1]. I thought this was all OK a
few days ago.
...
Is there any other detailed information available anywhere (e.g. logs)
which might help us work out the cause of the test failure?

AFAIK the cfbot doesn't capture anything beyond the session typescript.
However, this doesn't look that hard to reproduce locally ... have you
tried, using similar configure options to what that cfbot run did?
Once you did reproduce it, there'd be logs under
contrib/test_decoding/tmp_check/.

yeah. The cfbot runs check-world which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer grained set of steps.

Yeah, it's hard to make it print out just the right logs without
dumping so much stuff that it's hard to see the wood for the trees;
perhaps if the Makefile had an option to dump relevant stuff for the
specific tests that failed, or perhaps the buildfarm is already better
at that and cfbot should just use the buildfarm client directly. Hmm.
Another idea would be to figure out how to make a tarball of all log
files that you can download for inspection with better tools at home
when things go wrong. It would rapidly blow through the 1GB limit for
stored "artefacts" on open source/community Cirrus accounts though, so
we'd need to figure out how to manage retention carefully.
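
(Purely for illustration, a rough sketch of what such a bundling step
could look like, assuming logs live under the usual tmp_check/log
directories; the script itself is hypothetical, not something in the
tree.)

import os
import tarfile

BUILD_ROOT = "."                 # assumed to be the top of the build tree
OUTPUT = "test-logs.tar.gz"      # what a CI task could upload as an artefact

with tarfile.open(OUTPUT, "w:gz") as tar:
    for dirpath, dirnames, filenames in os.walk(BUILD_ROOT):
        # TAP tests leave their logs under .../tmp_check/log
        if dirpath.endswith(os.path.join("tmp_check", "log")):
            tar.add(dirpath)

print("wrote", OUTPUT)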

For what it's worth, I tried to reproduce this particular failure on a couple
of systems, many times, with no luck. It doesn't look like a freak CI
failure (there have been a few random terminations I can't explain
recently, but they look different, I think there was a Google Compute
Engine outage that might explain that), and it failed in exactly the
same way on Linux and FreeBSD. I tried locally on FreeBSD, on top of
commit a975ff4980d60f8cbd8d8cbcff70182ea53e787a (which is what the
last cfbot run did), because it conflicts with a recent change so it
doesn't apply on the tip of master right now.

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#4)
Re: Finding cause of test fails on the cfbot site

On Thu, Feb 18, 2021 at 9:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:

... (there have been a few random terminations I can't explain
recently, but they look different, I think there was a Google Compute
Engine outage that might explain that), ...

There's also occasionally a failure like this[1], on FreeBSD only:

t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
Bailout called. Further testing stopped: system pg_basebackup failed
FAILED--Further testing stopped: system pg_basebackup failed

I have a sneaking suspicion this is a real problem on master, but more
logs will be needed to guess more about that...

[1]: https://cirrus-ci.com/task/5046982650626048?command=test#L733

#6Peter Smith
smithpb2250@gmail.com
In reply to: Thomas Munro (#5)
Re: Finding cause of test fails on the cfbot site

On Thu, Feb 18, 2021 at 9:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Feb 18, 2021 at 9:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:

... (there have been a few random terminations I can't explain
recently, but they look different, I think there was a Google Compute
Engine outage that might explain that), ...

There's also occasionally a failure like this[1], on FreeBSD only:

t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
Bailout called. Further testing stopped: system pg_basebackup failed
FAILED--Further testing stopped: system pg_basebackup failed

I have a sneaking suspicion this is a real problem on master, but more
logs will be needed to guess more about that...

[1] https://cirrus-ci.com/task/5046982650626048?command=test#L733

Thanks for all the effort spent looking into this.

Meanwhile, since you pointed out the patch no longer applies on the tip
of HEAD, I can at least address that.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#4)
Re: Finding cause of test fails on the cfbot site

On 2/17/21 3:42 PM, Thomas Munro wrote:

On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2/17/21 11:06 AM, Tom Lane wrote:

Peter Smith <smithpb2250@gmail.com> writes:

I saw that one of our commitfest entries (32/2914) has recently started
reporting a failure on the cfbot site [1]. I thought this was all OK a
few days ago.
...
Is there any other detailed information available anywhere (e.g. logs)
which might help us work out the cause of the test failure?

AFAIK the cfbot doesn't capture anything beyond the session typescript.
However, this doesn't look that hard to reproduce locally ... have you
tried, using similar configure options to what that cfbot run did?
Once you did reproduce it, there'd be logs under
contrib/test_decoding/tmp_check/.

yeah. The cfbot runs check-world which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer grained set of steps.

Yeah, it's hard to make it print out just the right logs without
dumping so much stuff that it's hard to see the wood for the trees;
perhaps if the Makefile had an option to dump relevant stuff for the
specific tests that failed, or perhaps the buildfarm is already better
at that and cfbot should just use the buildfarm client directly. Hmm.
Another idea would be to figure out how to make a tarball of all log
files that you can download for inspection with better tools at home
when things go wrong. It would rapidly blow through the 1GB limit for
stored "artefacts" on open source/community Cirrus accounts though, so
we'd need to figure out how to manage retention carefully.

I did some thinking about this. How about if we have the makefiles and
the MSVC build system create a well-known file with the location(s) to
search for log files if there's an error? Each bit of testing could
overwrite this file before starting testing, and then tools like cfbot
would know where to look for files to report. To keep things clean for
other users the file would only be created if, say,
PG_NEED_ERROR_LOG_LOCATIONS is set. The well-known location would be
something like "$(top_builddir)/error_log_locations.txt", and individual
Makefiles would have entries something like:

override ERROR_LOG_LOCATIONS =
$(top_builddir)/contrib/test_decoding/tmp_check/log

If this seems like a good idea I can go and try to make that happen.

cheers

andrew

--

Andrew Dunstan
EDB: https://www.enterprisedb.com

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#7)
1 attachment(s)
Re: Finding cause of test fails on the cfbot site

On 2/18/21 11:01 AM, Andrew Dunstan wrote:

On 2/17/21 3:42 PM, Thomas Munro wrote:

On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2/17/21 11:06 AM, Tom Lane wrote:

Peter Smith <smithpb2250@gmail.com> writes:

I saw that one of our commitfest entries (32/2914) has recently started
reporting a failure on the cfbot site [1]. I thought this was all OK a
few days ago.
...
Is there any other detailed information available anywhere (e.g. logs)
which might help us work out the cause of the test failure?

AFAIK the cfbot doesn't capture anything beyond the session typescript.
However, this doesn't look that hard to reproduce locally ... have you
tried, using similar configure options to what that cfbot run did?
Once you did reproduce it, there'd be logs under
contrib/test_decoding/tmp_check/.

yeah. The cfbot runs check-world which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer grained set of steps.

Yeah, it's hard to make it print out just the right logs without
dumping so much stuff that it's hard to see the wood for the trees;
perhaps if the Makefile had an option to dump relevant stuff for the
specific tests that failed, or perhaps the buildfarm is already better
at that and cfbot should just use the buildfarm client directly. Hmm.
Another idea would be to figure out how to make a tarball of all log
files that you can download for inspection with better tools at home
when things go wrong. It would rapidly blow through the 1GB limit for
stored "artefacts" on open source/community Cirrus accounts though, so
we'd need to figure out how to manage retention carefully.

I did some thinking about this. How about if we have the makefiles and
the MSVC build system create a well-known file with the location(s) to
search for log files if there's an error? Each bit of testing could
overwrite this file before starting testing, and then tools like cfbot
would know where to look for files to report. To keep things clean for
other users the file would only be created if, say,
PG_NEED_ERROR_LOG_LOCATIONS is set. The well-known location would be
something like "$(top_builddir)/error_log_locations.txt", and individual
Makefiles would have entries something like:

override ERROR_LOG_LOCATIONS =
$(top_builddir)/contrib/test_decoding/tmp_check/log

If this seems like a good idea I can go and try to make that happen.

here's a very small and simple (and possibly naive)  POC patch that
demonstrates this and seems to do the right thing.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

error_log_locations_POC.patch (text/x-patch)
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 74b3a6acd2..a1339d6ec9 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -390,6 +390,15 @@ endif
 endif
 
 
+.PHONY: set_error_log_locations
+
+set_error_log_locations:
+ifdef PG_NEED_ERROR_LOCATIONS
+ifdef ERROR_LOG_LOCATIONS
+	echo "$(ERROR_LOG_LOCATIONS)" > $(top_builddir)/error_log_locations.txt
+endif
+endif
+
 # Testing
 
 # In much the same way as above, these rules ensure that we build a temp
diff --git a/src/bin/psql/Makefile b/src/bin/psql/Makefile
index d00881163c..c325d76b52 100644
--- a/src/bin/psql/Makefile
+++ b/src/bin/psql/Makefile
@@ -12,6 +12,8 @@
 PGFILEDESC = "psql - the PostgreSQL interactive terminal"
 PGAPPICON=win32
 
+override ERROR_LOG_LOCATIONS = $(abs_top_builddir)/$(subdir)/tmp_check/log
+
 subdir = src/bin/psql
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
@@ -83,7 +85,7 @@ clean distclean:
 maintainer-clean: distclean
 	rm -f sql_help.h sql_help.c psqlscanslash.c
 
-check:
+check: set_error_log_locations
 	$(prove_check)
 
 installcheck:
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index 3d85857bfa..5fbcb5e34d 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -13,6 +13,8 @@
 PGFILEDESC = "pg_regress - test driver"
 PGAPPICON = win32
 
+override ERROR_LOG_LOCATIONS = $(abs_top_builddir)/$(subdir)/log
+
 subdir = src/test/regress
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
@@ -128,10 +130,10 @@ tablespace-setup:
 
 REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 $(EXTRA_REGRESS_OPTS)
 
-check: all tablespace-setup
+check: all set_error_log_locations tablespace-setup
 	$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
 
-check-tests: all tablespace-setup | temp-install
+check-tests: all set_error_log_locations tablespace-setup | temp-install
 	$(pg_regress_check) $(REGRESS_OPTS) $(MAXCONNOPT) $(TESTS) $(EXTRA_TESTS)
 
 installcheck: all tablespace-setup
#9Peter Smith
smithpb2250@gmail.com
In reply to: Thomas Munro (#4)
Re: Finding cause of test fails on the cfbot site

Here is another related question about the cfbot error reporting -

The main cfbot "status page" [1] still shows a couple of fails for the
32/2914 (for freebsd & linux). But looking more closely, those fails
are not from the latest run. e.g. I also found this execution
"history" page [2]https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/32/2914 for our patch which shows the most recent run was
ok for commit a7e715.

~~

So it seems like there is some kind of rule that says the main status
page will still indicate *recent* errors (even if the latest execution
was ok)...

IIUC that explains the difference between a hollow red 'X' (old fail)
and a solid red 'X' (new fail)? And I am guessing if our patch
continues to work ok (for how long?) then that hollow red 'X' may
morph into a solid green 'tick' (when the old fail becomes too old to
care about anymore)?

But those are just my guesses based on those icon tooltips. What
*really* are the rules for those main page status indicators, and how
long do the old failure icons linger around before changing to success
icons? (Apologies if a legend for those icons is already described
somewhere - I didn't find it).

Thanks!

------
[1]: http://cfbot.cputube.org/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/32/2914

Kind Regards,
Peter Smith.
Fujitsu Australia

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Smith (#9)
Re: Finding cause of test fails on the cfbot site

On Sat, Feb 20, 2021 at 10:31 AM Peter Smith <smithpb2250@gmail.com> wrote:

Here is another related question about the cfbot error reporting -

The main cfbot "status page" [1] still shows a couple of fails for the
32/2914 (for freebsd & linux). But looking more closely, those fails
are not from the latest run. e.g. I also found this execution
"history" page [2] for our patch which shows the most recent run was
ok for commit a7e715.

~~

So it seems like there is some kind of rule that says the main status
page will still indicate *recent* errors (even if the latest execution
was ok)...

Hmmph. It seems like there is indeed some kind of occasional glitch,
possibly a bug in my code for picking up statuses from Cirrus through
their GraphQL API (that's a recent thing; we had to change providers
due to another CI's policy changes in January, and apparently
something in the new pipeline isn't quite fully baked). Will look
into that this weekend. Sorry about that, and thanks for letting me
know.

IIUC that explains the difference between a hollow red 'X' (old fail)
and a solid red 'X' (new fail)? And I am guessing if our patch
continues to work ok (for how long?) then that hollow red 'X' may
morph into a solid green 'tick' (when the old fail becomes too old to
care about anymore)?

But those are just my guesses based on those icon tooltips. What
*really* are the rules for those main page status indicators, and how
long do the old failure icons linger around before changing to success
icons? (Apologies if a legend for those icons is already described
somewhere - I didn't find it).

Yeah, I will try to clarify the UI a bit...

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
Re: Finding cause of test fails on the cfbot site

On Sat, Feb 20, 2021 at 11:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Feb 20, 2021 at 10:31 AM Peter Smith <smithpb2250@gmail.com> wrote:

Here is another related question about the cfbot error reporting -

The main cfbot "status page" [1] still shows a couple of fails for the
32/2914 (for freebsd & linux). But looking more closely, those fails
are not from the latest run. e.g. I also found this execution
"history" page [2] for our patch which shows the most recent run was
ok for commit a7e715.

Hmmph. It seems like there is indeed some kind of occasional glitch,
possibly a bug in my code for picking up statuses from Cirrus through
their GraphQL API (that's a recent thing; we had to change providers
due to another CI's policy changes in January, and apparently
something in the new pipeline isn't quite fully baked). Will look
into that this weekend. Sorry about that, and thanks for letting me
know.

Should be fixed now. (Embarrassing detail: I used flock(LOCK_EX) to
prevent two copies of a cron job from running at the same time, but
Python helpfully garbage collected and closed the fd, releasing the
lock. Rarely, the script would run for long enough to have some state
clobbered by the next run, so you wouldn't see the latest result.)
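
(For the curious, a minimal sketch of the pitfall and the fix, assuming
a simple cron-style script; the path and function names are made up for
illustration.)

import fcntl

# Broken: nothing keeps a reference to the file object, so Python may
# garbage-collect it, which closes the fd and silently drops the lock.
def take_lock_broken(path="/tmp/cfbot-cron.lock"):
    fcntl.flock(open(path, "w"), fcntl.LOCK_EX)

# Safer: return the file object and keep it referenced for the whole run,
# so the lock is held until the process exits (or closes it explicitly).
def take_lock(path="/tmp/cfbot-cron.lock"):
    f = open(path, "w")
    fcntl.flock(f, fcntl.LOCK_EX)
    return f

if __name__ == "__main__":
    lock = take_lock()   # hold the reference for the lifetime of the job
    # ... do the cron job's work here ...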

#12Thomas Munro
thomas.munro@gmail.com
In reply to: Andrew Dunstan (#8)
Re: Finding cause of test fails on the cfbot site

On Sat, Feb 20, 2021 at 3:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:

here's a very small and simple (and possibly naive) POC patch that
demonstrates this and seems to do the right thing.

As a small variation that might be more parallelism-friendly, would it
be better to touch a file with a known name in any subdirectory that
contains potentially interesting logs, and rm it when a test succeeds?
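
(To sketch the consuming side of that convention: something like cfbot
could then walk the tree and report any directory whose marker file is
still present. A rough Python illustration, with a made-up marker name,
follows.)

import os

SENTINEL = "test_in_progress"   # hypothetical marker name: touched when a
                                # test starts, removed when it succeeds

def interesting_log_dirs(build_root="."):
    # Directories whose marker was never removed belong to tests that
    # failed or were interrupted, so their logs are worth showing.
    for dirpath, dirnames, filenames in os.walk(build_root):
        if SENTINEL in filenames:
            yield dirpath

if __name__ == "__main__":
    for d in interesting_log_dirs():
        print("logs worth inspecting under:", d)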

#13Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#3)
Re: Finding cause of test fails on the cfbot site

Hi,

On 2021-02-17 15:18:02 -0500, Andrew Dunstan wrote:

yeah. The cfbot runs check-world which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer grained set of steps.

I really think we need a better solution for this across the different
use-cases of running tests. For development parallel check-world is
important for a decent hack-test loop. But I waste a fair bit of time to
scroll back to find the original source of failures. And on the
buildfarm we waste a significant amount of time by limiting parallelism
due to the non-parallel sequence of finer grained steps.

And it's not just about logs - even just easily seeing the first
reported test failure without needing to search through large amounts of
text would be great.

With, um, more modern buildtools (e.g. ninja) you'll at least get the
last failure displayed at the end, instead of seeing a lot of other
things after it like with make.

My suspicion is that, given the need to have this work for both msvc and
make, writing an in-core test-runner script is the only real option to
improve upon the current situation.

For make it'd not be hard to add a recursive 'listchecks' target listing
the individual tests that need to be run. Hacking up vcregress.pl to do
that, instead of what it currently does, shouldn't be too hard either.

Once there's a list of commands that need to be run it's not hard to
write a loop in perl that runs up to N tests in parallel, saving their
output. Which then allows to display the failing test reports at the
end.

If we then also add a convention that each test outputs something like
TESTLOG: path/to/logfile
...
it'd not be hard to add support for the test runner to list the files
that cfbot et al should output.
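
(To make that shape concrete, a rough sketch of such a runner, written
in Python here purely for brevity even though the idea above is to do
it in Perl; the TESTLOG convention is as described, everything else is
made up.)

import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(cmd):
    # Run a single test command, capturing everything it prints.
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    # Per the proposed convention, tests announce their log files with
    # "TESTLOG: path/to/logfile" lines in their output.
    logs = re.findall(r"^TESTLOG: (.+)$", output, re.MULTILINE)
    return cmd, proc.returncode, output, logs

def run_all(commands, jobs=4):
    failures = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for cmd, rc, output, logs in pool.map(run_one, commands):
            print(("ok   " if rc == 0 else "FAIL ") + cmd)
            if rc != 0:
                failures.append((cmd, output, logs))
    # Show the failing tests (and the log files cfbot et al should
    # publish) at the end, after all the other output.
    for cmd, output, logs in failures:
        print("\n==== failed:", cmd)
        print(output)
        for log in logs:
            print("log file worth publishing:", log)
    return len(failures)

if __name__ == "__main__":
    # The command list would come from something like a 'listchecks' target.
    raise SystemExit(run_all(["make -C contrib/test_decoding check"]))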

Looking around the tree, the most annoying bit to implement something
like this is that things below src/bin/, src/interfaces, src/test,
src/pl implement their own check, installcheck targets. Given the number
of these that just boil down to a variant of

check:
	$(pg_regress_check)
	$(prove_check)
installcheck:
	$(pg_regress_installcheck)

it seems we should lift the REGRESS and TAP_TESTS specific logic in
pgxs.mk up into src/Makefile.global, which would then make something
like a global listchecks target easy.

Greetings,

Andres Freund

#14Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#12)
Re: Finding cause of test fails on the cfbot site

On 2/21/21 9:28 PM, Thomas Munro wrote:

On Sat, Feb 20, 2021 at 3:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:

here's a very small and simple (and possibly naive) POC patch that
demonstrates this and seems to do the right thing.

As a small variation that might be more parallelism-friendly, would it
be better to touch a file with a known name in any subdirectory that
contains potentially interesting logs, and rm it when a test succeeds?

Yes, that sounds better, Thanks.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#15Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#13)
Re: Finding cause of test fails on the cfbot site

On 2/21/21 10:34 PM, Andres Freund wrote:

Hi,

On 2021-02-17 15:18:02 -0500, Andrew Dunstan wrote:

yeah. The cfbot runs check-world which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer grained set of steps.

I really think we need a better solution for this across the different
use-cases of running tests. For development parallel check-world is
important for a decent hack-test loop. But I waste a fair bit of time to
scroll back to find the original source of failures. And on the
buildfarm we waste a significant amount of time by limiting parallelism
due to the non-parallel sequence of finer grained steps.

Maybe, but running fast isn't really a design goal of the buildfarm. It's
meant to be automated so it doesn't matter if it takes 10 or 20 or 60
minutes.

Another reason for using fine grained tasks is to be able to
include/exclude them as needed. See the buildfarm's
exclude-steps/only-steps parameters.

That said there is some provision for parallelism, in that multiple
branches and multiple members can be tested at the same time, and
run_branches.pl will manage that fairly nicely for you. See
<https://wiki.postgresql.org/wiki/PostgreSQL_Buildfarm_Howto#Running_in_Parallel>
for details

And it's not just about logs - even just easily seeing the first
reported test failure without needing to search through large amounts of
text would be great.

With, um, more modern buildtools (e.g. ninja) you'll at least get the
last failure displayed at the end, instead of seeing a lot of other
things after it like with make.

My suspicion is that, given the need to have this work for both msvc and
make, writing an in-core test-runner script is the only real option to
improve upon the current situation.

Ok ... be prepared for a non-trivial maintenance cost, however, which
will be borne by those of us fluent in Perl, the only realistic
possibility unless we want to add to the build dependencies. That's far
from everyone.

Part of the problem that this isn't going to solve is the sheer volume
that some tests produce. For example, the pg_dump tests produce about
40k lines / 5MB of log.

For make it'd not be hard to add a recursive 'listchecks' target listing
the individual tests that need to be run. Hacking up vcregress.pl to do
that, instead of what it currently does, shouldn't be too hard either.

Once there's a list of commands that need to be run it's not hard to
write a loop in perl that runs up to N tests in parallel, saving their
output. Which then allows to display the failing test reports at the
end.

If we then also add a convention that each test outputs something like
TESTLOG: path/to/logfile
...
it'd not be hard to add support for the test runner to list the files
that cfbot et al should output.

Yeah, there is code in the buildfarm that contains a lot of building
blocks that can be used for this sort of stuff, see the PGBuild::Utils
and PGBuild::Log modules.

Looking around the tree, the most annoying bit to implement something
like this is that things below src/bin/, src/interfaces, src/test,
src/pl implement their own check, installcheck targets. Given the number
of these that just boil down to a variant of

check:
	$(pg_regress_check)
	$(prove_check)
installcheck:
	$(pg_regress_installcheck)

it seems we should lift the REGRESS and TAP_TESTS specific logic in
pgxs.mk up into src/Makefile.global, which would then make something
like a global listchecks target easy.

Yeah, some of this stuff has grown a bit haphazardly, and maybe needs
some rework.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com