Weird test mixup
I got a weird test failure while testing my forking refactor patches on
Cirrus CI
(https://cirrus-ci.com/task/5880724448870400?logs=test_running#L121):
[16:52:39.753] Summary of Failures:
[16:52:39.753]
[16:52:39.753] 66/73 postgresql:intarray-running / intarray-running/regress ERROR 6.27s exit status 1
[16:52:39.753]
[16:52:39.753] Ok: 72
[16:52:39.753] Expected Fail: 0
[16:52:39.753] Fail: 1
[16:52:39.753] Unexpected Pass: 0
[16:52:39.753] Skipped: 0
[16:52:39.753] Timeout: 0
[16:52:39.753]
[16:52:39.753] Full log written to /tmp/cirrus-ci-build/build/meson-logs/testlog-running.txt
And:
diff -U3 /tmp/cirrus-ci-build/contrib/intarray/expected/_int.out /tmp/cirrus-ci-build/build/testrun/intarray-running/regress/results/_int.out
--- /tmp/cirrus-ci-build/contrib/intarray/expected/_int.out	2024-03-14 16:48:48.690367000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/intarray-running/regress/results/_int.out	2024-03-14 16:52:05.759444000 +0000
@@ -804,6 +804,7 @@
 DROP INDEX text_idx;
 CREATE INDEX text_idx on test__int using gin ( a gin__int_ops );
+ERROR: error triggered for injection point gin-leave-leaf-split-incomplete
 SELECT count(*) from test__int WHERE a && '{23,50}';
  count
 -------
@@ -877,6 +878,7 @@
 (1 row)

 DROP INDEX text_idx;
+ERROR: index "text_idx" does not exist
 -- Repeat the same queries with an extended data set. The data set is the
 -- same that we used before, except that each element in the array is
 -- repeated three times, offset by 1000 and 2000. For example, {1, 5}
Somehow the 'gin-leave-leaf-split-incomplete' injection point was active
in the 'intarray' test. That makes no sense. That injection point is
only used by the test in src/test/modules/gin/. Perhaps that ran at the
same time as the intarray test? But they run in separate instances, with
different data directories. And the 'gin' test passed.
I'm completely stumped. Anyone have a theory?
--
Heikki Linnakangas
Neon (https://neon.tech)
Heikki Linnakangas <hlinnaka@iki.fi> writes:
Somehow the 'gin-leave-leaf-split-incomplete' injection point was active
in the 'intarray' test. That makes no sense. That injection point is
only used by the test in src/test/modules/gin/. Perhaps that ran at the
same time as the intarray test? But they run in separate instances, with
different data directories.
Do they? It'd be fairly easy to explain this if these things were
being run in "installcheck" style. I'm not sure about CI, but from
memory, the buildfarm does use installcheck for some things.
I wonder if it'd be wise to adjust the injection point stuff so that
it's active in only the specific database the injection point was
activated in.
regards, tom lane
On Fri, Mar 15, 2024 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
Somehow the 'gin-leave-leaf-split-incomplete' injection point was active
in the 'intarray' test. That makes no sense. That injection point is
only used by the test in src/test/modules/gin/. Perhaps that ran at the
same time as the intarray test? But they run in separate instances, with
different data directories.
Do they? It'd be fairly easy to explain this if these things were
being run in "installcheck" style. I'm not sure about CI, but from
memory, the buildfarm does use installcheck for some things.
Right, as mentioned here:
/messages/by-id/CA+hUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX==AVnW84f-+8yamQ@mail.gmail.com
That's the "running" test, which is like the old installcheck.
I wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
Somehow the 'gin-leave-leaf-split-incomplete' injection point was active
in the 'intarray' test. That makes no sense. That injection point is
only used by the test in src/test/modules/gin/. Perhaps that ran at the
same time as the intarray test? But they run in separate instances, with
different data directories.
Do they? It'd be fairly easy to explain this if these things were
being run in "installcheck" style. I'm not sure about CI, but from
memory, the buildfarm does use installcheck for some things.
Hmm, Munro's comment yesterday[1] says that current CI does use
installcheck mode in some cases.
regards, tom lane
[1]: /messages/by-id/CA+hUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX==AVnW84f-+8yamQ@mail.gmail.com
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Mar 15, 2024 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do they? It'd be fairly easy to explain this if these things were
being run in "installcheck" style. I'm not sure about CI, but from
memory, the buildfarm does use installcheck for some things.
Right, as mentioned here:
/messages/by-id/CA+hUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX==AVnW84f-+8yamQ@mail.gmail.com
That's the "running" test, which is like the old installcheck.
Hmm. Seems like maybe we need to institute a rule that anything
using injection points has to be marked NO_INSTALLCHECK. That's
kind of a big hammer though.
regards, tom lane
On Thu, Mar 14, 2024 at 06:19:38PM -0400, Tom Lane wrote:
Do they? It'd be fairly easy to explain this if these things were
being run in "installcheck" style. I'm not sure about CI, but from
memory, the buildfarm does use installcheck for some things.
I wonder if it'd be wise to adjust the injection point stuff so that
it's active in only the specific database the injection point was
activated in.
It can be made optional by extending InjectionPointAttach() to
specify a database OID or a database name. Note that
041_checkpoint_at_promote.pl wants an injection point to run in the
checkpointer, where we don't have a database requirement.
Or we could just disable runningcheck because of the concurrency
requirement in this test. The test would still be able to run, just
fewer times.
--
Michael
On Fri, Mar 15, 2024 at 07:53:57AM +0900, Michael Paquier wrote:
It can be made optional by extending InjectionPointAttach() to
specify a database OID or a database name. Note that
041_checkpoint_at_promote.pl wants an injection point to run in the
checkpointer, where we don't have a database requirement.
Slight correction here. It is also possible to not touch
InjectionPointAttach() at all: just tweak the callbacks to do that as
long as the database that should be used is tracked in shmem with its
point name, say with new fields in InjectionPointSharedState. That
keeps the backend APIs in a cleaner state.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Thu, Mar 14, 2024 at 06:19:38PM -0400, Tom Lane wrote:
I wonder if it'd be wise to adjust the injection point stuff so that
it's active in only the specific database the injection point was
activated in.
It can be made optional by extending InjectionPointAttach() to
specify a database OID or a database name. Note that
041_checkpoint_at_promote.pl wants an injection point to run in the
checkpointer, where we don't have a database requirement.
Or we could just disable runningcheck because of the concurrency
requirement in this test. The test would still be able to run, just
fewer times.
No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points. The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation. Side-effects occurring in other
databases are completely not OK.
I can see that some tests would want to be able to inject code
cluster-wide, but I bet that's going to be a small minority.
I suggest that we invent a notion of "global" vs "local"
injection points, where a "local" one only fires in the DB it
was defined in. Then only tests that require a global injection
point need to be NO_INSTALLCHECK.
regards, tom lane
On Thu, Mar 14, 2024 at 07:13:53PM -0400, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
Or we could just disable runningcheck because of the concurrency
requirement in this test. The test would still be able to run, just
fewer times.
No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points. The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation. Side-effects occurring in other
databases are completely not OK.
I really don't want to plug any runtime conditions into the backend
APIs, because there can be so much more that can be done there than
only restricting a callback to a database. I can imagine process type
restrictions, process PID restrictions, etc. So this knowledge should
stick into the test module itself, and be expanded there. That's
easier ABI-wise, as well.
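To make that concrete, here is a rough standalone sketch of module-side conditions. All names here (InjCondition, condition_matches, the fields) are invented for illustration and are not the actual injection_points code; the real thing would keep such a record in the module's shared memory:

```c
/* Hypothetical kinds of restriction a test module could enforce at runtime. */
typedef enum InjConditionType
{
	INJ_CONDITION_ALWAYS,		/* point fires unconditionally */
	INJ_CONDITION_DATABASE,		/* fires only in one database */
	INJ_CONDITION_PID			/* fires only in one process */
} InjConditionType;

typedef struct InjCondition
{
	char		name[64];		/* injection point name */
	InjConditionType type;
	unsigned int dboid;			/* used by INJ_CONDITION_DATABASE */
	int			pid;			/* used by INJ_CONDITION_PID */
} InjCondition;

/*
 * Called from the injection point callback: should this point fire for
 * the current backend?  The backend APIs stay untouched; the check lives
 * entirely in the module.
 */
static int
condition_matches(const InjCondition *cond, unsigned int my_dboid, int my_pid)
{
	switch (cond->type)
	{
		case INJ_CONDITION_DATABASE:
			return cond->dboid == my_dboid;
		case INJ_CONDITION_PID:
			return cond->pid == my_pid;
		default:
			return 1;			/* INJ_CONDITION_ALWAYS */
	}
}
```

The point of this shape is that new restriction kinds (process type, etc.) only grow the enum and the switch, without any backend ABI change.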
I can see that some tests would want to be able to inject code
cluster-wide, but I bet that's going to be a small minority.
I suggest that we invent a notion of "global" vs "local"
injection points, where a "local" one only fires in the DB it
was defined in. Then only tests that require a global injection
point need to be NO_INSTALLCHECK.
Attached is a POC of what could be done. I have extended the module
injection_points so that it is possible to register what I'm calling a
"condition" in the module, defined with a new SQL function.
The condition is stored in shared memory with the point name, then at
runtime the conditions are cross-checked in the callbacks. With the
interface of this patch, the condition should be registered *before* a
point is attached, but this stuff could also be written so that
injection_points_attach() takes an optional argument with a database
name. Or this could use a different, new SQL function, say an
injection_points_attach_local() that registers a condition with
MyDatabaseId on top of attaching the point, doing the whole thing
while holding the spinlock of the shmem state for the module only once.
By the way, modules/gin/ was missing a detach, so the test was
not repeatable with successive installchecks. Adding a pg_sleep of a
few seconds after 'gin-leave-leaf-split-incomplete' is registered
enlarges the window, and the patch avoids failures when running
installcheck in parallel for modules/gin/ and something else using
gin, like contrib/btree_gin/:
while make USE_MODULE_DB=1 installcheck; do :; done
0001 is the condition facility for the module, 0002 is a fix for the
GIN test. Thoughts are welcome.
--
Michael
On 15/03/2024 09:39, Michael Paquier wrote:
On Thu, Mar 14, 2024 at 07:13:53PM -0400, Tom Lane wrote:
I can see that some tests would want to be able to inject code
cluster-wide, but I bet that's going to be a small minority.
I suggest that we invent a notion of "global" vs "local"
injection points, where a "local" one only fires in the DB it
was defined in. Then only tests that require a global injection
point need to be NO_INSTALLCHECK.
Attached is a POC of what could be done. I have extended the module
injection_points so that it is possible to register what I'm calling a
"condition" in the module, defined with a new SQL function.
The condition is stored in shared memory with the point name, then at
runtime the conditions are cross-checked in the callbacks. With the
interface of this patch, the condition should be registered *before* a
point is attached, but this stuff could also be written so that
injection_points_attach() takes an optional argument with a database
name. Or this could use a different, new SQL function, say an
injection_points_attach_local() that registers a condition with
MyDatabaseId on top of attaching the point, doing the whole thing
while holding the spinlock of the shmem state for the module only once.
For the gin test, a single "SELECT injection_points_attach_local()" at
the top of the test file would be most convenient.
If I have to do "SELECT
injection_points_condition('gin-finish-incomplete-split', :'datname');"
for every injection point in the test, I will surely forget it sometimes.
In the 'gin' test, they could actually be scoped to the same backend.
Wrt. the spinlock and shared memory handling, I think this would be
simpler if you could pass some payload in the InjectionPointAttach()
call, which would be passed back to the callback function:
 void
 InjectionPointAttach(const char *name,
 					 const char *library,
-					 const char *function)
+					 const char *function,
+					 uint64 payload)
In this case, the payload would be the "slot index" in shared memory.
Or perhaps always allocate, say, 1024 bytes of working area for every
attached injection point that the test module can use any way it wants.
Like for storing extra conditions, or for the wakeup counter stuff in
injection_wait(). A fixed size working area is a little crude, but would
be very handy in practice.
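Sketched standalone, the fixed working-area idea could look like this (sizes and names are invented here, not taken from the real module, and the array would of course live in shared memory under the module's spinlock rather than as a static):

```c
#include <string.h>

#define INJ_MAX_POINTS		8
#define INJ_NAME_LEN		64
#define INJ_WORKAREA_SIZE	1024

/* One slot per attached point, with a private scratch area the test
 * module can use any way it wants (conditions, wakeup counters, ...). */
typedef struct InjPointSlot
{
	char		name[INJ_NAME_LEN];
	char		workarea[INJ_WORKAREA_SIZE];
} InjPointSlot;

/* Stand-in for the shared-memory slot array. */
static InjPointSlot slots[INJ_MAX_POINTS];

/* Return the work area for a point, claiming a free slot on first use. */
static char *
injection_point_workarea(const char *name)
{
	for (int i = 0; i < INJ_MAX_POINTS; i++)
	{
		if (slots[i].name[0] == '\0')
		{
			strncpy(slots[i].name, name, INJ_NAME_LEN - 1);
			return slots[i].workarea;
		}
		if (strcmp(slots[i].name, name) == 0)
			return slots[i].workarea;
	}
	return NULL;				/* out of slots */
}
```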
By the way, modules/gin/ was missing a detach, so the test was
not repeatable with successive installchecks.
Oops.
It would be nice to automatically detach all the injection points on
process exit. You wouldn't always want that, but I think most tests hold
a session open throughout the test, and for those it would be handy.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 15/03/2024 01:13, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
Or we could just disable runningcheck because of the concurrency
requirement in this test. The test would still be able to run, just
fewer times.
No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points. The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation. Side-effects occurring in other
databases are completely not OK.
I committed a patch to do that, to put out the fire.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 15/03/2024 13:09, Heikki Linnakangas wrote:
On 15/03/2024 01:13, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
Or we could just disable runningcheck because of the concurrency
requirement in this test. The test would still be able to run, just
fewer times.
No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points. The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation. Side-effects occurring in other
databases are completely not OK.
I committed a patch to do that, to put out the fire.
That's turning the buildfarm quite red. Many, but not all animals are
failing like this:
--- /home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/expected/injection_points.out	2024-03-15 12:41:16.363286975 +0100
+++ /home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/results/injection_points.out	2024-03-15 12:53:11.528159615 +0100
@@ -1,118 +1,111 @@
 CREATE EXTENSION injection_points;
+ERROR: extension "injection_points" is not available
+DETAIL: Could not open extension control file "/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/tmp_install/home/buildfarm/hippopotamus/buildroot/HEAD/inst/share/postgresql/extension/injection_points.control": No such file or directory.
+HINT: The extension must first be installed on the system where PostgreSQL is running.
...
Looks like adding NO_INSTALLCHECK somehow affected how the modules are
installed in tmp_install. I'll investigate..
--
Heikki Linnakangas
Neon (https://neon.tech)
On 15/03/2024 14:10, Heikki Linnakangas wrote:
On 15/03/2024 13:09, Heikki Linnakangas wrote:
I committed a patch to do that, to put out the fire.
That's turning the buildfarm quite red. Many, but not all animals are
failing like this:
--- /home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/expected/injection_points.out	2024-03-15 12:41:16.363286975 +0100
+++ /home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/results/injection_points.out	2024-03-15 12:53:11.528159615 +0100
@@ -1,118 +1,111 @@
 CREATE EXTENSION injection_points;
+ERROR: extension "injection_points" is not available
+DETAIL: Could not open extension control file "/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/tmp_install/home/buildfarm/hippopotamus/buildroot/HEAD/inst/share/postgresql/extension/injection_points.control": No such file or directory.
+HINT: The extension must first be installed on the system where PostgreSQL is running.
...
Looks like adding NO_INSTALLCHECK somehow affected how the modules are
installed in tmp_install. I'll investigate..
I think this is a bug in the buildfarm client. In the make_misc_check
step, it does this (reduced to just the interesting parts):
# run the modules that can't be run with installcheck
sub make_misc_check
{
	...
	my @dirs = glob("$pgsql/src/test/modules/* $pgsql/contrib/*");
	foreach my $dir (@dirs)
	{
		next unless -e "$dir/Makefile";
		my $makefile = file_contents("$dir/Makefile");
		next unless $makefile =~ /^NO_INSTALLCHECK/m;
		my $test = basename($dir);

		# skip redundant TAP tests which are called elsewhere
		my @out = run_log("cd $dir && $make $instflags TAP_TESTS= check");
		...
	}
So it scans src/test/modules, and runs "make check" for all
subdirectories that have NO_INSTALLCHECK in the makefile. But the
injection fault tests are also conditional on the
enable_injection_points in the parent Makefile:
ifeq ($(enable_injection_points),yes)
SUBDIRS += injection_points gin
else
ALWAYS_SUBDIRS += injection_points gin
endif
The buildfarm client doesn't pay any attention to that, and runs the
test anyway.
I committed an ugly hack to the subdirectory Makefiles, to turn "make
check" into a no-op if injection points are disabled. Normally when you
run "make check" at the parent level, it doesn't even recurse to the
directories, but this works around the buildfarm script. I hope...
--
Heikki Linnakangas
Neon (https://neon.tech)
Heikki Linnakangas <hlinnaka@iki.fi> writes:
On 15/03/2024 13:09, Heikki Linnakangas wrote:
I committed a patch to do that, to put out the fire.
That's turning the buildfarm quite red. Many, but not all animals are
failing like this:
It may be even worse than it appears from the buildfarm status page.
My animals were stuck in infinite loops that required a manual "kill"
to get out of, and it's reasonable to assume there are others that
will require owner intervention. Why would this test have done that,
if the module failed to load?
regards, tom lane
On 15/03/2024 16:00, Tom Lane wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
On 15/03/2024 13:09, Heikki Linnakangas wrote:
I committed a patch to do that, to put out the fire.
That's turning the buildfarm quite red. Many, but not all animals are
failing like this:It may be even worse than it appears from the buildfarm status page.
My animals were stuck in infinite loops that required a manual "kill"
to get out of, and it's reasonable to assume there are others that
will require owner intervention. Why would this test have done that,
if the module failed to load?
The gin_incomplete_split test inserts rows until it hits the injection
point, at page split. There is a backstop: it should give up after 10000
iterations, but that was broken. Fixed that, thanks for the report!
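In outline, the intended loop shape is something like the following standalone sketch. This is not the actual test code; insert_row here stands in for the test's INSERT that may trigger a page split, and the names are invented:

```c
/* Stand-in for one INSERT; reports whether the injection point fired.
 * In the broken test the "fired" condition could never become true. */
static int
insert_row(int i, int fire_at)
{
	return i == fire_at;
}

/*
 * Insert until the injection point fires, but give up after a fixed
 * number of iterations so the test cannot loop forever when the point
 * never triggers (for example, if the module failed to load).
 */
static int
insert_until_split(int fire_at, int backstop)
{
	for (int i = 0; i < backstop; i++)
	{
		if (insert_row(i, fire_at))
			return i;			/* point fired at iteration i */
	}
	return -1;					/* backstop reached: bail out, fail loudly */
}
```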
Hmm, don't we have any timeout that would kill tests if they get stuck?
--
Heikki Linnakangas
Neon (https://neon.tech)
Heikki Linnakangas <hlinnaka@iki.fi> writes:
On 15/03/2024 16:00, Tom Lane wrote:
It may be even worse than it appears from the buildfarm status page.
My animals were stuck in infinite loops that required a manual "kill"
to get out of, and it's reasonable to assume there are others that
will require owner intervention. Why would this test have done that,
if the module failed to load?
The gin_incomplete_split test inserts rows until it hits the injection
point, at page split. There is a backstop: it should give up after 10000
iterations, but that was broken. Fixed that, thanks for the report!
Duh ...
Hmm, don't we have any timeout that would kill tests if they get stuck?
AFAIK, the only constraint on a buildfarm animal's runtime is the
wait_timeout setting, which is infinite by default, and was on my
machines. (Not anymore ;-).) We do have timeouts in (most?) TAP
tests, but this wasn't a TAP test.
If this is a continuous-insertion loop, presumably it will run the
machine out of disk space eventually, which could be unpleasant
if there are other services running besides the buildfarm.
I think I'll go notify the buildfarm owners list to check for
trouble.
Are there limits on the runtime of CI or cfbot jobs? Maybe
somebody should go check those systems.
regards, tom lane
On Sat, Mar 16, 2024 at 7:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Are there limits on the runtime of CI or cfbot jobs? Maybe
somebody should go check those systems.
Those get killed at a higher level after 60 minutes (configurable but
we didn't change it AFAIK):
https://cirrus-ci.org/faq/#instance-timed-out
It's a fresh virtual machine for each run, and after that it's gone
(well, the ccache directory survives, but only by being
uploaded/downloaded in explicit steps to transmit it between runs).
On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
For the gin test, a single "SELECT injection_points_attach_local()" at the
top of the test file would be most convenient.
If I have to do "SELECT
injection_points_condition('gin-finish-incomplete-split', :'datname');" for
every injection point in the test, I will surely forget it sometimes.
So will I, most likely.. The odds never play in favor of hackers. I
have a few more tests in mind that can be linked to a specific
backend with SQL queries, but I've not been able to get back to it
yet.
Wrt. the spinlock and shared memory handling, I think this would be simpler
if you could pass some payload in the InjectionPointAttach() call, which
would be passed back to the callback function:
In this case, the payload would be the "slot index" in shared memory.
Or perhaps always allocate, say, 1024 bytes of working area for every
attached injection point that the test module can use any way it wants. Like
for storing extra conditions, or for the wakeup counter stuff in
injection_wait(). A fixed size working area is a little crude, but would be
very handy in practice.
Perhaps. I am not sure that we need more than the current signature;
all that can just be handled in some module-specific shmem area. The
key is to be able to link a point name to some state related to it.
Using a hash table would be more efficient, but performance-wise an
array is not going to matter as there will most likely never be more
than 8 points. 4 is already a lot; I'm just doubling that on safety
grounds.
It would be nice to automatically detach all the injection points on process
exit. You wouldn't always want that, but I think most tests hold a session
open throughout the test, and for those it would be handy.
Linking all the points to a PID with an injection_points_attach_local()
that switches a static flag while registering a before_shmem_exit() to
do an automated cleanup sounds like the simplest approach to me based
on what I'm reading on this thread.
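A standalone sketch of that shape follows. All names are invented, and atexit() stands in here for the backend's before_shmem_exit(); injection_point_detach is a stub, not the real detach call:

```c
#include <stdlib.h>

#define MAX_LOCAL_POINTS 8

/* Points attached by this backend, to be detached automatically. */
static const char *local_points[MAX_LOCAL_POINTS];
static int	nlocal_points = 0;
static int	exit_callback_registered = 0;
static int	ndetached = 0;		/* counts detaches, for illustration */

/* Stand-in for the real detach call. */
static void
injection_point_detach(const char *name)
{
	(void) name;
	ndetached++;
}

/* Exit callback: detach everything this backend attached locally. */
static void
detach_local_points(void)
{
	for (int i = 0; i < nlocal_points; i++)
		injection_point_detach(local_points[i]);
	nlocal_points = 0;
}

/*
 * Attach a point as "local": remember it and make sure the cleanup
 * callback runs at process exit (registered only once).
 */
static void
injection_points_attach_local(const char *name)
{
	if (!exit_callback_registered)
	{
		atexit(detach_local_points);	/* before_shmem_exit() in reality */
		exit_callback_registered = 1;
	}
	if (nlocal_points < MAX_LOCAL_POINTS)
		local_points[nlocal_points++] = name;
}
```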
(Just saw the buildfarm storm, wow.)
--
Michael
On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
Linking all the points to a PID with an injection_points_attach_local()
that switches a static flag while registering a before_shmem_exit() to
do an automated cleanup sounds like the simplest approach to me based
on what I'm reading on this thread.
Please find a patch to do exactly that, without touching the backend
APIs. 0001 adds a new function, injection_points_local(), that can
be added on top of a SQL test to make it concurrent-safe. 0002 is the
fix for the GIN tests.
I am going to add an open item to not forget about all that.
Comments are welcome.
--
Michael
On 18 Mar 2024, at 06:04, Michael Paquier <michael@paquier.xyz> wrote:
a new function, injection_points_local(), that can
be added on top of a SQL test to make it concurrent-safe.
Maybe consider a function injection_points_attach_local('point name') instead of the static switch?
Or even injection_points_attach_global('point name'), while injection_points_attach('point name') will be local? This would favour writing concurrent tests by default...
Best regards, Andrey Borodin.