increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS
Hi,
I have two buildfarm animals running tests with -DCLOBBER_CACHE_ALWAYS,
and both of them started having issues with timeouts:
Last file mtime in snapshot: Sat Nov 29 21:41:49 2025 GMT
===================================================
timed out after 14400 secs
The animals are avocet and trilobite, and it's actually a single VM,
with one animal building with gcc and the other with clang.
It's fine that these tests take quite a bit of time, but I'm puzzled why
the runs started to fail. It seems the stats_ext regression test got much
slower at the end of September.
The last couple runs from avocet have these durations:
11:44:42 ago           timedout  [713d9a8]  (01:54:23)
7 days 23:33:36 ago    OK        [bc32a12]  (34:19:40)
16 days 12:19:09 ago   OK        [* 23792]  (34:32:51)
63 days 03:55:34 ago   OK        [91df046]  (31:14:46)
71 days 10:57:06 ago   OK        [9fc7f6a]  (31:51:02)
79 days 05:46:56 ago   OK        [454c046]  (19:35:25)
87 days 00:36:55 ago   OK        [06473f5]  (19:31:16)
94 days 19:50:32 ago   OK        [f727b63]  (19:05:05)
102 days 14:43:13 ago  OK        [ef03ea0]  (19:09:05)
...
So it was ~19h for a while (started at ~14h about 4y back). And then
between September 14 and 22 it jumped to ~32h, which seems like a lot.
And that seems to be due to stats_ext changing from
ok 157 + stats_ext 855865 ms
to
ok 157 + stats_ext 15158426 ms
That's an increase of ~14300 seconds, i.e. ~4h, pretty close to the timeout limit. The
machine is running the test with two locales, so there are multiple runs
of this regression test.
Any ideas what might be causing this? I see exactly the same issue on
the other machine (trilobite). There were no upgrades or other changes
around that time, and no other regression test has this issue.
regards
--
Tomas Vondra
On 12/2/25 17:05, Tomas Vondra wrote:
...
So it was ~19h for a while (started at ~14h about 4y back). And then
between September 14 and 22 it jumped to ~32h, which seems like a lot.
And that seems to be due to stats_ext changing from
ok 157 + stats_ext 855865 ms
to
ok 157 + stats_ext 15158426 ms
That's an increase of ~14300 seconds, i.e. ~4h, pretty close to the timeout limit. The
machine is running the test with two locales, so there are multiple runs
of this regression test.
After bisecting this, it seems to have changed in this commit:
commit 1eccb93150707acfcc8f24556a15742a6313c8ac (HEAD -> stats-ext)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Sep 20 12:44:52 2025 -0400
Re-allow using statistics for bool-valued functions in WHERE.
Commit a391ff3c3, which added the ability for a function's support
function to provide a custom selectivity estimate for "WHERE ...
And it seems to be "stuck" on the last few queries doing this:
SELECT * FROM check_estimated_rows('SELECT * FROM sb_2 WHERE extstat_small(y)');
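(For context, check_estimated_rows is a small plpgsql helper defined earlier
in stats_ext.sql: it runs EXPLAIN ANALYZE on the query string and returns the
estimated and actual row counts from the top plan node. A simplified sketch,
not the verbatim test definition:

CREATE FUNCTION check_estimated_rows(text)
RETURNS TABLE (estimated int, actual int)
LANGUAGE plpgsql AS $$
DECLARE
    ln  text;
    tmp text[];
BEGIN
    -- take the first line of the EXPLAIN ANALYZE output and pull out
    -- "rows=<estimated> ... rows=<actual>" for the top plan node
    FOR ln IN EXECUTE format('EXPLAIN ANALYZE %s', $1) LOOP
        tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
        RETURN QUERY SELECT tmp[1]::int, tmp[2]::int;
        EXIT;
    END LOOP;
END;
$$;

So each call executes the wrapped query in full under EXPLAIN ANALYZE, which
is why a per-row plpgsql function in its WHERE clause hurts so much here.)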
Attached is a perf-report for this, but the interesting part is:
--99.51%--plpgsql_exec_function
exec_toplevel_block.constprop.0
exec_stmt_block
exec_stmts
exec_stmt_return (inlined)
exec_eval_expr
exec_eval_simple_expr (inlined)
|
--99.37%--SPI_plan_get_cached_plan
GetCachedPlan
|
|--96.58%--RevalidateCachedQuery
| |
| |--96.00%--pg_analyze_and_rewrite_withcb
| | parse_analyze_withcb
| | |
| | --96.00%--transformTopLevelStmt (inlined)
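(To make the mechanism concrete: extstat_small is a plpgsql function; purely
as a hypothetical stand-in, not the definition from stats_ext.sql, think of
something as trivial as

-- hypothetical illustration only
CREATE FUNCTION extstat_small(x int) RETURNS bool
LANGUAGE plpgsql AS 'BEGIN RETURN x <= 10; END';

Under -DCLOBBER_CACHE_ALWAYS, every evaluation of the RETURN expression
invalidates and rebuilds its cached plan, which is the RevalidateCachedQuery
-> parse-analyze path above, and that happens once per row scanned by the
EXPLAIN ANALYZE.)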
I'm not sure if this is a problem in the code, or whether it's just that
the test case happens to be this expensive with cache clobbering. Maybe
I simply need to increase the timeout for this machine, so that it does
not fail after 4h? It'd be unfortunate, as the runs are already long,
and this pushes them over 24h ... :-(
regards
--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
On 12/2/25 17:05, Tomas Vondra wrote:
So it was ~19h for a while (started at ~14h about 4y back). And then
between September 14 and 22 it jumped to ~32h. Which seems like a lot.
After bisecting this, it seems to have changed in this commit:
commit 1eccb93150707acfcc8f24556a15742a6313c8ac (HEAD -> stats-ext)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Sep 20 12:44:52 2025 -0400
Re-allow using statistics for bool-valued functions in WHERE.
Ugh.
Attached is a perf-report for this, but the interesting part is:
--99.51%--plpgsql_exec_function
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad. Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
regards, tom lane
I wrote:
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad. Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
I pushed a change for this. On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
BTW, I noticed that neither avocet nor trilobite seem to have
'use_installcheck_parallel' enabled in their BF config files.
That results in the installcheck steps taking longer than the
check step, when they should be the same time or shorter.
You could shave several hours off the animals' runtime by
enabling that.
(I'm not really sure why that's not on by default ... if the
check steps are automatically parallelized, why not installcheck?)
regards, tom lane
On 12/3/25 19:33, Tom Lane wrote:
I wrote:
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad. Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
I pushed a change for this. On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
Thanks!
BTW, I noticed that neither avocet nor trilobite seem to have
'use_installcheck_parallel' enabled in their BF config files.
That results in the installcheck steps taking longer than the
check step, when they should be the same time or shorter.
You could shave several hours off the animals' runtime by
enabling that.
Good idea, I enabled that for both animals.
--
Tomas Vondra
Hello Tomas,
03.12.2025 21:23, Tomas Vondra wrote:
On 12/3/25 19:33, Tom Lane wrote:
I wrote:
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad. Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
I pushed a change for this. On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
Thanks!
That change decreased the test duration on master:
[1] ok 159 + stats_ext 964154 ms
vs
[2] ok 157 + stats_ext 16557572 ms
but the test run still failed (and also fails on other branches, where the
duration is moderate), probably because the overall timeout changed
somehow.
The last good run on avocet, REL_15_STABLE [3] took 12:49:15, however all
the failed runs timed out in 4 hours (14400 seconds):
timedout [7e54eac] (03:59:27)
timedout [7792bdc] (03:59:53)
The difference between the last good run and the first bad one I see is:
'script_version' => 'REL_19_1',
vs
'script_version' => 'REL_20',
though I can't see timeout-related changes in [4], probably something was
changed in the environment during the upgrade...
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2025-12-05%2013%3A21%3A32
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2025-11-24%2016%3A21%3A44
[3] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=avocet&br=REL_15_STABLE
[4] https://github.com/PGBuildFarm/client-code/releases/tag/REL_20
Best regards,
Alexander
On 12/6/25 10:00, Alexander Lakhin wrote:
Hello Tomas,
03.12.2025 21:23, Tomas Vondra wrote:
On 12/3/25 19:33, Tom Lane wrote:
I wrote:
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad. Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
I pushed a change for this. On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
Thanks!
That change decreased the test duration on master:
[1] ok 159 + stats_ext 964154 ms
vs
[2] ok 157 + stats_ext 16557572 ms
but the test run still failed (and also fails on other branches, where the
duration is moderate), probably because the overall timeout changed
somehow.
The last good run on avocet, REL_15_STABLE [3] took 12:49:15, however all
the failed runs timed out in 4 hours (14400 seconds):
timedout [7e54eac] (03:59:27)
timedout [7792bdc] (03:59:53)
The difference between the last good run and the first bad one I see is:
'script_version' => 'REL_19_1',
vs
'script_version' => 'REL_20',
though I can't see timeout-related changes in [4], probably something was
changed in the environment during the upgrade...
Yeah, I noticed that too. But I have no idea what could have changed or
why - the only thing I updated is the buildfarm client :-(
I initially thought it might be due to setting
use_installcheck_parallel => 1
but it still fails after reverting that change. I still have the
build-farm client v19.1 around, so I can try switching back to that.
regards
--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
On 12/6/25 10:00, Alexander Lakhin wrote:
though I can't see timeout-related changes in [4], probably something was
changed in the environment during the upgrade...
Yeah, I noticed that too. But I have no idea what could have changed or
why - the only thing I updated is the buildfarm client :-(
Comparing the configuration dumps from avocet's last successful run
and its latest, I see that last successful shows
'wait_timeout' => 0
but the failing run shows
'wait_timeout' => undef
Per the comments in the sample config file:
# max time in seconds allowed for a single branch run
# undef means default, which is 4 hours. For unlimited specify 0
wait_timeout => undef,
I'd actually recommend a setting of a couple of days for these
animals, rather than "0". We've been known to get BF animals
into infinite loops, and if you use "0" then you will need to
clean up such mistakes manually.
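For instance, something like

wait_timeout => 172800,  # 2 days, in seconds

in the animal's build-farm.conf would do (the exact value is of course
up to you).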
regards, tom lane
On 2025-12-06 Sa 7:04 AM, Tomas Vondra wrote:
On 12/6/25 10:00, Alexander Lakhin wrote:
Hello Tomas,
03.12.2025 21:23, Tomas Vondra wrote:
On 12/3/25 19:33, Tom Lane wrote:
I wrote:
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad. Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
I pushed a change for this. On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
Thanks!
That change decreased the test duration on master:
[1] ok 159 + stats_ext 964154 ms
vs
[2] ok 157 + stats_ext 16557572 ms
but the test run still failed (and also fails on other branches, where the
duration is moderate), probably because the overall timeout changed
somehow.
The last good run on avocet, REL_15_STABLE [3] took 12:49:15, however all
the failed runs timed out in 4 hours (14400 seconds):
timedout [7e54eac] (03:59:27)
timedout [7792bdc] (03:59:53)
The difference between the last good run and the first bad one I see is:
'script_version' => 'REL_19_1',
vs
'script_version' => 'REL_20',
though I can't see timeout-related changes in [4], probably something was
changed in the environment during the upgrade...
Yeah, I noticed that too. But I have no idea what could have changed or
why - the only thing I updated is the buildfarm client :-(
I initially thought it might be due to setting
use_installcheck_parallel => 1
but it still fails after reverting that change. I still have the
build-farm client v19.1 around, so I can try switching back to that.
Nothing of significance changed in what is run between 19.1 and 20.
See https://github.com/PGBuildFarm/client-code/compare/REL_19_1...REL_20
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/6/25 17:08, Tom Lane wrote:
Tomas Vondra <tomas@vondra.me> writes:
On 12/6/25 10:00, Alexander Lakhin wrote:
though I can't see timeout-related changes in [4], probably something was
changed in the environment during the upgrade...
Yeah, I noticed that too. But I have no idea what could have changed or
why - the only thing I updated is the buildfarm client :-(
Comparing the configuration dumps from avocet's last successful run
and its latest, I see that last successful shows
'wait_timeout' => 0
but the failing run shows
'wait_timeout' => undef
Per the comments in the sample config file:
# max time in seconds allowed for a single branch run
# undef means default, which is 4 hours. For unlimited specify 0
wait_timeout => undef,
I'd actually recommend a setting of a couple of days for these
animals, rather than "0". We've been known to get BF animals
into infinite loops, and if you use "0" then you will need to
clean up such mistakes manually.
I'll try, but that's just the default in v20 of the buildfarm client (I
always get the new .sample and update it with the necessary changes,
keeping all the defaults).
regards
--
Tomas Vondra