Why is parula failing?
For the last few days, buildfarm member parula has been intermittently
failing the partition_prune regression test, due to unexpected plan
changes [1][2][3][4]. The symptoms can be reproduced exactly by
inserting a "vacuum" of one or another of the partitions of table
"ab", so we can presume that the underlying cause is an autovacuum run
against one of those tables. But why is that happening? None of
those tables receive any insertions during the test, so I don't
understand why autovacuum would trigger on them.
I suppose we could attach "autovacuum=off" settings to these tables,
but it doesn't seem to me that that should be necessary. These test
cases are several years old and haven't given trouble before.
Moreover, if that's necessary then there are a lot of other regression
tests that would presumably need the same treatment.
I'm also baffled why this wasn't happening before. I scraped the
buildfarm logs for 3 months back and confirmed my impression that
this is a new failure mode. But one of these four runs is on
REL_14_STABLE, eliminating the theory that the cause is a recent
HEAD-only change.
Any ideas?
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-19%2016%3A09%3A02
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-18%2011%3A13%3A02
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-14%2011%3A40%3A02
[4]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-14%2019%3A00%3A02
On Tue, 19 Mar 2024 at 20:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:
For the last few days, buildfarm member parula has been intermittently
failing the partition_prune regression test, due to unexpected plan
changes [1][2][3][4]. The symptoms can be reproduced exactly by
inserting a "vacuum" of one or another of the partitions of table
"ab", so we can presume that the underlying cause is an autovacuum run
against one of those tables. But why is that happening? None of
those tables receive any insertions during the test, so I don't
understand why autovacuum would trigger on them.
I suppose we could attach "autovacuum=off" settings to these tables,
but it doesn't seem to me that that should be necessary. These test
cases are several years old and haven't given trouble before.
Moreover, if that's necessary then there are a lot of other regression
tests that would presumably need the same treatment.
I'm also baffled why this wasn't happening before. I scraped the
buildfarm logs for 3 months back and confirmed my impression that
this is a new failure mode. But one of these four runs is on
REL_14_STABLE, eliminating the theory that the cause is a recent
HEAD-only change.
Any ideas?
This may be purely coincidental, but yesterday I also did have a
seemingly random failure in the recovery test suite locally, in
t/027_stream_regress.pl, where it changed the join order of exactly
one of the queries (that uses the tenk table multiple times, iirc 3x
or so). As the work I was doing wasn't related to join order-related
problems, this surprised me. After checking for test file changes
(none), I re-ran the tests without recompilation and the failure went
away, so I attributed this to an untimely autoanalyze. However, as
this was also an unexpected plan change in the tests this could be
related.
Sadly, I did not save the output of that test run, so this is just
about all the information I have. The commit I was testing on was
based on ca108be7, and system config is available if needed.
-Matthias
On Wed, 20 Mar 2024 at 11:50, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Tue, 19 Mar 2024 at 20:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:
For the last few days, buildfarm member parula has been intermittently
failing the partition_prune regression test, due to unexpected plan
changes [1][2][3][4]. The symptoms can be reproduced exactly by
inserting a "vacuum" of one or another of the partitions of table
"ab", so we can presume that the underlying cause is an autovacuum run
against one of those tables. But why is that happening? None of
those tables receive any insertions during the test, so I don't
understand why autovacuum would trigger on them.
This may be purely coincidental, but yesterday I also did have a
seemingly random failure in the recovery test suite locally, in
t/027_stream_regress.pl, where it changed the join order of exactly
one of the queries (that uses the tenk table multiple times, iirc 3x
or so).
[...]
Sadly, I did not save the output of that test run, so this is just
about all the information I have. The commit I was testing on was
based on ca108be7, and system config is available if needed.
It looks like Cirrus CI reproduced my issue with recent commit
a0390f6c [0] and previously also with 4665cebc [1], 875e46a0 [2],
449e798c [3], and other older commits, on both the Windows Server 2019
build and Debian for a39f1a36 (with slightly different plan changes on
this Debian run). That should rule out most of the environments.
After analyzing the logs produced by the various primaries, I can't
find a good explanation why they would have this issue. The table is
vacuum analyzed before the regression tests start, and in this run
autovacuum/autoanalyze doesn't seem to kick in until (at least)
seconds after this query was run.
Kind regards,
Matthias van de Meent
[0]: https://cirrus-ci.com/task/6295909005262848
[1]: https://cirrus-ci.com/task/5229745516838912
[2]: https://cirrus-ci.com/task/5098544567156736
[3]: https://cirrus-ci.com/task/4783906419900416
On Wed, 20 Mar 2024 at 08:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I suppose we could attach "autovacuum=off" settings to these tables,
but it doesn't seem to me that that should be necessary. These test
cases are several years old and haven't given trouble before.
Moreover, if that's necessary then there are a lot of other regression
tests that would presumably need the same treatment.
Is it worth running that animal with log_autovacuum_min_duration = 0
so we can see what's going on in terms of auto-vacuum auto-analyze in
the log?
David
David Rowley <dgrowleyml@gmail.com> writes:
Is it worth running that animal with log_autovacuum_min_duration = 0
so we can see what's going on in terms of auto-vacuum auto-analyze in
the log?
Maybe, but I'm not sure. I thought that if parula were somehow
hitting an ill-timed autovac/autoanalyze, it should be possible to
make that reproducible by inserting "pg_sleep(60)" or so in the test
script, to give the autovac daemon plenty of time to come around and
do the dirty deed. No luck though --- the results didn't change for
me. So now I'm not sure what is going on.
Perhaps though it is autovacuum, and there's some environment-specific
enabling condition that parula has and my machine doesn't (which
could also help explain why no other animal is doing this).
So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
that would yield a clue.
regards, tom lane
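For reference, the setting being suggested would look like this in the animal's postgresql.conf (a sketch; a value of 0 logs every autovacuum and autoanalyze action regardless of duration):

```
# postgresql.conf on the buildfarm animal (sketch):
log_autovacuum_min_duration = 0   # log all autovacuum/autoanalyze actions
```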
On Thu, 21 Mar 2024 at 12:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
that would yield a clue.
FWIW, I agree with your earlier statement about it looking very much
like auto-vacuum has run on that table, but equally, if something like
the pg_index record was damaged we could get the same plan change.
We could also do something like the attached just in case we're
barking up the wrong tree.
David
Attachments:
debug_partition_prune_failures.patch (text/plain; charset=US-ASCII, +38 -0)
David Rowley <dgrowleyml@gmail.com> writes:
We could also do something like the attached just in case we're
barking up the wrong tree.
Yeah, checking indisvalid isn't a bad idea. I'd put another
one further down, just before the DROP of table ab, so we
can see the state both before and after the unstable tests.
regards, tom lane
On Thu, 21 Mar 2024 at 14:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
We could also do something like the attached just in case we're
barking up the wrong tree.
Yeah, checking indisvalid isn't a bad idea. I'd put another
one further down, just before the DROP of table ab, so we
can see the state both before and after the unstable tests.
So it's taken quite a while to finally fail again.
Effectively, we're getting:
    relname     | relpages | reltuples | indisvalid | autovacuum_count | autoanalyze_count
----------------+----------+-----------+------------+------------------+-------------------
- ab_a2_b2      |        0 |        -1 |            |                0 |                 0
+ ab_a2_b2      |        0 |        48 |            |                0 |                 0
I see AddNewRelationTuple() does set reltuples to -1, so I can't quite
figure out why 48 is in there. Even if auto-analyze had somehow
mistakenly run and the autoanalyze_count stats just were not
up-to-date yet, the table has zero blocks, and I don't see how
acquire_sample_rows() would set *totalrows to anything other than 0.0
in this case. For the vacuum case, I see that reltuples is set from:
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
vacrel->scanned_pages,
vacrel->live_tuples);
Again, hard to see how that could come to anything other than zero
given that rel_pages and scanned_pages should be 0.
Looking at the binary representation of a float of -1 vs 48, they're
not nearly the same. 0xBF800000 vs 0x42400000, so it's not looking
like a flipped bit.
It would be good to have log_autovacuum_min_duration = 0 on this
machine for a while.
David
Hi David / Tom,
David Rowley <dgrowleyml@gmail.com> writes:
It would be good to have log_autovacuum_min_duration = 0 on this machine for a while.
- Have set log_autovacuum_min_duration=0 on parula and a test run came out okay.
- Also added REL_16_STABLE to the branches being tested (in case it matters here).
Let me know if I can help with any other changes here.
-
Robins | tharar@ | adelaide@australia
On Tue, 26 Mar 2024 at 21:03, Tharakan, Robins <tharar@amazon.com> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
It would be good to have log_autovacuum_min_duration = 0 on this machine for a while.
- Have set log_autovacuum_min_duration=0 on parula and a test run came out okay.
- Also added REL_16_STABLE to the branches being tested (in case it matters here).
Thanks for doing that.
I see PG_16_STABLE has failed twice already with the same issue. I
don't see any autovacuum / autoanalyze in the log, so I guess that
rules out auto vacuum activity causing this.
Unfortunately, REL_16_STABLE does not have the additional debugging,
so don't get to know what reltuples was set to.
David
David Rowley <dgrowleyml@gmail.com> writes:
Unfortunately, REL_16_STABLE does not have the additional debugging,
so don't get to know what reltuples was set to.
Let's wait a bit to see if it fails in HEAD ... but if not, would
it be reasonable to back-patch the additional debugging output?
regards, tom lane
On Wed, 27 Mar 2024 at 18:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
Unfortunately, REL_16_STABLE does not have the additional debugging,
so don't get to know what reltuples was set to.
Let's wait a bit to see if it fails in HEAD ... but if not, would
it be reasonable to back-patch the additional debugging output?
I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
I'm uncertain what a few more failures in master will tell us aside
from if reltuples == 48 is consistent or if that value is going to
fluctuate.
Let's give it a week and see if it fails a few more times.
David
David Rowley <dgrowleyml@gmail.com> writes:
On Wed, 27 Mar 2024 at 18:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Let's wait a bit to see if it fails in HEAD ... but if not, would
it be reasonable to back-patch the additional debugging output?
I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
I'm uncertain what a few more failures in master will tell us aside
from if reltuples == 48 is consistent or if that value is going to
fluctuate.
Let's give it a week and see if it fails a few more times.
We have another failure today [1] with the same symptom:
ab_a2 | 0 | -1 | | 0 | 0
- ab_a2_b1 | 0 | -1 | | 0 | 0
+ ab_a2_b1 | 0 | 48 | | 0 | 0
ab_a2_b1_a_idx | 1 | 0 | t | |
Different table, same "48" reltuples. But I have to confess that
I'd not looked closely enough at the previous failure, because
now that I have, this is well out in WTFF territory: how can
reltuples be greater than zero when relpages is zero? This can't
be a state that autovacuum would have left behind, unless it's
really seriously broken. I think we need to be looking for
explanations like "memory stomp" or "compiler bug".
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-29%2012%3A46%3A02
I wrote:
I'd not looked closely enough at the previous failure, because
now that I have, this is well out in WTFF territory: how can
reltuples be greater than zero when relpages is zero? This can't
be a state that autovacuum would have left behind, unless it's
really seriously broken. I think we need to be looking for
explanations like "memory stomp" or "compiler bug".
... in connection with which, I can't help noticing that parula
is using a very old compiler:
configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
From some quick checking around, that would have to be near the
beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
to a primary architecture until earlier that same year). It's not
exactly hard to believe that there were some lingering compiler bugs.
I wonder why parula is using that when its underlying system seems
markedly newer (the kernel at least has a recent build date).
regards, tom lane
On Sat, 30 Mar 2024 at 09:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
I'd not looked closely enough at the previous failure, because
now that I have, this is well out in WTFF territory: how can
reltuples be greater than zero when relpages is zero? This can't
be a state that autovacuum would have left behind, unless it's
really seriously broken. I think we need to be looking for
explanations like "memory stomp" or "compiler bug".
... in connection with which, I can't help noticing that parula
is using a very old compiler:
configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
From some quick checking around, that would have to be near the
beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
to a primary architecture until earlier that same year). It's not
exactly hard to believe that there were some lingering compiler bugs.
I wonder why parula is using that when its underlying system seems
markedly newer (the kernel at least has a recent build date).
It could be, but wouldn't the bug have to relate to the locking code
and be caused by some other backend stomping on the memory?
Otherwise, shouldn't it be failing consistently every run rather than
sporadically?
David
David Rowley <dgrowleyml@gmail.com> writes:
On Sat, 30 Mar 2024 at 09:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
... in connection with which, I can't help noticing that parula
is using a very old compiler:
configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
I wonder why parula is using that when its underlying system seems
markedly newer (the kernel at least has a recent build date).
It could be, but wouldn't the bug have to relate to the locking code
and be caused by some other backend stomping on the memory?
Otherwise, shouldn't it be failing consistently every run rather than
sporadically?
Your guess is as good as mine ... but we still haven't seen this
class of failure on any other animal, so the idea that it's strictly
a chance timing issue is getting weaker all the time.
regards, tom lane
... in connection with which, I can't help noticing that parula is using a very old compiler:
configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
From some quick checking around, that would have to be near the beginning of aarch64
support in RHEL (Fedora hadn't promoted aarch64 to a primary architecture until earlier
that same year). It's not exactly hard to believe that there were some lingering compiler bugs.
I wonder why parula is using that when its underlying system seems markedly newer (the kernel at least has a recent build date).
Parula used GCC v7.3.1 since that's what came by default.
I've now switched to GCC v13.2 and triggered a run. Let's see if the tests stabilize now.
-
Robins
I've now switched to GCC v13.2 and triggered a run. Let's see if the tests stabilize now.
So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of which passed on subsequent tries,
some of which were even "signal 6: Aborted".
FWIW, I compiled gcc v13.2 (default options) from source, which IIUC shouldn't be to blame. Two other possible
reasons could be that the buildfarm doesn't have an aarch64 + gcc 13.2 combination (a quick check didn't show
any), or that this hardware is flaky.
Either way, this instance is a throw-away so let me know if this isn't helping. I'll swap it out in case there's not
much benefit to be had.
Also, it'd be great if someone could point me to a way to update the "Compiler" section in "System Detail" on
the buildfarm page (it wrongly shows GCC as v7.3.1).
-
thanks
robins
"Tharakan, Robins" <tharar@amazon.com> writes:
I've now switched to GCC v13.2 and triggered a run. Let's see if the tests stabilize now.
So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of which passed on subsequent tries,
some of which were even "signal 6: Aborted".
Ugh...
Also, it'd be great if someone could point me to a way to update the "Compiler" section in "System Detail" on
the buildfarm page (it wrongly shows GCC as v7.3.1).
The update_personality.pl script in the buildfarm client distro
is what to use to adjust OS version or compiler version data.
regards, tom lane
On Tue, 2 Apr 2024 at 15:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Tharakan, Robins" <tharar@amazon.com> writes:
So although HEAD ran fine, but I saw multiple failures (v12, v13, v16)
all of which passed on subsequent-tries,
of which some were even"signal 6: Aborted".
Ugh...
parula didn't send any reports to buildfarm for the past 44 hours. Logged in
to see that postgres was stuck on pg_sleep(), which was quite odd! I captured
the backtrace and triggered another run on HEAD, which came out okay.
I'll keep an eye on this instance more often for the next few days.
(Let me know if I could capture more if a run gets stuck again)
(gdb) bt
#0 0x0000ffff952ae954 in epoll_pwait () from /lib64/libc.so.6
#1 0x000000000083e9c8 in WaitEventSetWaitBlock (nevents=1,
occurred_events=<optimized out>, cur_timeout=297992, set=0x2816dac0) at
latch.c:1570
#2 WaitEventSetWait (set=0x2816dac0, timeout=timeout@entry=600000,
occurred_events=occurred_events@entry=0xffffc395ed28, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3 0x000000000083ed84 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents@entry=41, timeout=600000,
wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4 0x0000000000907404 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5 0x0000000000696b10 in ExecInterpExpr (state=0x28384040,
econtext=0x28383e38, isnull=<optimized out>) at execExprInterp.c:764
#6 0x00000000006ceef8 in ExecEvalExprSwitchContext (isNull=0xffffc395ee9f,
econtext=0x28383e38, state=<optimized out>) at
../../../src/include/executor/executor.h:356
#7 ExecProject (projInfo=<optimized out>) at
../../../src/include/executor/executor.h:390
#8 ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9 0x00000000006b7aec in ExecProcNode (node=0x28383d28) at
../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x28383b38) at nodeGather.c:287
#11 ExecGather (pstate=0x28383b38) at nodeGather.c:222
#12 0x000000000069aa4c in ExecProcNode (node=0x28383b38) at
../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x2831ffb0,
direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x28383b38, estate=0x28383910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x283239c0, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x000000000086d454 in PortalRunSelect (portal=portal@entry=0x281f0fb0,
forward=forward@entry=true, count=0, count@entry=9223372036854775807,
dest=dest@entry=0x2831ffb0) at pquery.c:924
#16 0x000000000086ec70 in PortalRun (portal=portal@entry=0x281f0fb0,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x2831ffb0,
altdest=altdest@entry=0x2831ffb0, qc=qc@entry=0xffffc395f250) at
pquery.c:768
#17 0x000000000086a944 in exec_simple_query
(query_string=query_string@entry=0x28171c90
"SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x000000000086b480 in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at postgres.c:4680
#19 0x0000000000866a0c in BackendMain (startup_data=<optimized out>,
startup_data_len=<optimized out>) at backend_startup.c:101
#20 0x00000000007c1738 in postmaster_child_launch
(child_type=child_type@entry=B_BACKEND,
startup_data=startup_data@entry=0xffffc395f718
"", startup_data_len=startup_data_len@entry=4,
client_sock=client_sock@entry=0xffffc395f720)
at launch_backend.c:265
#21 0x00000000007c5120 in BackendStartup (client_sock=0xffffc395f720) at
postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x00000000007c6dc8 in PostmasterMain (argc=argc@entry=8,
argv=argv@entry=0x2816d320)
at postmaster.c:1372
#24 0x0000000000496bb8 in main (argc=8, argv=0x2816d320) at main.c:197
The update_personality.pl script in the buildfarm client distro
is what to use to adjust OS version or compiler version data.
Thanks. Fixed that.
-
robins