Several buildfarm animals fail tests because of shared memory error

Started by Alexander Lakhin · 12 months ago · 8 messages
#1 Alexander Lakhin <exclusion@gmail.com>

Hello hackers,

I'd like to bring to your attention multiple buildfarm failures, which
occurred this month, on master only, caused by "could not open shared
memory segment ...: No such file or directory" errors.

The first such errors were produced on 2024-12-16 by:
leafhopper
Amazon Linux 2023 | gcc 11.4.1 | aarch64/graviton4/r8g.2xl | tharar [ a t ] amazon.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2012%3A27%3A01
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2020%3A40%3A09

and batta:
sid | gcc recent | aarch64 | michael [ a t ] paquier.xyz
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2024-12-16%2008%3A05%3A04

Then there was alligator:
Ubuntu 24.04 LTS | gcc experimental (nightly build) | x86_64 | tharakan [ a t ] gmail.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-12-19%2001%3A30%3A57

and parula:
Amazon Linux 2 | gcc 13.2.0 | aarch64/Graviton3/c7g.2xl | tharar [ a t ] amazon.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-12-21%2009%3A56%3A28

Maybe it's a configuration issue (all animals except batta are owned by
Robins), as described here:
https://www.postgresql.org/docs/devel/kernel-resources.html#SYSTEMD-REMOVEIPC
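
For reference, on a systemd-based host the setting in question can be
checked and changed roughly like this (just a sketch based on the page
above; paths and the restart command may differ per distro):

# /etc/systemd/logind.conf -- under the [Login] section set:
#   RemoveIPC=no
grep -i RemoveIPC /etc/systemd/logind.conf    # check the current value
sudo systemctl restart systemd-logind         # apply after editing the file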

And maybe leafhopper is itself faulty, because it also produced very
weird test outputs (in older branches) like:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03
REL_15_STABLE
-               Rows Removed by Filter: 9990
+               Rows Removed by Filter: 447009543
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-21%2022%3A18%3A04
REL_16_STABLE
-               Rows Removed by Filter: 9990
+               Rows Removed by Filter: 9395

But still why master only?

Unfortunately I'm unable to reproduce such failures locally, so I'm sorry
for such raw information, but I see no way to investigate this further
without assistance. Perhaps owners of these animals could shed some light
on this...

Best regards,
Alexander

#2 Robins Tharakan <tharakan@gmail.com>
In reply to: Alexander Lakhin (#1)
Re: Several buildfarm animals fail tests because of shared memory error

Hi Alexander,

Thanks for collating this list.
I'll try to add as much as I know, in hopes that it helps.

On Sun, 22 Dec 2024 at 16:30, Alexander Lakhin <exclusion@gmail.com> wrote:

I'd like to bring to your attention multiple buildfarm failures, which
occurred this month, on master only, caused by "could not open shared
memory segment ...: No such file or directory" errors.

- I am unsure how batta is set up, but until late last week none of my
instances had RemoveIPC set correctly. I am sorry, I didn't know about this
until Thomas pointed it out to me in another thread. So if that's a key
reason here, then probably by this time next week things should settle
down. I've begun setting it correctly (2 done, with a few more to go) -
although, given that some machines are at work, I'll try to get to them this
coming week.

But still why master only?

+1. It is interesting, though, why master is affected more often. This
may just be statistical - since master ends up with more commits and thus
more test runs? Unsure.

Also:
- I recently (~2 days back) switched parula to a gcc-experimental nightly -
after which I see 4 of the recent errors - although the most recent run is
green.
- The only possibly relevant info about leafhopper is that it's one of the
newest machines (Graviton4), so it comes with recent hardware / kernel /
stock gcc 11.4.1.

Unfortunately I'm unable to reproduce such failures locally, so I'm sorry
for such raw information, but I see no way to investigate this further
without assistance. Perhaps owners of these animals could shed some light
on this...

Since the instances are created with work accounts, it isn't trivial to
share access, but I can come back with any outputs / captures if that would
help here.

Lastly, alligator has been on gcc nightly for a few months and is on
x86_64 - so if alligator is still stuttering by this time next week, I'm
pretty sure there's more than just aarch64, gcc, or the IPC config to blame
here.

-
robins

#3 Michael Paquier <michael@paquier.xyz>
In reply to: Alexander Lakhin (#1)
Re: Several buildfarm animals fail tests because of shared memory error

On Sun, Dec 22, 2024 at 08:00:00AM +0200, Alexander Lakhin wrote:

and batta:
sid | gcc recent | aarch64 | michael [ a t ] paquier.xyz
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2024-12-16%2008%3A05%3A04

I suspect that this one has been caused by me, as I logged into
the host around this time last week to update the buildfarm client and
a few more things. And as far as I can see, RemoveIPC was set to "yes"
for the host in /etc/systemd/logind.conf; I have just disabled it.
--
Michael

#4 Alexander Lakhin <exclusion@gmail.com>
In reply to: Robins Tharakan (#2)
1 attachment(s)
Re: Several buildfarm animals fail tests because of shared memory error

Hello Robins,

22.12.2024 09:27, Robins Tharakan wrote:

- The only info about leafhopper may be relevant is that it's one of the newest machines (Graviton4) so it comes with
a recent hardware / kernel / stock gcc 11.4.1.

Could you please take a look at leafhopper, which is producing weird test
failures rather often? For example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03 - REL_15_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-21%2022%3A18%3A04 - REL_16_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-01-02%2009%3A21%3A04 - REL_17_STABLE

--- /home/bf/proj/bf/build-farm-17/REL_16_STABLE/pgsql.build/src/test/regress/expected/select_parallel.out 2024-12-21 
22:18:03.844773742 +0000
+++ /home/bf/proj/bf/build-farm-17/REL_16_STABLE/pgsql.build/src/test/recovery/tmp_check/results/select_parallel.out 
2024-12-21 22:23:28.264849796 +0000
@@ -551,7 +551,7 @@
     ->  Nested Loop (actual rows=98000 loops=1)
           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
                 Filter: (thousand = 0)
-               Rows Removed by Filter: 9990
+               Rows Removed by Filter: 9395
Or:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-18%2023%3A35%3A04 - master
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-01-02%2009%3A22%3A04 - master
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-01-08%2007%3A38%3A03 - master
#   Failed test 'regression tests pass'
#   at t/027_stream_regress.pl line 95.
#          got: '256'
#     expected: '0'
# Looks like you failed 1 test of 9.
[23:42:59] t/027_stream_regress.pl ...............
...
diff -U3 /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out 
/home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out
--- /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out 2024-12-18 23:35:04.318987642 
+0000
+++ /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out 2024-12-18 
23:42:24.806028990 +0000
@@ -179,7 +179,7 @@
                 Hits: 980  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                 ->  Seq Scan on tenk1 t2 (actual rows=1 loops=N)
                       Filter: ((t1.twenty = unique1) AND (t1.two = two))
-                     Rows Removed by Filter: 9999
+                     Rows Removed by Filter: 9775
  (12 rows)

Maybe you could try to reproduce such failures without the buildfarm client, just
by running select_parallel, for example, with the attached patch applied.
I mean running `make check` with a parallel_schedule like:
...
# ----------
# Run these alone so they don't run out of parallel workers
# select_parallel depends on create_misc
# ----------
test: select_parallel
test: select_parallel
test: select_parallel
....
(e.g. with 100 repetitions)

Or
TESTS="test_setup copy create_misc create_index $(printf "select_parallel %.0s" {1..100})" make check-tests

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Attachments:

select_parallel-repeatable.patch (text/x-patch; charset=UTF-8)
#5 Robins Tharakan <tharakan@gmail.com>
In reply to: Alexander Lakhin (#4)
Re: Several buildfarm animals fail tests because of shared memory error

On Thu, 9 Jan 2025 at 15:30, Alexander Lakhin <exclusion@gmail.com> wrote:

Maybe you could try to reproduce such failures without the buildfarm client, just
by running select_parallel, for example, with the attached patch applied.
I mean running `make check` with a parallel_schedule like:
...
Or
TESTS="test_setup copy create_misc create_index $(printf "select_parallel %.0s" {1..100})" make check-tests

Thanks, Alexander, for pointing to the test steps. I'll try to run these on
leafhopper over the next couple of days and come back if I see anything
interesting.

-
robins

#6 Robins Tharakan <tharakan@gmail.com>
In reply to: Robins Tharakan (#2)
Re: Several buildfarm animals fail tests because of shared memory error

Hi Alexander,

On Sun, 22 Dec 2024 at 17:57, Robins Tharakan <tharakan@gmail.com> wrote:

.... So if that's a key reason here, then probably by this time next week
things should settle down. I've begun setting it correctly (2 done, with a
few more to go) - although, given that some machines are at work, I'll try
to get to them this coming week.

All of my machines now have the RemoveIPC config set correctly and have
seemed to work well for the past few days, so ideally we should be good
there.

Unrelated: parula has been failing the libperl test (only v15 and older) for
the past 3 weeks. To clarify, this test started to fail (~18 days ago) before
I fixed the 'RemoveIPC' configuration (~5 days ago), so it is unrelated to
that change.

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=parula&dt=2025-01-09%2003%3A13%3A18&stg=configure

The first REL_15_STABLE test failure points to acd5c28db5 but I didn't see
anything interesting there.

The error seems to be around "annobin.so", so it may be about how gcc is
being compiled (not sure). While I figure out whether my GCC build needs
work, I thought I'd bring it up here, since v16+ works fine on the same box
and we may want to consider doing something similar for all older versions
too?

"
configure:19818: checking for libperl
configure:19834: ccache gcc -o conftest -Wall -Wmissing-prototypes
-Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-deprecated-non-prototype -Wno-format-truncation
-Wno-stringop-truncation -g -O2 -std=gnu17 -fPIC -D_GNU_SOURCE
-I/usr/include/libxml2 -I/usr/lib64/perl5/CORE conftest.c -Wl,-z,relro
-Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -Wl,--build-id=sha1
-fstack-protector-strong -L/usr/local/lib -L/usr/lib64/perl5/CORE -lperl
-lpthread -lresolv -ldl -lm -lcrypt -lutil -lc >&5
cc1: fatal error: inaccessible plugin file
/opt/gcc/prod/bin/../lib/gcc/aarch64-unknown-linux-gnu/15.0.0/plugin/annobin.so
expanded from short plugin name annobin: No such file or directory
compilation terminated.
configure:19834: $? = 1
"

A wild guess is that this may be about the
"-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1" option, which I don't see
in v16+, but I don't know enough to be sure that's the right direction.
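
In case it helps, the plugin directory each compiler searches (which is
where annobin.so would have to live) can be compared with something like
this - the /opt/gcc path is just the one from the error above:

/opt/gcc/prod/bin/gcc -print-file-name=plugin    # the hand-built gcc
gcc -print-file-name=plugin                      # the distro compiler the redhat spec files target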

-
robins

#7 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Robins Tharakan (#6)
Re: Several buildfarm animals fail tests because of shared memory error

Robins Tharakan <tharakan@gmail.com> writes:

Unrelated: parula has been failing the libperl test (only v15 and older) for
the past 3 weeks. To clarify, this test started to fail (~18 days ago) before
I fixed the 'RemoveIPC' configuration (~5 days ago), so it is unrelated to
that change.

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=parula&dt=2025-01-09%2003%3A13%3A18&stg=configure

The first REL_15_STABLE test failure points to acd5c28db5 but I didn't see
anything interesting there.

The error seems to be around "annobin.so" and so it may be about how
gcc is being compiled (not sure). While I figure out if GCC compilation
needs work, I thought to bring it up here since v16+ seems to work fine on
the same box and we may want to consider doing something similar for all
older versions too?

In a failing build (v13) I see

checking for CFLAGS recommended by Perl... -D_REENTRANT -D_GNU_SOURCE -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -march=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fwrapv -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
checking for CFLAGS to compile embedded Perl...
checking for flags to link embedded Perl... -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -Wl,--build-id=sha1 -fstack-protector-strong -L/usr/local/lib -L/usr/lib64/perl5/CORE -lperl -lpthread -lresolv -ldl -lm -lcrypt -lutil -lc

HEAD reports the same "CFLAGS recommended by Perl" but is much more
selective about what it actually adopts:

checking for CFLAGS recommended by Perl... -D_REENTRANT -D_GNU_SOURCE -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -march=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fwrapv -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
checking for CFLAGS to compile embedded Perl...
checking for flags to link embedded Perl... -L/usr/lib64/perl5/CORE -lperl -lpthread -lresolv -ldl -lm -lcrypt -lutil -lc

So it would appear that the link failure is down to those -specs
switches that we uncritically adopted. The change in our behavior
was presumably at

commit b4e936859dc441102eb0b6fb7a104f3948c90490
Author: Peter Eisentraut <peter@eisentraut.org>
Date: Tue Aug 23 16:00:38 2022 +0200

Remove further unwanted linker flags from perl_embed_ldflags

Remove the contents of $Config{ldflags} from ExtUtils::Embed's ldopts,
like we already do with $Config{ccdlflags}. Those flags are the
choices of those who built the Perl installation, which are not
necessarily appropriate for building PostgreSQL. What we really want
from ldopts are the options identifying the location and name of the
libperl library, but unfortunately it doesn't appear possible to get
that separately from the other stuff.

The motivation for this was to strip -mmacosx-version-min options. We
already did something similar for the -arch option. Both of those are
now covered by this more general approach.

This went into v16 and was not back-patched, which is probably wise
because there was at least one followup fix (1c3aa5450) and we still
didn't entirely understand what was happening in the Cygwin build [1].
So I'm hesitant to consider back-patching it just because your
experimental gcc isn't working. At the very least we ought to find
out why it worked up till three weeks ago.
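
(For anyone poking at the affected host, the raw values in play can be
inspected directly, e.g.:

perl -MExtUtils::Embed -e ldopts
perl -MConfig -e 'print "$Config{ldflags}\n"'

The first is what ExtUtils::Embed hands out; since the commit above, v16+
strips the contents of the second out of it before use.)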

regards, tom lane

[1]: /messages/by-id/8c4fcb72-2574-ff7c-4c25-1f032d4a2a57@enterprisedb.com

#8 Robins Tharakan <tharakan@gmail.com>
In reply to: Tom Lane (#7)
Re: Several buildfarm animals fail tests because of shared memory error

On Tue, 14 Jan 2025 at 12:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robins Tharakan <tharakan@gmail.com> writes:

The error seems to be around "annobin.so", so it may be about how gcc is
being compiled (not sure). While I figure out whether my GCC build needs
work, I thought I'd bring it up here, since v16+ works fine on the same box
and we may want to consider doing something similar for all older versions
too?

This went into v16 and was not back-patched, which is probably wise
because there was at least one followup fix (1c3aa5450) and we still
didn't entirely understand what was happening in the Cygwin build [1].
So I'm hesitant to consider back-patching it just because your
experimental gcc isn't working.

Thanks for that background. My goal here was to bring up a use case
where we may want to back-patch, but given that there are more moving
parts and a back-patch isn't trivial, I agree that it may not be worth
the risk - especially given that the scenario is rare (i.e. not many people
would be hand-compiling gcc and running into this issue).

I've paused parula for now and will re-enable the older branches
once it is working properly, or I'll revert to stock gcc as a last resort.

At the very least we ought to find
out why it worked up till three weeks ago.

+1. I'll try to see if GCC is to blame (basically, whether a gcc commit did
this). If that isn't it, another suspect is that I enabled automatic updates
on the box around Nov 24 (I used to apply OS updates manually every few
months), and an automatic update may have messed things up in some
way (unsure).

-
robins