BUG #18610: llvm error: __aarch64_swp4_acq_rel which could not be resolved

Started by PG Bug reporting form over 1 year ago, 6 messages, bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18610
Logged by: Alexander Kozhemyakin
Email address: a.kozhemyakin@postgrespro.ru
PostgreSQL version: 17rc1
Operating system: ubuntu-24.04
Description:

Hi. On the master branch (3beb945d) built with LLVM, the following query
fails with an error:

CREATE SCHEMA addr_nsp;
CREATE FOREIGN DATA WRAPPER addr_fdw;
CREATE SERVER addr_fserv FOREIGN DATA WRAPPER addr_fdw;
CREATE TEXT SEARCH DICTIONARY addr_ts_dict (template=simple);
CREATE FOREIGN TABLE addr_nsp.genftable (a int) SERVER addr_fserv;
CREATE PUBLICATION addr_pub FOR TABLE addr_nsp.gentable;

select
pg_last_wal_receive_lsn()
from
pg_publication as ref_1
inner join pg_foreign_table as ref_2
inner join pg_aggregate as ref_4
on ( (select partclass from pg_partitioned_table limit 1) < (select
indcollation from pg_index limit 1))
on ( (select 1 from pg_stat_bgwriter limit 1) <> (select total_time from
pg_stat_xact_user_functions limit 1) )
right join pg_aggregate as ref_5 on ( ref_5.aggtransspace <= (9764) );

ERROR: relation "addr_nsp.gentable" does not exist
WARNING: failed to resolve name __aarch64_swp4_acq_rel
FATAL: fatal llvm error: Program used external function
'__aarch64_swp4_acq_rel' which could not be resolved!
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #18610: llvm error: __aarch64_swp4_acq_rel which could not be resolved

On Wed, Sep 11, 2024 at 7:58 PM PG Bug reporting form
<noreply@postgresql.org> wrote:

FATAL: fatal llvm error: Program used external function
'__aarch64_swp4_acq_rel' which could not be resolved!

Hmm, I think it is inlining spinlock code from
pg_last_wal_receive_lsn()'s call to GetWalRcvFlushRecPtr(), and then
failing to find fallbacks for pre-ARMv8.1 systems that didn't have the
LSE atomic instructions. I wonder if there could be some mismatch in
the default -march for parts of your toolchain, so that it doesn't
link the library that has that stuff, but then the clang that built
walreceiverfuncs.bc expects it to be found. I wonder if it goes away
if you add "-mno-outline-atomics" or "-march=armv8.1-a" or
"-march=armv8-a+lse" to BITCODE_CXXFLAGS (not saying that's a fix,
just trying to understand what's happening...). Assuming you're using
GCC, maybe "gcc -Q --help=target | grep march" could show what Ubuntu
24 has set as the baseline, and "clang -### -c -x c /dev/null" might
show what clang is selecting... just ideas, I'm not sure, I don't have
such a system, I just noticed a few distros cranking up the baseline
instruction sets recently...
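
To make the unresolved symbol name easier to read, here is a minimal sketch that decodes it. It assumes the outline-atomics helpers follow the pattern __aarch64_<op><size>_<order>, which is how the libgcc and compiler-rt helpers appear to be named; the function name decode_outline_atomic and the wording of the descriptions are hypothetical, chosen only for illustration.

```python
import re

# Assumed naming scheme of the outline-atomics helper functions:
#   __aarch64_<op><size in bytes>_<memory ordering>
# The operation descriptions below are informal labels, not official names.
_OPS = {
    "cas": "compare-and-swap",
    "swp": "swap (atomic exchange)",
    "ldadd": "fetch-and-add",
    "ldclr": "fetch-and-clear-bits",
    "ldeor": "fetch-and-xor",
    "ldset": "fetch-and-or",
}

_HELPER_RE = re.compile(
    r"^__aarch64_(cas|swp|ldadd|ldclr|ldeor|ldset)"
    r"(1|2|4|8|16)_(relax|acq_rel|acq|rel|sync)$"
)


def decode_outline_atomic(symbol):
    """Split a helper name into operation, operand size and ordering.

    Returns None when the symbol does not look like an outline-atomics
    helper. The pattern is deliberately loose: it does not check which
    sizes are valid for which operation.
    """
    m = _HELPER_RE.match(symbol)
    if m is None:
        return None
    op, size, order = m.groups()
    return {"operation": _OPS[op], "bytes": int(size), "ordering": order}


if __name__ == "__main__":
    print(decode_outline_atomic("__aarch64_swp4_acq_rel"))
```

Read this way, the symbol in the report is a 4-byte atomic exchange with acquire-release ordering, which fits the theory that inlined spinlock code is what needs the helper.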

#3 a.kozhemyakin
a.kozhemyakin@postgrespro.ru
In reply to: Thomas Munro (#2)
Re: BUG #18610: llvm error: __aarch64_swp4_acq_rel which could not be resolved

Thank you, the error no longer occurs after adding -mno-outline-atomics
or -march=armv8.1-a

I installed packages on debian-12 (arm64) from
https://apt.postgresql.org/pub/repos/apt, the error is repeated.

11.09.2024 16:55, Thomas Munro wrote:


Hmm, I think it is inlining spinlock code from
pg_last_wal_receive_lsn()'s call to GetWalRcvFlushRecPtr(), and then
failing to find fallbacks for pre-ARMv8.1 systems that didn't have the
LSE atomic instructions. I wonder if there could be some mismatch in
the default -march for parts of your toolchain, so that it doesn't
link the library that has that stuff, but then the clang that built
walreceiverfuncs.bc expects it to be found. I wonder if it goes away
if you add "-mno-outline-atomics" or "-march=armv8.1-a" or
"-march=armv8-a+lse" to BITCODE_CXXFLAGS (not saying that's a fix,
just trying to understand what's happening...). Assuming you're using
GCC, maybe "gcc -Q --help=target | grep march" could show what Ubuntu
24 has set as the baseline, and "clang -### -c -x c /dev/null" might
show what clang is selecting... just ideas, I'm not sure, I don't have
such a system, I just noticed a few distros cranking up the baseline
instruction sets recently...

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: a.kozhemyakin (#3)
Re: BUG #18610: llvm error: __aarch64_swp4_acq_rel which could not be resolved

On Thu, Sep 12, 2024 at 2:30 PM a.kozhemyakin
<a.kozhemyakin@postgrespro.ru> wrote:

Thank you, the error no longer occurs after adding -mno-outline-atomics or -march=armv8.1-a

I installed packages on debian-12 (arm64) from https://apt.postgresql.org/pub/repos/apt, the error is repeated.

Ahh, I see that we already have a thread proposing to add
-moutline-atomics to our main executable, over here:

/messages/by-id/099F69EE-51D3-4214-934A-1F28C0A1A7A7@amazon.com

If clang has changed to assuming -moutline-atomics now (perhaps you
can see which version of clang started adding that switch by default
using the command I showed earlier, if you have a few versions of clang
around, perhaps from apt.llvm.org), then we either need to stop it
from doing that with -mno-outline-atomics, or compile our executable
(or at least llvmjit.so?) to use that too so the helper library that
defines those functions is linked in. The people who developed that
stuff are, I think, interested in using the faster LSE stuff on modern
phones, while still being able to run apps on the older phones that
are still in circulation, but of course we have the same issue: we
want our stuff to use LSE on Ampere/Graviton etc server chips, while
still being able to run on Raspberry Pi 4 etc. And that'll speed up
many things related to locking in the server itself, not only in this
fairly obscure JIT-inlined thing.

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#4)
Re: BUG #18610: llvm error: __aarch64_swp4_acq_rel which could not be resolved

While hacking on ARM problems this week, I remembered this one.

I see now that this is not new: all Debian clang versions at least as
far back as 14, including the default one on Debian stable, have it on
by default. That is out of sync with their gcc package. So unless
you also set CC to clang, you get:

postgres=# set jit_inline_above_cost = 0;
SET
postgres=# set jit_above_cost = 0;
SET
postgres=# select pg_last_wal_receive_lsn();
WARNING: failed to resolve name __aarch64_swp4_acq_rel
FATAL: fatal llvm error: Program used external function
'__aarch64_swp4_acq_rel' which could not be resolved!

I suspect some other distros might have turned this on in their gcc
too but I am not sure which ones.

Some options:

1. Somehow figure out if $CC is using -moutline-atomics, then make
BITCODE_CFLAGS agree. I haven't thought of a nice way to do it...

2. Tell $CC to use it. That was already kinda rejected in that other
thread, leaving this up to the toolchain people. If we ever change
our minds on that, I guess it would be master only, so it would not help
with this problem.

3. Disable it in BITCODE_CFLAGS, whether or not $CC is using it. It
might be fairly rare for inlined functions to use relevant atomic ops,
so perhaps it's not such a big deal to be conservative here for now.
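
As a rough illustration of options 1 and 3, the sketch below looks for the outline-atomics target feature in the text printed by "clang -### -c -x c /dev/null" and then picks an extra bitcode flag. It is not the attached patch. It assumes that clang reports the setting as a "-target-feature" argument followed by "+outline-atomics" or "-outline-atomics"; the function names are hypothetical, and how to learn what $CC does is left open, so that state is simply passed in as a boolean.

```python
import shlex


def outline_atomics_enabled(driver_output):
    """Inspect the text of a `clang -###` run.

    Returns True or False if a "-target-feature" argument sets
    outline-atomics (assumed spelling: "+outline-atomics" or
    "-outline-atomics"), or None if the feature is never mentioned.
    """
    state = None
    for line in driver_output.splitlines():
        try:
            args = shlex.split(line)
        except ValueError:
            # Not a well-formed quoted command line; skip it.
            continue
        for prev, cur in zip(args, args[1:]):
            if prev == "-target-feature" and cur.lstrip("+-") == "outline-atomics":
                state = cur.startswith("+")  # a later setting overrides an earlier one
    return state


def extra_bitcode_flags(cc_outlines, clang_outlines):
    """Pick an extra bitcode flag so that clang agrees with $CC (option 1).

    Passing cc_outlines=False unconditionally gives the conservative
    behaviour of option 3 whenever clang would outline.
    """
    if clang_outlines and not cc_outlines:
        return ["-mno-outline-atomics"]
    if cc_outlines and not clang_outlines:
        return ["-moutline-atomics"]
    return []
```

For the Debian case described above, where clang outlines and gcc does not, extra_bitcode_flags(False, True) yields -mno-outline-atomics, the same flag that made the error go away earlier in the thread.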

Here's a patch for approach #3. This is low-urgency and a code freeze
begins in a day or two, so I'm *not* proposing it for next week's
release.

Attachments:

0001-jit-Use-mno-outline-atomics-for-bitcode.patch (application/octet-stream, +112 -1)

#6 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#5)
Re: BUG #18610: llvm error: __aarch64_swp4_acq_rel which could not be resolved

On Fri, Nov 8, 2024 at 6:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Here's a patch for approach #3. This is low-urgency and a code freeze
begins in a day or two, so I'm *not* proposing it for next week's
release.

Hearing no objections, I pushed this.