Segfault in jit tuple deforming on arm64 due to LLVM issue
Hi!
I have an instance that started to consistently crash with a segfault or
bus error, and most of the generated core dumps had corrupted stacks.
Some salvageable frames showed the error happening within
ExecRunCompiledExpr. Sure enough, running the query with jit disabled
stopped the crashes. The issue happens with the following setup:
Ubuntu jammy on arm64, 30G
postgresql-14 14.12-1.pgdg22.04+1
libllvm15 1:15.0.7-0ubuntu0.22.04.3
I was able to isolate the impacted database (a pg_dump of the
table was not enough; a base backup had to be used) and reproduce the
issue on a debug build of PostgreSQL. This time there was no crash, but
it was stuck in an infinite loop within jit tuple deforming:
#0 0x0000ec53660aa14c in deform_0_1 ()
#1 0x0000ec53660aa064 in evalexpr_0_0 ()
#2 0x0000ab8f9b322948 in ExecEvalExprSwitchContext
(isNull=0xfffff47c3c87, econtext=0xab8fd0f13878, state=0xab8fd0f13c50)
at executor/./build/../src/include/executor/executor.h:342
#3 ExecProject (projInfo=0xab8fd0f13c48) at
executor/./build/../src/include/executor/executor.h:376
Looking at the generated assembly, the infinite loop happens between
deform_0_1+140 and deform_0_1+188
// Store address page in x11 register
0xec53660aa130 <deform_0_1+132> adrp x11, 0xec53fd308000
// Start of the infinite loop
0xec53660aa138 <deform_0_1+140> adr x8, 0xec53660aa138 <deform_0_1+140>
// Load the content of 0xec53fd308000[x12] in x10, x12 was 0 at that time
0xec53660aa13c <deform_0_1+144> ldrsw x10, [x11, x12, lsl #2]
// Add the loaded offset to x8
0xec53660aa140 <deform_0_1+148> add x8, x8, x10
...
// Branch to the address in x8. Since x10 was 0, x8 has the value
// deform_0_1+140, creating the infinite loop
0xec53660aa168 <deform_0_1+188> br x8
Looking at the content of 0xec53fd308000, we only see zero values
stored at that address.
x/6 0xec53fd308000
0xec53fd308000: 0x00000000 0x00000000 0x00000000 0x00000000
0xec53fd308010: 0x00000000 0x00000000
The assembly matches the code for the find_start switch case in
llvmjit_deform[1]. The content at the address 0xec53fd308000 should
contain the table of offsets from the PC used to branch to the correct
attcheckattnoblocks block. For comparison, if I execute a query not
impacted by the issue (the size of the jit-compiled module seems to be
a factor), I can see that the offset table was correctly filled.
x/6 0xec55fd307000
0xec55fd307000: 0x00000060 0x00000098 0x000000e8 0x00000170
0xec55fd307010: 0x0000022c 0x000002e8
I suspected something was erasing the content of the offset table,
so I checked with rr. However, the table was only initialized and nothing
was ever written at this memory address. I started to suspect a
possible LLVM issue and ran the query against a debug build of
LLVM. It immediately triggered the following assertion[2]:
void llvm::RuntimeDyldELF::resolveAArch64Relocation(const
llvm::SectionEntry &, uint64_t, uint64_t, uint32_t, int64_t):
Assertion `isInt<33>(Result) && "overflow check failed for
relocation"' failed.
This happens when LLVM is resolving relocations.
#5 __GI___assert_fail (assertion=0xf693f214771a "isInt<33>(Result) &&
\"overflow check failed for relocation\"", file=0xf693f2147269
"/var/lib/postgresql/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp",
line=507, function=0xf693f214754f "void
llvm::RuntimeDyldELF::resolveAArch64Relocation(const
llvm::SectionEntry &, uint64_t, uint64_t, uint32_t, int64_t)") at
./assert/assert.c:101
#6 llvm::RuntimeDyldELF::resolveAArch64Relocation () at
/var/lib/postgresql/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:507
#7 llvm::RuntimeDyldELF::resolveRelocation () at
/var/lib/postgresql/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:1044
#8 llvm::RuntimeDyldELF::resolveRelocation () at
/var/lib/postgresql/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:1026
#9 llvm::RuntimeDyldImpl::resolveRelocationList () at
/var/lib/postgresql/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1112
#10 llvm::RuntimeDyldImpl::resolveLocalRelocations () at
/var/lib/postgresql/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:157
#11 llvm::RuntimeDyldImpl::finalizeAsync() at
/var/lib/postgresql/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1247
During the assertion failure, I have the following values:
Value: 0xfbc84fab9000
FinalAddress: 0xfbc5b9cea12c
Addend: 0x0
Result: 0x295dcf000
The result doesn't fit in a signed 33-bit integer, triggering the assert.
Looking at the sections created by LLVM in allocateSection[3], we have
3 sections created:
.text {Address = 0xfbc5b9cea000, AllocatedSize = 90112}
.rodata {Address = 0xfbc84fab9000, AllocatedSize = 4096}
.eh_frame {Address = 0xfbc84fab7000, AllocatedSize = 8192}
When resolving the relocation, the difference between the .rodata section
and the PC is computed and stored in the ADRP instruction. However,
when a new section is allocated, LLVM requests a new memory block
from the memory allocator[4]. MemGroup.Near is passed as the start
hint of mmap, but that's only a hint and the kernel doesn't provide any
guarantee that the newly allocated block will be near it. With the impacted
query, there is more than a 10GB gap between the .text section and
the .rodata section, making it impossible for the code in the .text
section to correctly fetch data from the .rodata section, as the
address in ADRP is limited to a +/-4GB range.
The ABI mentions that the GOT section should be within 4GB of the
text section[5]. Though in this case there's no GOT section, as the
offsets are stored in the .rodata section, the constraint is similar.
This is a known LLVM issue[6] that has impacted Impala, Numba and
Julia. There's an open PR[7] to fix the issue by allocating all
sections as a single memory block, avoiding the gaps between sections.
There's also a related discussion on this on the llvm-rtdyld
Discourse[8].
A possible mitigation is to switch from RuntimeDyld to JITLink, but
this requires at least LLVM 15, as LLVM 14 doesn't have any significant
relocation support for aarch64[9]. I did test using JITLink on my
impacted db and it seems to fix the issue. JITLink has no exposed C
interface though, so it requires additional wrapping.
I don't necessarily have a good answer for this issue. I've tried to
tweak relocation settings and the jit code to avoid relocations, without
much success. Ideally, the fix will be merged and backported in LLVM,
but the PR has been open for some time now. I've seen multiple
segfault reports that look similar to this issue (examples: [10], [11])
but I don't think they were linked to the LLVM bug, so I figured I would
at least share my findings.
[1]: https://github.com/postgres/postgres/blob/REL_14_STABLE/src/backend/jit/llvm/llvmjit_deform.c#L364-L382
[2]: https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp#L501-L513
[3]: https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/SectionMemoryManager.cpp#L41C32-L41C47
[4]: https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/SectionMemoryManager.cpp#L94-L110
[5]: https://github.com/ARM-software/abi-aa/blob/main/sysvabi64/sysvabi64.rst#7code-models
[6]: https://github.com/llvm/llvm-project/issues/71963
[7]: https://github.com/llvm/llvm-project/pull/71968
[8]: https://discourse.llvm.org/t/llvm-rtdyld-aarch64-abi-relocation-restrictions/74616
[9]: https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/JITLink/ELF_aarch64.cpp#L75-L84
[10]: /messages/by-id/CABa+nRvwZy_5t1QF9NJNGwAf03tv_PO_Sg1FsN1+-3Odb1XgBA@mail.gmail.com
[11]: /messages/by-id/CADAf1kavcN-kY=vEm3MYxhUa+rtGFs7tym5d7Ee6Ni2cwwxGqQ@mail.gmail.com
Regards,
Anthonin Bonnefoy
On Thu, Aug 22, 2024 at 7:22 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
> Ideally, the llvm fix will be merged and backported
> in llvm but the PR has been open for some time now.
I fear that back-porting, for the LLVM project, would mean "we fix it
in main/20.x, and also back-port it to 19.x". Do distros back-port
further?
Nice detective work!
The JITLink change sounds interesting, and like something we need to
do sooner or later.
On Thu, Aug 22, 2024 at 12:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I fear that back-porting, for the LLVM project, would mean "we fix it
> in main/20.x, and also back-port it to 19.x". Do distros back-port
> further?
That's also my fear. I'm not familiar with distros' back-port policies,
but eyeballing the Ubuntu package changelog[1], it seems to be mostly
build fixes.
Given that there's no visible way to fix the relocation issue, I
wonder if jit shouldn't be disabled for arm64 until either the
RuntimeDyld fix is merged or the switch to JITLink is done. Disabling
jit tuple deforming may be enough but I'm not confident the issue
won't happen in a different part.
[1]: https://launchpad.net/ubuntu/+source/llvm-toolchain-16/+changelog
On Sat, Aug 24, 2024 at 12:22 AM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
> On Thu, Aug 22, 2024 at 12:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I fear that back-porting, for the LLVM project, would mean "we fix it
> > in main/20.x, and also back-port it to 19.x". Do distros back-port
> > further?
>
> That's also my fear, I'm not familiar with distros back-port policy
> but eyeballing ubuntu package changelog[1], it seems to be mostly
> build fixes.
>
> Given that there's no visible way to fix the relocation issue, I
> wonder if jit shouldn't be disabled for arm64 until either the
> RuntimeDyld fix is merged or the switch to JITLink is done. Disabling
> jit tuple deforming may be enough but I'm not confident the issue
> won't happen in a different part.
We've experienced something a little similar before: In the early days
of PostgreSQL LLVM, it didn't work at all on ARM or POWER. We sent a
trivial fix[1] upstream that landed in LLVM 7; since it was a small
and obvious problem and it took a long time for some distros to ship
LLVM 7, we even contemplated hot-patching that LLVM function with our
own copy (but, ugh, only for about 7 nanoseconds). That was before we
turned JIT on by default, and was also easier to deal with because it
was an obvious consistent failure in basic tests, so packagers
probably just disabled the build option on those architectures. IIUC
this one is a random and rare crash depending on malloc() and perhaps
also the working size of your virtual memory dart board. (Annoyingly,
I had tried to reproduce this quite a few times on small ARM systems
when earlier reports came in, d'oh!).
This degree of support window mismatch is probably what triggered RHEL
to develop their new rolling LLVM version policy. Unfortunately, it's
the other distros that tell *us* which versions to support, and not
the reverse (for example CF #4920 is about to drop support for LLVM <
14, but that will only be for PostgreSQL 18+).
Ultimately, if it doesn't work, and doesn't get fixed, it's hard for
us to do much about it. But hmm, this is probably madness... I wonder
if it would be feasible to detect address span overflow ourselves at a
useful time, as a kind of band-aid defence...
[1]: /messages/by-id/CAEepm=39F_B3Ou8S3OrUw+hJEUP3p=wCu0ug-TTW67qKN53g3w@mail.gmail.com
On Mon, Aug 26, 2024 at 4:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> IIUC this one is a random and rare crash depending on malloc() and
> perhaps also the working size of your virtual memory dart board.
> (Annoyingly, I had tried to reproduce this quite a few times on small ARM
> systems when earlier reports came in, d'oh!).
allocateMappedMemory, used when creating sections, will eventually call
mmap[1], not malloc. So the amount of shared memory configured may be
a factor in triggering the issue.
My first attempts to reproduce the issue from scratch weren't
successful either. However, trying again with different values of
shared_buffers, I've managed to trigger the issue somewhat reliably.
On a clean Ubuntu jammy, I've compiled the current PostgreSQL
REL_14_STABLE (6bc2bfc3) with the following options:
CLANG=clang-14 ../configure --enable-cassert --enable-debug --prefix
~/.local/ --with-llvm
Set "shared_buffers = '4GB'" in the configuration. More may be needed
but 4GB was enough for me.
Create a table with multiple partitions with pgbench. The goal is to
have a jit module big enough to trigger the issue.
pgbench -i --partitions=64
Then run the following query with jit forcefully enabled:
psql options=-cjit_above_cost=0 -c 'SELECT count(bid) from pgbench_accounts;'
If the issue was successfully triggered, it should segfault or be
stuck in an infinite loop.
> Ultimately, if it doesn't work, and doesn't get fixed, it's hard for
> us to do much about it. But hmm, this is probably madness... I wonder
> if it would be feasible to detect address span overflow ourselves at a
> useful time, as a kind of band-aid defence...
There's a possible alternative, but it's definitely in the same
category as the hot-patching idea. llvmjit uses
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager to
create the ObjectLinkingLayer, and it will be created with the default
SectionMemoryManager[2]. It should be possible to provide a modified
SectionMemoryManager with the change to allocate sections in a single
block, and it could be restricted to the arm64 architecture. A part of me
tells me this is probably a bad idea, but on the other hand, LLVM
provides this way to plug in a custom allocator and it would fix the
issue...
[1]: https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/Support/Unix/Memory.inc#L115-L117
[2]: https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/Orc/OrcV2CBindings.cpp#L967-L973
On Tue, Aug 27, 2024 at 2:16 AM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
> There's a possible alternative, but it's definitely in the same
> category as the hot-patching idea. llvmjit uses
> LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager to
> create the ObjectLinkingLayer and it will be created with the default
> SectionMemoryManager[2]. It should be possible to provide a modified
> SectionMemoryManager with the change to allocate sections in a single
> block and it could be restricted to arm64 architecture. A part of me
> tells me this is probably a bad idea but on the other hand, LLVM
> provides this way to plug a custom allocator and it would fix the
> issue...
Interesting. Here is a quick hack to experiment with injecting a new
memory manager. This one just wraps the normal one and logs the
addresses it allocates, but from here, you're right, we could try to
constrain its address range somehow (or perhaps just check its range
and fail gracefully).
Attachment: 0001-XXX-replace-LLVM-SectionMemoryManager.patch
From 17e3979bd2cad997c853ea6ceae5879ac977e014 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 08:58:48 +1200
Subject: [PATCH] XXX replace LLVM SectionMemoryManager
---
src/backend/jit/llvm/llvmjit.c | 3 ++-
src/backend/jit/llvm/llvmjit_wrap.cpp | 23 +++++++++++++++++++++++
src/include/jit/llvmjit.h | 2 ++
3 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 0f6cec52496..23803172575 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -1279,7 +1279,8 @@ static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES, const char *Triple)
{
LLVMOrcObjectLayerRef objlayer =
- LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
+ LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
+
#if defined(HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER) && HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
if (jit_debugging_support)
diff --git a/src/backend/jit/llvm/llvmjit_wrap.cpp b/src/backend/jit/llvm/llvmjit_wrap.cpp
index 641c8841ca3..f4a35325482 100644
--- a/src/backend/jit/llvm/llvmjit_wrap.cpp
+++ b/src/backend/jit/llvm/llvmjit_wrap.cpp
@@ -17,11 +17,15 @@ extern "C"
}
#include <llvm-c/Core.h>
+#include <llvm-c/OrcEE.h>
/* Avoid macro clash with LLVM's C++ headers */
#undef Min
+#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
+#include <llvm/ExecutionEngine/SectionMemoryManager.h>
#include <llvm/IR/Function.h>
+#include <llvm/Support/CBindingWrapping.h>
#include "jit/llvmjit.h"
@@ -41,3 +45,22 @@ LLVMGetFunctionType(LLVMValueRef r)
{
return llvm::wrap(llvm::unwrap<llvm::Function>(r)->getFunctionType());
}
+
+struct SafeSectionMemoryManager : llvm::SectionMemoryManager {
+ uint8_t *allocateCodeSection (uintptr_t Size, unsigned Alignment, unsigned SectionID, llvm::StringRef SectionName) override {
+ uint8_t *result;
+ result = SectionMemoryManager::allocateCodeSection(Size, Alignment, SectionID, SectionName);
+ elog(NOTICE, "allocateCodeSection returning %p", result);
+ return result;
+ }
+};
+
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef);
+
+LLVMOrcObjectLayerRef
+LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES)
+{
+ return wrap(new llvm::orc::RTDyldObjectLinkingLayer(
+ *unwrap(ES), [] { return std::make_unique<SafeSectionMemoryManager>(); }));
+}
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 420775b1899..99c6fd15849 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -18,6 +18,7 @@
#ifdef USE_LLVM
#include <llvm-c/Types.h>
+#include <llvm-c/OrcEE.h>
/*
@@ -135,6 +136,7 @@ extern LLVMValueRef slot_compile_deform(struct LLVMJitContext *context, TupleDes
*/
extern LLVMTypeRef LLVMGetFunctionReturnType(LLVMValueRef r);
extern LLVMTypeRef LLVMGetFunctionType(LLVMValueRef r);
+extern LLVMOrcObjectLayerRef LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef RS);
#ifdef __cplusplus
} /* extern "C" */
--
2.39.3 (Apple Git-146)
Here is an experimental attempt to steal the SectionMemoryManager from
https://github.com/llvm/llvm-project/pull/71968, rename it to
SafeSectionMemoryManager, and inject it as shown in the previous patch.
Another approach might be to make a new class that derives from
SectionMemoryManager and adjusts minimal bits and pieces, but I figured
it would be easier to diff against their code if we take the whole
file. Hmm, I guess if "diff" convenience is the driving factor, it
might be better to use a different namespace instead of a different
name...
I am sure this requires changes for various LLVM versions. I tested
it with LLVM 14 on a Mac, where I've never managed to reproduce the
original complaint, but ... ooooh, this might be exacerbated by ASLR,
and macOS only has a small ASLR slide window (16M or 256M apparently,
according to me in another thread), so I'd probably have to interpose
my own mmap() to choose some more interesting addresses, or run some
other OS, but that's quite enough rabbit holes for one morning.
Attachment: v2-0001-XXX-LLVM-ARM-relocation-bug-mitigation.patch
From 43670431d5aed4a19598bc0e0b4c601eb9fe3773 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 08:58:48 +1200
Subject: [PATCH v2] XXX LLVM ARM relocation bug mitigation.
The patched code from https://github.com/llvm/llvm-project/pull/71968,
moved into a new class SafeSectionMemoryManager, adjusted to work on LLVM < 16,
and used in place of the regular memory manager.
XXX experimental
XXX this may be a terrible idea
XXX several details include #include directives would need adjustment
for prehistoric LLVM versions
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/CAO6_Xqr63qj%3DSx7HY6ZiiQ6R_JbX%2B-p6sTPwDYwTWZjUmjsYBg%40mail.gmail.com
---
src/backend/jit/llvm/Makefile | 3 +-
.../jit/llvm/SafeSectionMemoryManager.cpp | 386 ++++++++++++++++++
src/backend/jit/llvm/llvmjit.c | 3 +-
src/backend/jit/llvm/llvmjit_wrap.cpp | 20 +
src/backend/jit/llvm/meson.build | 1 +
src/include/jit/SafeSectionMemoryManager.h | 225 ++++++++++
src/include/jit/llvmjit.h | 2 +
7 files changed, 638 insertions(+), 2 deletions(-)
create mode 100644 src/backend/jit/llvm/SafeSectionMemoryManager.cpp
create mode 100644 src/include/jit/SafeSectionMemoryManager.h
diff --git a/src/backend/jit/llvm/Makefile b/src/backend/jit/llvm/Makefile
index bfe5c207a2b..d7aeaff4472 100644
--- a/src/backend/jit/llvm/Makefile
+++ b/src/backend/jit/llvm/Makefile
@@ -47,7 +47,8 @@ OBJS += \
llvmjit.o \
llvmjit_error.o \
llvmjit_inline.o \
- llvmjit_wrap.o
+ llvmjit_wrap.o \
+ SafeSectionMemoryManager.o
# Code generation
OBJS += \
diff --git a/src/backend/jit/llvm/SafeSectionMemoryManager.cpp b/src/backend/jit/llvm/SafeSectionMemoryManager.cpp
new file mode 100644
index 00000000000..88d7f43f621
--- /dev/null
+++ b/src/backend/jit/llvm/SafeSectionMemoryManager.cpp
@@ -0,0 +1,386 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to SafeSectionMemoryManager, so we can support the ARM
+ * memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.cpp - Memory manager for MCJIT/RtDyld *- C++ -*-==//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements the section-based memory manager used by the MCJIT
+// execution engine and RuntimeDyld
+//
+//===----------------------------------------------------------------------===//
+
+#include "jit/SafeSectionMemoryManager.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Process.h"
+
+namespace llvm {
+
+bool SafeSectionMemoryManager::hasSpace(const MemoryGroup &MemGroup,
+ uintptr_t Size) const {
+ for (const FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= Size)
+ return true;
+ }
+ return false;
+}
+
+#if LLVM_VERSION_MAJOR < 16
+void SafeSectionMemoryManager::reserveAllocationSpace(uintptr_t CodeSize,
+ uint32_t CodeAlign_i,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign_i,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign_i) {
+ Align CodeAlign(CodeAlign_i);
+ Align RODataAlign(RODataAlign_i);
+ Align RWDataAlign(RWDataAlign_i);
+#else
+void SafeSectionMemoryManager::reserveAllocationSpace(
+ uintptr_t CodeSize, Align CodeAlign, uintptr_t RODataSize,
+ Align RODataAlign, uintptr_t RWDataSize, Align RWDataAlign) {
+#endif
+ if (CodeSize == 0 && RODataSize == 0 && RWDataSize == 0)
+ return;
+
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ // Code alignment needs to be at least the stub alignment - however, we
+ // don't have an easy way to get that here so as a workaround, we assume
+ // it's 8, which is the largest value I observed across all platforms.
+ constexpr uint64_t StubAlign = 8;
+ CodeAlign = Align(std::max(CodeAlign.value(), StubAlign));
+ RODataAlign = Align(std::max(RODataAlign.value(), StubAlign));
+ RWDataAlign = Align(std::max(RWDataAlign.value(), StubAlign));
+
+ // Get space required for each section. Use the same calculation as
+ // allocateSection because we need to be able to satisfy it.
+ uint64_t RequiredCodeSize = alignTo(CodeSize, CodeAlign) + CodeAlign.value();
+ uint64_t RequiredRODataSize =
+ alignTo(RODataSize, RODataAlign) + RODataAlign.value();
+ uint64_t RequiredRWDataSize =
+ alignTo(RWDataSize, RWDataAlign) + RWDataAlign.value();
+
+ if (hasSpace(CodeMem, RequiredCodeSize) &&
+ hasSpace(RODataMem, RequiredRODataSize) &&
+ hasSpace(RWDataMem, RequiredRWDataSize)) {
+ // Sufficient space in contiguous block already available.
+ return;
+ }
+
+ // MemoryManager does not have functions for releasing memory after it's
+ // allocated. Normally it tries to use any excess blocks that were allocated
+ // due to page alignment, but if we have insufficient free memory for the
+ // request this can lead to allocating disparate memory that can violate the
+ // ARM ABI. Clear free memory so only the new allocations are used, but do
+ // not release allocated memory as it may still be in-use.
+ CodeMem.FreeMem.clear();
+ RODataMem.FreeMem.clear();
+ RWDataMem.FreeMem.clear();
+
+ // Round up to the nearest page size. Blocks must be page-aligned.
+ RequiredCodeSize = alignTo(RequiredCodeSize, PageSize);
+ RequiredRODataSize = alignTo(RequiredRODataSize, PageSize);
+ RequiredRWDataSize = alignTo(RequiredRWDataSize, PageSize);
+ uint64_t RequiredSize =
+ RequiredCodeSize + RequiredRODataSize + RequiredRWDataSize;
+
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ AllocationPurpose::RWData, RequiredSize, nullptr,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ return;
+ }
+ // CodeMem will arbitrarily own this MemoryBlock to handle cleanup.
+ CodeMem.AllocatedMem.push_back(MB);
+ uintptr_t Addr = (uintptr_t)MB.base();
+ FreeMemBlock FreeMB;
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+
+ if (CodeSize > 0) {
+ assert(isAddrAligned(CodeAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredCodeSize);
+ CodeMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredCodeSize;
+ }
+
+ if (RODataSize > 0) {
+ assert(isAddrAligned(RODataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRODataSize);
+ RODataMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredRODataSize;
+ }
+
+ if (RWDataSize > 0) {
+ assert(isAddrAligned(RWDataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRWDataSize);
+ RWDataMem.FreeMem.push_back(FreeMB);
+ }
+}
+
+uint8_t *SafeSectionMemoryManager::allocateDataSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName,
+ bool IsReadOnly) {
+ if (IsReadOnly)
+ return allocateSection(SafeSectionMemoryManager::AllocationPurpose::ROData,
+ Size, Alignment);
+ return allocateSection(SafeSectionMemoryManager::AllocationPurpose::RWData, Size,
+ Alignment);
+}
+
+uint8_t *SafeSectionMemoryManager::allocateCodeSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) {
+ return allocateSection(SafeSectionMemoryManager::AllocationPurpose::Code, Size,
+ Alignment);
+}
+
+uint8_t *SafeSectionMemoryManager::allocateSection(
+ SafeSectionMemoryManager::AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment) {
+ if (!Alignment)
+ Alignment = 16;
+
+ assert(!(Alignment & (Alignment - 1)) && "Alignment must be a power of two.");
+
+ uintptr_t RequiredSize = Alignment * ((Size + Alignment - 1) / Alignment + 1);
+ uintptr_t Addr = 0;
+
+ MemoryGroup &MemGroup = [&]() -> MemoryGroup & {
+ switch (Purpose) {
+ case AllocationPurpose::Code:
+ return CodeMem;
+ case AllocationPurpose::ROData:
+ return RODataMem;
+ case AllocationPurpose::RWData:
+ return RWDataMem;
+ }
+ llvm_unreachable("Unknown SafeSectionMemoryManager::AllocationPurpose");
+ }();
+
+ // Look in the list of free memory regions and use a block there if one
+ // is available.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= RequiredSize) {
+ Addr = (uintptr_t)FreeMB.Free.base();
+ uintptr_t EndOfBlock = Addr + FreeMB.Free.allocatedSize();
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ if (FreeMB.PendingPrefixIndex == (unsigned)-1) {
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // Remember this pending block, such that future allocations can just
+ // modify it rather than creating a new one
+ FreeMB.PendingPrefixIndex = MemGroup.PendingMem.size() - 1;
+ } else {
+ sys::MemoryBlock &PendingMB =
+ MemGroup.PendingMem[FreeMB.PendingPrefixIndex];
+ PendingMB = sys::MemoryBlock(PendingMB.base(),
+ Addr + Size - (uintptr_t)PendingMB.base());
+ }
+
+ // Remember how much free space is now left in this block
+ FreeMB.Free =
+ sys::MemoryBlock((void *)(Addr + Size), EndOfBlock - Addr - Size);
+ return (uint8_t *)Addr;
+ }
+ }
+
+ // No pre-allocated free block was large enough. Allocate a new memory region.
+ // Note that all sections get allocated as read-write. The permissions will
+ // be updated later based on memory group.
+ //
+ // FIXME: It would be useful to define a default allocation size (or add
+ // it as a constructor parameter) to minimize the number of allocations.
+ //
+ // FIXME: Initialize the Near member for each memory group to avoid
+ // interleaving.
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ Purpose, RequiredSize, &MemGroup.Near,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ // FIXME: Add error propagation to the interface.
+ return nullptr;
+ }
+
+ // Save this address as the basis for our next request
+ MemGroup.Near = MB;
+
+ // Copy the address to all the other groups, if they have not
+ // been initialized.
+ if (CodeMem.Near.base() == nullptr)
+ CodeMem.Near = MB;
+ if (RODataMem.Near.base() == nullptr)
+ RODataMem.Near = MB;
+ if (RWDataMem.Near.base() == nullptr)
+ RWDataMem.Near = MB;
+
+ // Remember that we allocated this memory
+ MemGroup.AllocatedMem.push_back(MB);
+ Addr = (uintptr_t)MB.base();
+ uintptr_t EndOfBlock = Addr + MB.allocatedSize();
+
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // The allocateMappedMemory may allocate much more memory than we need. In
+ // this case, we store the unused memory as a free memory block.
+ unsigned FreeSize = EndOfBlock - Addr - Size;
+ if (FreeSize > 16) {
+ FreeMemBlock FreeMB;
+ FreeMB.Free = sys::MemoryBlock((void *)(Addr + Size), FreeSize);
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ MemGroup.FreeMem.push_back(FreeMB);
+ }
+
+ // Return aligned address
+ return (uint8_t *)Addr;
+}
+
+bool SafeSectionMemoryManager::finalizeMemory(std::string *ErrMsg) {
+ // FIXME: Should in-progress permissions be reverted if an error occurs?
+ std::error_code ec;
+
+ // Make code memory executable.
+ ec = applyMemoryGroupPermissions(CodeMem,
+ sys::Memory::MF_READ | sys::Memory::MF_EXEC);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Make read-only data memory read-only.
+ ec = applyMemoryGroupPermissions(RODataMem, sys::Memory::MF_READ);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Read-write data memory already has the correct permissions
+
+ // Some platforms with separate data cache and instruction cache require
+ // explicit cache flush, otherwise JIT code manipulations (like resolved
+ // relocations) will get to the data cache but not to the instruction cache.
+ invalidateInstructionCache();
+
+ return false;
+}
+
+static sys::MemoryBlock trimBlockToPageSize(sys::MemoryBlock M) {
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ size_t StartOverlap =
+ (PageSize - ((uintptr_t)M.base() % PageSize)) % PageSize;
+
+ size_t TrimmedSize = M.allocatedSize();
+ TrimmedSize -= StartOverlap;
+ TrimmedSize -= TrimmedSize % PageSize;
+
+ sys::MemoryBlock Trimmed((void *)((uintptr_t)M.base() + StartOverlap),
+ TrimmedSize);
+
+ assert(((uintptr_t)Trimmed.base() % PageSize) == 0);
+ assert((Trimmed.allocatedSize() % PageSize) == 0);
+ assert(M.base() <= Trimmed.base() &&
+ Trimmed.allocatedSize() <= M.allocatedSize());
+
+ return Trimmed;
+}
+
+std::error_code
+SafeSectionMemoryManager::applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions) {
+ for (sys::MemoryBlock &MB : MemGroup.PendingMem)
+ if (std::error_code EC = MMapper->protectMappedMemory(MB, Permissions))
+ return EC;
+
+ MemGroup.PendingMem.clear();
+
+ // Now go through free blocks and trim any of them that don't span the entire
+ // page because one of the pending blocks may have overlapped it.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ FreeMB.Free = trimBlockToPageSize(FreeMB.Free);
+ // We cleared the PendingMem list, so all these pointers are now invalid
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ }
+
+ // Remove all blocks which are now empty
+ erase_if(MemGroup.FreeMem, [](FreeMemBlock &FreeMB) {
+ return FreeMB.Free.allocatedSize() == 0;
+ });
+
+ return std::error_code();
+}
+
+void SafeSectionMemoryManager::invalidateInstructionCache() {
+ for (sys::MemoryBlock &Block : CodeMem.PendingMem)
+ sys::Memory::InvalidateInstructionCache(Block.base(),
+ Block.allocatedSize());
+}
+
+SafeSectionMemoryManager::~SafeSectionMemoryManager() {
+ for (MemoryGroup *Group : {&CodeMem, &RWDataMem, &RODataMem}) {
+ for (sys::MemoryBlock &Block : Group->AllocatedMem)
+ MMapper->releaseMappedMemory(Block);
+ }
+}
+
+SafeSectionMemoryManager::MemoryMapper::~MemoryMapper() = default;
+
+void SafeSectionMemoryManager::anchor() {}
+
+namespace {
+// Trivial implementation of SectionMemoryManager::MemoryMapper that just calls
+// into sys::Memory.
+class DefaultMMapper final : public SafeSectionMemoryManager::MemoryMapper {
+public:
+ sys::MemoryBlock
+ allocateMappedMemory(SafeSectionMemoryManager::AllocationPurpose Purpose,
+ size_t NumBytes, const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) override {
+ return sys::Memory::allocateMappedMemory(NumBytes, NearBlock, Flags, EC);
+ }
+
+ std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) override {
+ return sys::Memory::protectMappedMemory(Block, Flags);
+ }
+
+ std::error_code releaseMappedMemory(sys::MemoryBlock &M) override {
+ return sys::Memory::releaseMappedMemory(M);
+ }
+};
+} // namespace
+
+SafeSectionMemoryManager::SafeSectionMemoryManager(MemoryMapper *UnownedMM,
+ bool ReserveAlloc)
+ : MMapper(UnownedMM), OwnedMMapper(nullptr),
+ ReserveAllocation(ReserveAlloc) {
+ if (!MMapper) {
+ OwnedMMapper = std::make_unique<DefaultMMapper>();
+ MMapper = OwnedMMapper.get();
+ }
+}
+
+} // namespace llvm
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 0f6cec52496..23803172575 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -1279,7 +1279,8 @@ static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES, const char *Triple)
{
LLVMOrcObjectLayerRef objlayer =
- LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
+ LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
+
#if defined(HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER) && HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
if (jit_debugging_support)
diff --git a/src/backend/jit/llvm/llvmjit_wrap.cpp b/src/backend/jit/llvm/llvmjit_wrap.cpp
index 641c8841ca3..bf5e2229a82 100644
--- a/src/backend/jit/llvm/llvmjit_wrap.cpp
+++ b/src/backend/jit/llvm/llvmjit_wrap.cpp
@@ -17,13 +17,18 @@ extern "C"
}
#include <llvm-c/Core.h>
+#include <llvm-c/OrcEE.h>
/* Avoid macro clash with LLVM's C++ headers */
#undef Min
+#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
+#include <llvm/ExecutionEngine/SectionMemoryManager.h>
#include <llvm/IR/Function.h>
+#include <llvm/Support/CBindingWrapping.h>
#include "jit/llvmjit.h"
+#include "jit/SafeSectionMemoryManager.h"
/*
@@ -41,3 +46,18 @@ LLVMGetFunctionType(LLVMValueRef r)
{
return llvm::wrap(llvm::unwrap<llvm::Function>(r)->getFunctionType());
}
+
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef);
+
+LLVMOrcObjectLayerRef
+LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES)
+{
+#if __arch64__
+ const static bool reserve = true;
+#else
+ const static bool reserve = false;
+#endif
+ return wrap(new llvm::orc::RTDyldObjectLinkingLayer(
+ *unwrap(ES), [] { return std::make_unique<llvm::SafeSectionMemoryManager>(nullptr, reserve); }));
+}
diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build
index 4a4232661ba..b76e2e70547 100644
--- a/src/backend/jit/llvm/meson.build
+++ b/src/backend/jit/llvm/meson.build
@@ -14,6 +14,7 @@ llvmjit_sources += files(
'llvmjit_error.cpp',
'llvmjit_inline.cpp',
'llvmjit_wrap.cpp',
+ 'SectionMemoryManager.cpp',
)
# Code generation
diff --git a/src/include/jit/SafeSectionMemoryManager.h b/src/include/jit/SafeSectionMemoryManager.h
new file mode 100644
index 00000000000..985e85440aa
--- /dev/null
+++ b/src/include/jit/SafeSectionMemoryManager.h
@@ -0,0 +1,225 @@
+/*
+ * This file is taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to SafeSectionMemoryManager, so we can support the ARM
+ * memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.h - Memory manager for MCJIT/RtDyld -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declaration of a section-based memory manager used by
+// the MCJIT execution engine and RuntimeDyld.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_EXECUTIONENGINE_SAFESECTIONMEMORYMANAGER_H
+#define LLVM_EXECUTIONENGINE_SAFESECTIONMEMORYMANAGER_H
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
+#include "llvm/Support/Alignment.h"
+#include "llvm/Support/Memory.h"
+#include <cstdint>
+#include <string>
+#include <system_error>
+
+namespace llvm {
+
+/// This is a simple memory manager which implements the methods called by
+/// the RuntimeDyld class to allocate memory for section-based loading of
+/// objects, usually those generated by the MCJIT execution engine.
+///
+/// This memory manager allocates all section memory as read-write. The
+/// RuntimeDyld will copy JITed section memory into these allocated blocks
+/// and perform any necessary linking and relocations.
+///
+/// Any client using this memory manager MUST ensure that section-specific
+/// page permissions have been applied before attempting to execute functions
+/// in the JITed object. Permissions can be applied either by calling
+/// MCJIT::finalizeObject or by calling SectionMemoryManager::finalizeMemory
+/// directly. Clients of MCJIT should call MCJIT::finalizeObject.
+class SafeSectionMemoryManager : public RTDyldMemoryManager {
+public:
+ /// This enum describes the various reasons to allocate pages from
+ /// allocateMappedMemory.
+ enum class AllocationPurpose {
+ Code,
+ ROData,
+ RWData,
+ };
+
+ /// Implementations of this interface are used by SectionMemoryManager to
+ /// request pages from the operating system.
+ class MemoryMapper {
+ public:
+ /// This method attempts to allocate \p NumBytes bytes of virtual memory for
+ /// \p Purpose. \p NearBlock may point to an existing allocation, in which
+ /// case an attempt is made to allocate more memory near the existing block.
+ /// The actual allocated address is not guaranteed to be near the requested
+ /// address. \p Flags is used to set the initial protection flags for the
+ /// block of the memory. \p EC [out] returns an object describing any error
+ /// that occurs.
+ ///
+ /// This method may allocate more than the number of bytes requested. The
+ /// actual number of bytes allocated is indicated in the returned
+ /// MemoryBlock.
+ ///
+ /// The start of the allocated block must be aligned with the system
+ /// allocation granularity (64K on Windows, page size on Linux). If the
+ /// address following \p NearBlock is not so aligned, it will be rounded up
+ /// to the next allocation granularity boundary.
+ ///
+ /// \r a non-null MemoryBlock if the function was successful, otherwise a
+ /// null MemoryBlock with \p EC describing the error.
+ virtual sys::MemoryBlock
+ allocateMappedMemory(AllocationPurpose Purpose, size_t NumBytes,
+ const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) = 0;
+
+ /// This method sets the protection flags for a block of memory to the state
+ /// specified by \p Flags. The behavior is not specified if the memory was
+ /// not allocated using the allocateMappedMemory method.
+ /// \p Block describes the memory block to be protected.
+ /// \p Flags specifies the new protection state to be assigned to the block.
+ ///
+ /// If \p Flags is MF_WRITE, the actual behavior varies with the operating
+ /// system (i.e. MF_READ | MF_WRITE on Windows) and the target architecture
+ /// (i.e. MF_WRITE -> MF_READ | MF_WRITE on i386).
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) = 0;
+
+ /// This method releases a block of memory that was allocated with the
+ /// allocateMappedMemory method. It should not be used to release any memory
+ /// block allocated any other way.
+ /// \p Block describes the memory to be released.
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code releaseMappedMemory(sys::MemoryBlock &M) = 0;
+
+ virtual ~MemoryMapper();
+ };
+
+ /// Creates a SectionMemoryManager instance with \p MM as the associated
+ /// memory mapper. If \p MM is nullptr then a default memory mapper is used
+ /// that directly calls into the operating system.
+ ///
+ /// If \p ReserveAlloc is true all memory will be pre-allocated, and any
+ /// attempts to allocate beyond pre-allocated memory will fail.
+ SafeSectionMemoryManager(MemoryMapper *MM = nullptr, bool ReserveAlloc = false);
+ SafeSectionMemoryManager(const SafeSectionMemoryManager &) = delete;
+ void operator=(const SafeSectionMemoryManager &) = delete;
+ ~SafeSectionMemoryManager() override;
+
+ /// Enable reserveAllocationSpace when requested.
+ bool needsToReserveAllocationSpace() override { return ReserveAllocation; }
+
+ /// Implements allocating all memory in a single block. This is required to
+ /// limit memory offsets to fit the ARM ABI; large memory systems may
+ /// otherwise allocate separate sections too far apart.
+#if LLVM_VERSION_MAJOR < 16
+ virtual void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign) override;
+#else
+ void reserveAllocationSpace(uintptr_t CodeSize, Align CodeAlign,
+ uintptr_t RODataSize, Align RODataAlign,
+ uintptr_t RWDataSize, Align RWDataAlign) override;
+#endif
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// executable code.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateCodeSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) override;
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// data.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateDataSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID, StringRef SectionName,
+ bool isReadOnly) override;
+
+ /// Update section-specific memory permissions and other attributes.
+ ///
+ /// This method is called when object loading is complete and section page
+ /// permissions can be applied. It is up to the memory manager implementation
+ /// to decide whether or not to act on this method. The memory manager will
+ /// typically allocate all sections as read-write and then apply specific
+ /// permissions when this method is called. Code sections cannot be executed
+ /// until this function has been called. In addition, any cache coherency
+ /// operations needed to reliably use the memory are also performed.
+ ///
+ /// \returns true if an error occurred, false otherwise.
+ bool finalizeMemory(std::string *ErrMsg = nullptr) override;
+
+ /// Invalidate instruction cache for code sections.
+ ///
+ /// Some platforms with separate data cache and instruction cache require
+ /// explicit cache flush, otherwise JIT code manipulations (like resolved
+ /// relocations) will get to the data cache but not to the instruction cache.
+ ///
+ /// This method is called from finalizeMemory.
+ virtual void invalidateInstructionCache();
+
+private:
+ struct FreeMemBlock {
+ // The actual block of free memory
+ sys::MemoryBlock Free;
+ // If there is a pending allocation from the same reservation right before
+ // this block, store its index in PendingMem, to be able to update the
+ // pending region if part of this block is allocated, rather than having to
+ // create a new one
+ unsigned PendingPrefixIndex;
+ };
+
+ struct MemoryGroup {
+ // PendingMem contains all blocks of memory (subblocks of AllocatedMem)
+ // which have not yet had their permissions applied, but have been given
+ // out to the user. FreeMem contains all blocks of memory which have
+ // neither had their permissions applied, nor been given out to the user.
+ SmallVector<sys::MemoryBlock, 16> PendingMem;
+ SmallVector<FreeMemBlock, 16> FreeMem;
+
+ // All memory blocks that have been requested from the system
+ SmallVector<sys::MemoryBlock, 16> AllocatedMem;
+
+ sys::MemoryBlock Near;
+ };
+
+ uint8_t *allocateSection(AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment);
+
+ std::error_code applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions);
+
+ bool hasSpace(const MemoryGroup &MemGroup, uintptr_t Size) const;
+
+ void anchor() override;
+
+ MemoryGroup CodeMem;
+ MemoryGroup RWDataMem;
+ MemoryGroup RODataMem;
+ MemoryMapper *MMapper;
+ std::unique_ptr<MemoryMapper> OwnedMMapper;
+ bool ReserveAllocation;
+};
+
+} // end namespace llvm
+
+#endif // LLVM_EXECUTIONENGINE_SAFESECTIONMEMORYMANAGER_H
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 420775b1899..99c6fd15849 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -18,6 +18,7 @@
#ifdef USE_LLVM
#include <llvm-c/Types.h>
+#include <llvm-c/OrcEE.h>
/*
@@ -135,6 +136,7 @@ extern LLVMValueRef slot_compile_deform(struct LLVMJitContext *context, TupleDes
*/
extern LLVMTypeRef LLVMGetFunctionReturnType(LLVMValueRef r);
extern LLVMTypeRef LLVMGetFunctionType(LLVMValueRef r);
+extern LLVMOrcObjectLayerRef LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES);
#ifdef __cplusplus
} /* extern "C" */
--
2.39.3 (Apple Git-146)
On Tue, Aug 27, 2024 at 11:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:
SectorMemoryManager
Erm, "Section". (I was working on some file system stuff at the
weekend, and apparently my fingers now auto-complete "sector".)
On Tue, Aug 27, 2024 at 1:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
I am sure this requires changes for various LLVM versions. I tested
it with LLVM 14 on a Mac where I've never managed to reproduce the
original complaint, but ... ooooh, this might be exacerbated by ASLR,
and macOS only has a small ASLR slide window (16M or 256M apparently,
according to me in another thread), so I'd probably have to interpose
my own mmap() to choose some more interesting addresses, or run some
other OS, but that's quite enough rabbit holes for one morning.
I've tested the patch. I had to make sure the issue was triggered on
master first. The issue didn't happen with 4GB shared_buffers and 64
partitions. However, increasing to 6GB and 128 partitions triggered
the issue.
The architecture check in the patch was incorrect (__arch64__ instead
of __aarch64__; glad to see I'm not the only one confused by aarch64
vs arm64 :)) but once fixed, it worked and avoided the segfault.
I've run some additional tests with different parameters:
- I've tried disabling randomize_va_space; the issue still happened
even with ASLR disabled.
- I've tested different PG versions. With 14 and 15, 4GB and 64
partitions were enough. Starting with PG 16, I had to increase
shared_buffers to 6GB and partitions to 128. I've been able to trigger
the issue on all versions from 14 to master (which was expected, but I
wanted confirmation).
- I haven't been able to reproduce this on macOS either. I've tried
removing the MemGroup.Near hint so mmap addresses would be more
random, and played with different shared_buffers and partition values,
without success.
I've modified the patch with 3 changes:
- meson.build was referencing the SectionMemoryManager.cpp file name;
I've replaced it with SafeSectionMemoryManager.cpp
- Use __aarch64__ instead of __arch64__
- Moved the architecture switch to llvm_create_object_layer so we go
through the normal
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager on
non-arm64 architectures. There's no need for the custom memory manager
on non-arm64, so it looked better to avoid it entirely when the
reserve allocation isn't needed.
Attachments:
v3-0001-XXX-LLVM-ARM-relocation-bug-mitigation.patch (application/octet-stream)
From 02b4d45c48e12a4fcd622f6703a3d422b2186cc0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 08:58:48 +1200
Subject: XXX LLVM ARM relocation bug mitigation.
The patched code from https://github.com/llvm/llvm-project/pull/71968,
moved into a new class SafeSectionMemoryManager, adjusted to work on LLVM < 16,
and used in place of the regular memory manager.
XXX experimental
XXX this may be a terrible idea
XXX several details include #include directives would need adjustment
for prehistoric LLVM versions
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/CAO6_Xqr63qj%3DSx7HY6ZiiQ6R_JbX%2B-p6sTPwDYwTWZjUmjsYBg%40mail.gmail.com
Signed-off-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
---
src/backend/jit/llvm/Makefile | 3 +-
.../jit/llvm/SafeSectionMemoryManager.cpp | 386 ++++++++++++++++++
src/backend/jit/llvm/llvmjit.c | 6 +
src/backend/jit/llvm/llvmjit_wrap.cpp | 15 +
src/backend/jit/llvm/meson.build | 1 +
src/include/jit/SafeSectionMemoryManager.h | 225 ++++++++++
src/include/jit/llvmjit.h | 2 +
7 files changed, 637 insertions(+), 1 deletion(-)
create mode 100644 src/backend/jit/llvm/SafeSectionMemoryManager.cpp
create mode 100644 src/include/jit/SafeSectionMemoryManager.h
diff --git a/src/backend/jit/llvm/Makefile b/src/backend/jit/llvm/Makefile
index bfe5c207a2b..d7aeaff4472 100644
--- a/src/backend/jit/llvm/Makefile
+++ b/src/backend/jit/llvm/Makefile
@@ -47,7 +47,8 @@ OBJS += \
llvmjit.o \
llvmjit_error.o \
llvmjit_inline.o \
- llvmjit_wrap.o
+ llvmjit_wrap.o \
+ SafeSectionMemoryManager.o
# Code generation
OBJS += \
diff --git a/src/backend/jit/llvm/SafeSectionMemoryManager.cpp b/src/backend/jit/llvm/SafeSectionMemoryManager.cpp
new file mode 100644
index 00000000000..88d7f43f621
--- /dev/null
+++ b/src/backend/jit/llvm/SafeSectionMemoryManager.cpp
@@ -0,0 +1,386 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to SafeSectionMemoryManager, so we can support the ARM
+ * memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.cpp - Memory manager for MCJIT/RtDyld *- C++ -*-==//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements the section-based memory manager used by the MCJIT
+// execution engine and RuntimeDyld
+//
+//===----------------------------------------------------------------------===//
+
+#include "jit/SafeSectionMemoryManager.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Process.h"
+
+namespace llvm {
+
+bool SafeSectionMemoryManager::hasSpace(const MemoryGroup &MemGroup,
+ uintptr_t Size) const {
+ for (const FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= Size)
+ return true;
+ }
+ return false;
+}
+
+#if LLVM_VERSION_MAJOR < 16
+void SafeSectionMemoryManager::reserveAllocationSpace(uintptr_t CodeSize,
+ uint32_t CodeAlign_i,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign_i,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign_i) {
+ Align CodeAlign(CodeAlign_i);
+ Align RODataAlign(RODataAlign_i);
+ Align RWDataAlign(RWDataAlign_i);
+#else
+void SafeSectionMemoryManager::reserveAllocationSpace(
+ uintptr_t CodeSize, Align CodeAlign, uintptr_t RODataSize,
+ Align RODataAlign, uintptr_t RWDataSize, Align RWDataAlign) {
+#endif
+ if (CodeSize == 0 && RODataSize == 0 && RWDataSize == 0)
+ return;
+
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ // Code alignment needs to be at least the stub alignment - however, we
+ // don't have an easy way to get that here so as a workaround, we assume
+ // it's 8, which is the largest value I observed across all platforms.
+ constexpr uint64_t StubAlign = 8;
+ CodeAlign = Align(std::max(CodeAlign.value(), StubAlign));
+ RODataAlign = Align(std::max(RODataAlign.value(), StubAlign));
+ RWDataAlign = Align(std::max(RWDataAlign.value(), StubAlign));
+
+ // Get space required for each section. Use the same calculation as
+ // allocateSection because we need to be able to satisfy it.
+ uint64_t RequiredCodeSize = alignTo(CodeSize, CodeAlign) + CodeAlign.value();
+ uint64_t RequiredRODataSize =
+ alignTo(RODataSize, RODataAlign) + RODataAlign.value();
+ uint64_t RequiredRWDataSize =
+ alignTo(RWDataSize, RWDataAlign) + RWDataAlign.value();
+
+ if (hasSpace(CodeMem, RequiredCodeSize) &&
+ hasSpace(RODataMem, RequiredRODataSize) &&
+ hasSpace(RWDataMem, RequiredRWDataSize)) {
+ // Sufficient space in contiguous block already available.
+ return;
+ }
+
+ // MemoryManager does not have functions for releasing memory after it's
+ // allocated. Normally it tries to use any excess blocks that were allocated
+ // due to page alignment, but if we have insufficient free memory for the
+ // request this can lead to allocating disparate memory that can violate the
+ // ARM ABI. Clear free memory so only the new allocations are used, but do
+ // not release allocated memory as it may still be in-use.
+ CodeMem.FreeMem.clear();
+ RODataMem.FreeMem.clear();
+ RWDataMem.FreeMem.clear();
+
+ // Round up to the nearest page size. Blocks must be page-aligned.
+ RequiredCodeSize = alignTo(RequiredCodeSize, PageSize);
+ RequiredRODataSize = alignTo(RequiredRODataSize, PageSize);
+ RequiredRWDataSize = alignTo(RequiredRWDataSize, PageSize);
+ uint64_t RequiredSize =
+ RequiredCodeSize + RequiredRODataSize + RequiredRWDataSize;
+
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ AllocationPurpose::RWData, RequiredSize, nullptr,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ return;
+ }
+ // CodeMem will arbitrarily own this MemoryBlock to handle cleanup.
+ CodeMem.AllocatedMem.push_back(MB);
+ uintptr_t Addr = (uintptr_t)MB.base();
+ FreeMemBlock FreeMB;
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+
+ if (CodeSize > 0) {
+ assert(isAddrAligned(CodeAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredCodeSize);
+ CodeMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredCodeSize;
+ }
+
+ if (RODataSize > 0) {
+ assert(isAddrAligned(RODataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRODataSize);
+ RODataMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredRODataSize;
+ }
+
+ if (RWDataSize > 0) {
+ assert(isAddrAligned(RWDataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRWDataSize);
+ RWDataMem.FreeMem.push_back(FreeMB);
+ }
+}
+
+uint8_t *SafeSectionMemoryManager::allocateDataSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName,
+ bool IsReadOnly) {
+ if (IsReadOnly)
+ return allocateSection(SafeSectionMemoryManager::AllocationPurpose::ROData,
+ Size, Alignment);
+ return allocateSection(SafeSectionMemoryManager::AllocationPurpose::RWData, Size,
+ Alignment);
+}
+
+uint8_t *SafeSectionMemoryManager::allocateCodeSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) {
+ return allocateSection(SafeSectionMemoryManager::AllocationPurpose::Code, Size,
+ Alignment);
+}
+
+uint8_t *SafeSectionMemoryManager::allocateSection(
+ SafeSectionMemoryManager::AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment) {
+ if (!Alignment)
+ Alignment = 16;
+
+ assert(!(Alignment & (Alignment - 1)) && "Alignment must be a power of two.");
+
+ uintptr_t RequiredSize = Alignment * ((Size + Alignment - 1) / Alignment + 1);
+ uintptr_t Addr = 0;
+
+ MemoryGroup &MemGroup = [&]() -> MemoryGroup & {
+ switch (Purpose) {
+ case AllocationPurpose::Code:
+ return CodeMem;
+ case AllocationPurpose::ROData:
+ return RODataMem;
+ case AllocationPurpose::RWData:
+ return RWDataMem;
+ }
+ llvm_unreachable("Unknown SafeSectionMemoryManager::AllocationPurpose");
+ }();
+
+ // Look in the list of free memory regions and use a block there if one
+ // is available.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= RequiredSize) {
+ Addr = (uintptr_t)FreeMB.Free.base();
+ uintptr_t EndOfBlock = Addr + FreeMB.Free.allocatedSize();
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ if (FreeMB.PendingPrefixIndex == (unsigned)-1) {
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // Remember this pending block, such that future allocations can just
+ // modify it rather than creating a new one
+ FreeMB.PendingPrefixIndex = MemGroup.PendingMem.size() - 1;
+ } else {
+ sys::MemoryBlock &PendingMB =
+ MemGroup.PendingMem[FreeMB.PendingPrefixIndex];
+ PendingMB = sys::MemoryBlock(PendingMB.base(),
+ Addr + Size - (uintptr_t)PendingMB.base());
+ }
+
+ // Remember how much free space is now left in this block
+ FreeMB.Free =
+ sys::MemoryBlock((void *)(Addr + Size), EndOfBlock - Addr - Size);
+ return (uint8_t *)Addr;
+ }
+ }
+
+ // No pre-allocated free block was large enough. Allocate a new memory region.
+ // Note that all sections get allocated as read-write. The permissions will
+ // be updated later based on memory group.
+ //
+ // FIXME: It would be useful to define a default allocation size (or add
+ // it as a constructor parameter) to minimize the number of allocations.
+ //
+ // FIXME: Initialize the Near member for each memory group to avoid
+ // interleaving.
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ Purpose, RequiredSize, &MemGroup.Near,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ // FIXME: Add error propagation to the interface.
+ return nullptr;
+ }
+
+ // Save this address as the basis for our next request
+ MemGroup.Near = MB;
+
+ // Copy the address to all the other groups, if they have not
+ // been initialized.
+ if (CodeMem.Near.base() == nullptr)
+ CodeMem.Near = MB;
+ if (RODataMem.Near.base() == nullptr)
+ RODataMem.Near = MB;
+ if (RWDataMem.Near.base() == nullptr)
+ RWDataMem.Near = MB;
+
+ // Remember that we allocated this memory
+ MemGroup.AllocatedMem.push_back(MB);
+ Addr = (uintptr_t)MB.base();
+ uintptr_t EndOfBlock = Addr + MB.allocatedSize();
+
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // The allocateMappedMemory may allocate much more memory than we need. In
+ // this case, we store the unused memory as a free memory block.
+ unsigned FreeSize = EndOfBlock - Addr - Size;
+ if (FreeSize > 16) {
+ FreeMemBlock FreeMB;
+ FreeMB.Free = sys::MemoryBlock((void *)(Addr + Size), FreeSize);
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ MemGroup.FreeMem.push_back(FreeMB);
+ }
+
+ // Return aligned address
+ return (uint8_t *)Addr;
+}
+
+bool SafeSectionMemoryManager::finalizeMemory(std::string *ErrMsg) {
+ // FIXME: Should in-progress permissions be reverted if an error occurs?
+ std::error_code ec;
+
+ // Make code memory executable.
+ ec = applyMemoryGroupPermissions(CodeMem,
+ sys::Memory::MF_READ | sys::Memory::MF_EXEC);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Make read-only data memory read-only.
+ ec = applyMemoryGroupPermissions(RODataMem, sys::Memory::MF_READ);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Read-write data memory already has the correct permissions
+
+ // Some platforms with separate data cache and instruction cache require
+ // explicit cache flush, otherwise JIT code manipulations (like resolved
+ // relocations) will get to the data cache but not to the instruction cache.
+ invalidateInstructionCache();
+
+ return false;
+}
+
+static sys::MemoryBlock trimBlockToPageSize(sys::MemoryBlock M) {
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ size_t StartOverlap =
+ (PageSize - ((uintptr_t)M.base() % PageSize)) % PageSize;
+
+ size_t TrimmedSize = M.allocatedSize();
+ TrimmedSize -= StartOverlap;
+ TrimmedSize -= TrimmedSize % PageSize;
+
+ sys::MemoryBlock Trimmed((void *)((uintptr_t)M.base() + StartOverlap),
+ TrimmedSize);
+
+ assert(((uintptr_t)Trimmed.base() % PageSize) == 0);
+ assert((Trimmed.allocatedSize() % PageSize) == 0);
+ assert(M.base() <= Trimmed.base() &&
+ Trimmed.allocatedSize() <= M.allocatedSize());
+
+ return Trimmed;
+}
+
+std::error_code
+SafeSectionMemoryManager::applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions) {
+ for (sys::MemoryBlock &MB : MemGroup.PendingMem)
+ if (std::error_code EC = MMapper->protectMappedMemory(MB, Permissions))
+ return EC;
+
+ MemGroup.PendingMem.clear();
+
+ // Now go through free blocks and trim any of them that don't span the entire
+ // page because one of the pending blocks may have overlapped it.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ FreeMB.Free = trimBlockToPageSize(FreeMB.Free);
+ // We cleared the PendingMem list, so all these pointers are now invalid
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ }
+
+ // Remove all blocks which are now empty
+ erase_if(MemGroup.FreeMem, [](FreeMemBlock &FreeMB) {
+ return FreeMB.Free.allocatedSize() == 0;
+ });
+
+ return std::error_code();
+}
+
+void SafeSectionMemoryManager::invalidateInstructionCache() {
+ for (sys::MemoryBlock &Block : CodeMem.PendingMem)
+ sys::Memory::InvalidateInstructionCache(Block.base(),
+ Block.allocatedSize());
+}
+
+SafeSectionMemoryManager::~SafeSectionMemoryManager() {
+ for (MemoryGroup *Group : {&CodeMem, &RWDataMem, &RODataMem}) {
+ for (sys::MemoryBlock &Block : Group->AllocatedMem)
+ MMapper->releaseMappedMemory(Block);
+ }
+}
+
+SafeSectionMemoryManager::MemoryMapper::~MemoryMapper() = default;
+
+void SafeSectionMemoryManager::anchor() {}
+
+namespace {
+// Trivial implementation of SectionMemoryManager::MemoryMapper that just calls
+// into sys::Memory.
+class DefaultMMapper final : public SafeSectionMemoryManager::MemoryMapper {
+public:
+ sys::MemoryBlock
+ allocateMappedMemory(SafeSectionMemoryManager::AllocationPurpose Purpose,
+ size_t NumBytes, const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) override {
+ return sys::Memory::allocateMappedMemory(NumBytes, NearBlock, Flags, EC);
+ }
+
+ std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) override {
+ return sys::Memory::protectMappedMemory(Block, Flags);
+ }
+
+ std::error_code releaseMappedMemory(sys::MemoryBlock &M) override {
+ return sys::Memory::releaseMappedMemory(M);
+ }
+};
+} // namespace
+
+SafeSectionMemoryManager::SafeSectionMemoryManager(MemoryMapper *UnownedMM,
+ bool ReserveAlloc)
+ : MMapper(UnownedMM), OwnedMMapper(nullptr),
+ ReserveAllocation(ReserveAlloc) {
+ if (!MMapper) {
+ OwnedMMapper = std::make_unique<DefaultMMapper>();
+ MMapper = OwnedMMapper.get();
+ }
+}
+
+} // namespace llvm
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 0f6cec52496..a4ea47bcfb6 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -1278,8 +1278,14 @@ llvm_log_jit_error(void *ctx, LLVMErrorRef error)
static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES, const char *Triple)
{
+#if defined(__aarch64__)
+ LLVMOrcObjectLayerRef objlayer =
+ LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
+#else
LLVMOrcObjectLayerRef objlayer =
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
+#endif
+
#if defined(HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER) && HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
if (jit_debugging_support)
diff --git a/src/backend/jit/llvm/llvmjit_wrap.cpp b/src/backend/jit/llvm/llvmjit_wrap.cpp
index 641c8841ca3..6c722c431a7 100644
--- a/src/backend/jit/llvm/llvmjit_wrap.cpp
+++ b/src/backend/jit/llvm/llvmjit_wrap.cpp
@@ -17,13 +17,18 @@ extern "C"
}
#include <llvm-c/Core.h>
+#include <llvm-c/OrcEE.h>
/* Avoid macro clash with LLVM's C++ headers */
#undef Min
+#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
+#include <llvm/ExecutionEngine/SectionMemoryManager.h>
#include <llvm/IR/Function.h>
+#include <llvm/Support/CBindingWrapping.h>
#include "jit/llvmjit.h"
+#include "jit/SafeSectionMemoryManager.h"
/*
@@ -41,3 +46,13 @@ LLVMGetFunctionType(LLVMValueRef r)
{
return llvm::wrap(llvm::unwrap<llvm::Function>(r)->getFunctionType());
}
+
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef);
+
+LLVMOrcObjectLayerRef
+LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES)
+{
+ return wrap(new llvm::orc::RTDyldObjectLinkingLayer(
+ *unwrap(ES), [] { return std::make_unique<llvm::SafeSectionMemoryManager>(nullptr, true); }));
+}
diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build
index 4a4232661ba..87e0feba32b 100644
--- a/src/backend/jit/llvm/meson.build
+++ b/src/backend/jit/llvm/meson.build
@@ -14,6 +14,7 @@ llvmjit_sources += files(
'llvmjit_error.cpp',
'llvmjit_inline.cpp',
'llvmjit_wrap.cpp',
+ 'SafeSectionMemoryManager.cpp',
)
# Code generation
diff --git a/src/include/jit/SafeSectionMemoryManager.h b/src/include/jit/SafeSectionMemoryManager.h
new file mode 100644
index 00000000000..985e85440aa
--- /dev/null
+++ b/src/include/jit/SafeSectionMemoryManager.h
@@ -0,0 +1,225 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to SafeSectionMemoryManager, so we can support the ARM
+ * memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.h - Memory manager for MCJIT/RtDyld -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declaration of a section-based memory manager used by
+// the MCJIT execution engine and RuntimeDyld.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_EXECUTIONENGINE_SAFESECTIONMEMORYMANAGER_H
+#define LLVM_EXECUTIONENGINE_SAFESECTIONMEMORYMANAGER_H
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
+#include "llvm/Support/Alignment.h"
+#include "llvm/Support/Memory.h"
+#include <cstdint>
+#include <string>
+#include <system_error>
+
+namespace llvm {
+
+/// This is a simple memory manager which implements the methods called by
+/// the RuntimeDyld class to allocate memory for section-based loading of
+/// objects, usually those generated by the MCJIT execution engine.
+///
+/// This memory manager allocates all section memory as read-write. The
+/// RuntimeDyld will copy JITed section memory into these allocated blocks
+/// and perform any necessary linking and relocations.
+///
+/// Any client using this memory manager MUST ensure that section-specific
+/// page permissions have been applied before attempting to execute functions
+/// in the JITed object. Permissions can be applied either by calling
+/// MCJIT::finalizeObject or by calling SectionMemoryManager::finalizeMemory
+/// directly. Clients of MCJIT should call MCJIT::finalizeObject.
+class SafeSectionMemoryManager : public RTDyldMemoryManager {
+public:
+ /// This enum describes the various reasons to allocate pages from
+ /// allocateMappedMemory.
+ enum class AllocationPurpose {
+ Code,
+ ROData,
+ RWData,
+ };
+
+ /// Implementations of this interface are used by SectionMemoryManager to
+ /// request pages from the operating system.
+ class MemoryMapper {
+ public:
+ /// This method attempts to allocate \p NumBytes bytes of virtual memory for
+ /// \p Purpose. \p NearBlock may point to an existing allocation, in which
+ /// case an attempt is made to allocate more memory near the existing block.
+ /// The actual allocated address is not guaranteed to be near the requested
+ /// address. \p Flags is used to set the initial protection flags for the
+ /// block of the memory. \p EC [out] returns an object describing any error
+ /// that occurs.
+ ///
+ /// This method may allocate more than the number of bytes requested. The
+ /// actual number of bytes allocated is indicated in the returned
+ /// MemoryBlock.
+ ///
+ /// The start of the allocated block must be aligned with the system
+ /// allocation granularity (64K on Windows, page size on Linux). If the
+ /// address following \p NearBlock is not so aligned, it will be rounded up
+ /// to the next allocation granularity boundary.
+ ///
+ /// \r a non-null MemoryBlock if the function was successful, otherwise a
+ /// null MemoryBlock with \p EC describing the error.
+ virtual sys::MemoryBlock
+ allocateMappedMemory(AllocationPurpose Purpose, size_t NumBytes,
+ const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) = 0;
+
+ /// This method sets the protection flags for a block of memory to the state
+ /// specified by \p Flags. The behavior is not specified if the memory was
+ /// not allocated using the allocateMappedMemory method.
+ /// \p Block describes the memory block to be protected.
+ /// \p Flags specifies the new protection state to be assigned to the block.
+ ///
+ /// If \p Flags is MF_WRITE, the actual behavior varies with the operating
+ /// system (i.e. MF_READ | MF_WRITE on Windows) and the target architecture
+ /// (i.e. MF_WRITE -> MF_READ | MF_WRITE on i386).
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) = 0;
+
+ /// This method releases a block of memory that was allocated with the
+ /// allocateMappedMemory method. It should not be used to release any memory
+ /// block allocated any other way.
+ /// \p Block describes the memory to be released.
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code releaseMappedMemory(sys::MemoryBlock &M) = 0;
+
+ virtual ~MemoryMapper();
+ };
+
+ /// Creates a SectionMemoryManager instance with \p MM as the associated
+ /// memory mapper. If \p MM is nullptr then a default memory mapper is used
+ /// that directly calls into the operating system.
+ ///
+ /// If \p ReserveAlloc is true all memory will be pre-allocated, and any
+ /// attempts to allocate beyond pre-allocated memory will fail.
+ SafeSectionMemoryManager(MemoryMapper *MM = nullptr, bool ReserveAlloc = false);
+ SafeSectionMemoryManager(const SafeSectionMemoryManager &) = delete;
+ void operator=(const SafeSectionMemoryManager &) = delete;
+ ~SafeSectionMemoryManager() override;
+
+ /// Enable reserveAllocationSpace when requested.
+ bool needsToReserveAllocationSpace() override { return ReserveAllocation; }
+
+ /// Implements allocating all memory in a single block. This is required to
+ /// limit memory offsets to fit the ARM ABI; large memory systems may
+ /// otherwise allocate separate sections too far apart.
+#if LLVM_VERSION_MAJOR < 16
+ virtual void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign) override;
+#else
+ void reserveAllocationSpace(uintptr_t CodeSize, Align CodeAlign,
+ uintptr_t RODataSize, Align RODataAlign,
+ uintptr_t RWDataSize, Align RWDataAlign) override;
+#endif
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// executable code.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateCodeSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) override;
+
+  /// Allocates a memory block of (at least) the given size suitable for
+  /// read-only or read-write data, depending on \p isReadOnly.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateDataSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID, StringRef SectionName,
+ bool isReadOnly) override;
+
+ /// Update section-specific memory permissions and other attributes.
+ ///
+ /// This method is called when object loading is complete and section page
+ /// permissions can be applied. It is up to the memory manager implementation
+ /// to decide whether or not to act on this method. The memory manager will
+ /// typically allocate all sections as read-write and then apply specific
+ /// permissions when this method is called. Code sections cannot be executed
+ /// until this function has been called. In addition, any cache coherency
+ /// operations needed to reliably use the memory are also performed.
+ ///
+ /// \returns true if an error occurred, false otherwise.
+ bool finalizeMemory(std::string *ErrMsg = nullptr) override;
+
+ /// Invalidate instruction cache for code sections.
+ ///
+ /// Some platforms with separate data cache and instruction cache require
+ /// explicit cache flush, otherwise JIT code manipulations (like resolved
+ /// relocations) will get to the data cache but not to the instruction cache.
+ ///
+ /// This method is called from finalizeMemory.
+ virtual void invalidateInstructionCache();
+
+private:
+ struct FreeMemBlock {
+ // The actual block of free memory
+ sys::MemoryBlock Free;
+ // If there is a pending allocation from the same reservation right before
+  // this block, store its index in PendingMem, to be able to update the
+ // pending region if part of this block is allocated, rather than having to
+ // create a new one
+ unsigned PendingPrefixIndex;
+ };
+
+ struct MemoryGroup {
+ // PendingMem contains all blocks of memory (subblocks of AllocatedMem)
+ // which have not yet had their permissions applied, but have been given
+  // out to the user. FreeMem contains all blocks of memory, which have
+ // neither had their permissions applied, nor been given out to the user.
+ SmallVector<sys::MemoryBlock, 16> PendingMem;
+ SmallVector<FreeMemBlock, 16> FreeMem;
+
+ // All memory blocks that have been requested from the system
+ SmallVector<sys::MemoryBlock, 16> AllocatedMem;
+
+ sys::MemoryBlock Near;
+ };
+
+ uint8_t *allocateSection(AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment);
+
+ std::error_code applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions);
+
+ bool hasSpace(const MemoryGroup &MemGroup, uintptr_t Size) const;
+
+ void anchor() override;
+
+ MemoryGroup CodeMem;
+ MemoryGroup RWDataMem;
+ MemoryGroup RODataMem;
+ MemoryMapper *MMapper;
+ std::unique_ptr<MemoryMapper> OwnedMMapper;
+ bool ReserveAllocation;
+};
+
+} // end namespace llvm
+
+#endif // LLVM_EXECUTIONENGINE_SAFESECTIONMEMORYMANAGER_H
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 420775b1899..5c600e35604 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -18,6 +18,7 @@
#ifdef USE_LLVM
#include <llvm-c/Types.h>
+#include <llvm-c/OrcEE.h>
/*
@@ -135,6 +136,7 @@ extern LLVMValueRef slot_compile_deform(struct LLVMJitContext *context, TupleDes
*/
extern LLVMTypeRef LLVMGetFunctionReturnType(LLVMValueRef r);
extern LLVMTypeRef LLVMGetFunctionType(LLVMValueRef r);
+extern LLVMOrcObjectLayerRef LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES);
#ifdef __cplusplus
} /* extern "C" */
--
2.39.3 (Apple Git-146)
Thanks! And that's great news. Do you want to report this experience
to the PR, in support of committing it? That'd make it seem easier to
consider shipping a back-ported copy...
On Tue, Aug 27, 2024 at 12:01 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Thanks! And that's great news. Do you want to report this experience
to the PR, in support of committing it? That'd make it seem easier to
consider shipping a back-ported copy...
Yes, I will do that.
On Wed, Aug 28, 2024 at 12:07 AM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
On Tue, Aug 27, 2024 at 12:01 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Thanks! And that's great news. Do you want to report this experience
to the PR, in support of committing it? That'd make it seem easier to
consider shipping a back-ported copy...
Yes, I will do that.
Thanks. Here's a slightly tidied up version:
1. I used namespace llvm::backport, instead of a different class
name. That minimises the diff against their code.
2. I tested against LLVM 10-18, and found that 10 and 11 lack some
needed symbols. So I just hid this code from them. Even though our
stable branches support those and even older versions, I am not sure
if it's worth trying to do something about that for EOL'd distros that
no one has ever complained about. I am willing to try harder if
someone thinks that's important...
One little problem I am aware of is that if you make an empty .o,
macOS's new linker issues a warning, but I think I could live with
that. I guess I could put a dummy symbol in there... FWIW those old
LLVM versions spit out tons of other warnings from the headers on
newer compilers too, so *shrug*, don't use them? But then if this
code lands in LLVM 19 we'll also be hiding it for 19+ too.
Next, I think we should wait to see if the LLVM project commits that
PR, so that we can sync with their 19.x stable branch, instead of
using code from a PR. Our next minor release is in November, so we
have some time. If they don't commit it, we can consider it anyway: I
mean, it's crashing all over the place in production, and we see that
other projects are shipping this code already.
Attachments:
v4-0001-Backport-of-LLVM-code-to-fix-ARM-relocation-bug.patch (application/octet-stream)
From 85f28f6238fa838d8a9eb85becf0f49b55f8bad0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 08:58:48 +1200
Subject: [PATCH v4] Backport of LLVM code to fix ARM relocation bug.
Supply a new memory manager for RuntimeDyld that avoids putting ARM code
too far apart. This is the code from
https://github.com/llvm/llvm-project/pull/71968, copied into our tree
and moved into a new namespace llvm::backport, with minor adjustments to
work on LLVM 12-18.
This should fix the spate of crashes we've been receiving lately from
users on ARM systems.
XXX Ideally the LLVM project will commit this, and then we can resync
with the code in the LLVM 19.x stable branch, instead of using the code
from their PR, before we ship it!
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/CAO6_Xqr63qj%3DSx7HY6ZiiQ6R_JbX%2B-p6sTPwDYwTWZjUmjsYBg%40mail.gmail.com
---
src/backend/jit/llvm/Makefile | 3 +-
src/backend/jit/llvm/SectionMemoryManager.cpp | 392 ++++++++++++++++++
src/backend/jit/llvm/llvmjit.c | 6 +
src/backend/jit/llvm/llvmjit_wrap.cpp | 22 +
src/backend/jit/llvm/meson.build | 1 +
src/include/jit/SectionMemoryManager.h | 231 +++++++++++
src/include/jit/llvmjit.h | 6 +
7 files changed, 660 insertions(+), 1 deletion(-)
create mode 100644 src/backend/jit/llvm/SectionMemoryManager.cpp
create mode 100644 src/include/jit/SectionMemoryManager.h
diff --git a/src/backend/jit/llvm/Makefile b/src/backend/jit/llvm/Makefile
index bfe5c207a2b..e8c12060b93 100644
--- a/src/backend/jit/llvm/Makefile
+++ b/src/backend/jit/llvm/Makefile
@@ -47,7 +47,8 @@ OBJS += \
llvmjit.o \
llvmjit_error.o \
llvmjit_inline.o \
- llvmjit_wrap.o
+ llvmjit_wrap.o \
+ SectionMemoryManager.o
# Code generation
OBJS += \
diff --git a/src/backend/jit/llvm/SectionMemoryManager.cpp b/src/backend/jit/llvm/SectionMemoryManager.cpp
new file mode 100644
index 00000000000..16fe7a89c24
--- /dev/null
+++ b/src/backend/jit/llvm/SectionMemoryManager.cpp
@@ -0,0 +1,392 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to llvm::backport::SectionMemoryManager, so we can support
+ * the ARM memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.cpp - Memory manager for MCJIT/RtDyld *- C++ -*-==//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements the section-based memory manager used by the MCJIT
+// execution engine and RuntimeDyld
+//
+//===----------------------------------------------------------------------===//
+
+#include "jit/SectionMemoryManager.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Process.h"
+
+#if LLVM_VERSION_MAJOR > 11
+
+namespace llvm {
+namespace backport {
+
+bool SectionMemoryManager::hasSpace(const MemoryGroup &MemGroup,
+ uintptr_t Size) const {
+ for (const FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= Size)
+ return true;
+ }
+ return false;
+}
+
+#if LLVM_VERSION_MAJOR < 16
+void SectionMemoryManager::reserveAllocationSpace(uintptr_t CodeSize,
+ uint32_t CodeAlign_i,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign_i,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign_i) {
+ Align CodeAlign(CodeAlign_i);
+ Align RODataAlign(RODataAlign_i);
+ Align RWDataAlign(RWDataAlign_i);
+#else
+void SectionMemoryManager::reserveAllocationSpace(
+ uintptr_t CodeSize, Align CodeAlign, uintptr_t RODataSize,
+ Align RODataAlign, uintptr_t RWDataSize, Align RWDataAlign) {
+#endif
+ if (CodeSize == 0 && RODataSize == 0 && RWDataSize == 0)
+ return;
+
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ // Code alignment needs to be at least the stub alignment - however, we
+ // don't have an easy way to get that here so as a workaround, we assume
+ // it's 8, which is the largest value I observed across all platforms.
+ constexpr uint64_t StubAlign = 8;
+ CodeAlign = Align(std::max(CodeAlign.value(), StubAlign));
+ RODataAlign = Align(std::max(RODataAlign.value(), StubAlign));
+ RWDataAlign = Align(std::max(RWDataAlign.value(), StubAlign));
+
+ // Get space required for each section. Use the same calculation as
+ // allocateSection because we need to be able to satisfy it.
+ uint64_t RequiredCodeSize = alignTo(CodeSize, CodeAlign) + CodeAlign.value();
+ uint64_t RequiredRODataSize =
+ alignTo(RODataSize, RODataAlign) + RODataAlign.value();
+ uint64_t RequiredRWDataSize =
+ alignTo(RWDataSize, RWDataAlign) + RWDataAlign.value();
+
+ if (hasSpace(CodeMem, RequiredCodeSize) &&
+ hasSpace(RODataMem, RequiredRODataSize) &&
+ hasSpace(RWDataMem, RequiredRWDataSize)) {
+ // Sufficient space in contiguous block already available.
+ return;
+ }
+
+ // MemoryManager does not have functions for releasing memory after it's
+ // allocated. Normally it tries to use any excess blocks that were allocated
+ // due to page alignment, but if we have insufficient free memory for the
+ // request this can lead to allocating disparate memory that can violate the
+ // ARM ABI. Clear free memory so only the new allocations are used, but do
+ // not release allocated memory as it may still be in-use.
+ CodeMem.FreeMem.clear();
+ RODataMem.FreeMem.clear();
+ RWDataMem.FreeMem.clear();
+
+ // Round up to the nearest page size. Blocks must be page-aligned.
+ RequiredCodeSize = alignTo(RequiredCodeSize, PageSize);
+ RequiredRODataSize = alignTo(RequiredRODataSize, PageSize);
+ RequiredRWDataSize = alignTo(RequiredRWDataSize, PageSize);
+ uint64_t RequiredSize =
+ RequiredCodeSize + RequiredRODataSize + RequiredRWDataSize;
+
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ AllocationPurpose::RWData, RequiredSize, nullptr,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ return;
+ }
+ // CodeMem will arbitrarily own this MemoryBlock to handle cleanup.
+ CodeMem.AllocatedMem.push_back(MB);
+ uintptr_t Addr = (uintptr_t)MB.base();
+ FreeMemBlock FreeMB;
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+
+ if (CodeSize > 0) {
+ assert(isAddrAligned(CodeAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredCodeSize);
+ CodeMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredCodeSize;
+ }
+
+ if (RODataSize > 0) {
+ assert(isAddrAligned(RODataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRODataSize);
+ RODataMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredRODataSize;
+ }
+
+ if (RWDataSize > 0) {
+ assert(isAddrAligned(RWDataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRWDataSize);
+ RWDataMem.FreeMem.push_back(FreeMB);
+ }
+}
+
+uint8_t *SectionMemoryManager::allocateDataSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName,
+ bool IsReadOnly) {
+ if (IsReadOnly)
+ return allocateSection(SectionMemoryManager::AllocationPurpose::ROData,
+ Size, Alignment);
+ return allocateSection(SectionMemoryManager::AllocationPurpose::RWData, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateCodeSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) {
+ return allocateSection(SectionMemoryManager::AllocationPurpose::Code, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateSection(
+ SectionMemoryManager::AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment) {
+ if (!Alignment)
+ Alignment = 16;
+
+ assert(!(Alignment & (Alignment - 1)) && "Alignment must be a power of two.");
+
+ uintptr_t RequiredSize = Alignment * ((Size + Alignment - 1) / Alignment + 1);
+ uintptr_t Addr = 0;
+
+ MemoryGroup &MemGroup = [&]() -> MemoryGroup & {
+ switch (Purpose) {
+ case AllocationPurpose::Code:
+ return CodeMem;
+ case AllocationPurpose::ROData:
+ return RODataMem;
+ case AllocationPurpose::RWData:
+ return RWDataMem;
+ }
+ llvm_unreachable("Unknown SectionMemoryManager::AllocationPurpose");
+ }();
+
+ // Look in the list of free memory regions and use a block there if one
+ // is available.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= RequiredSize) {
+ Addr = (uintptr_t)FreeMB.Free.base();
+ uintptr_t EndOfBlock = Addr + FreeMB.Free.allocatedSize();
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ if (FreeMB.PendingPrefixIndex == (unsigned)-1) {
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // Remember this pending block, such that future allocations can just
+ // modify it rather than creating a new one
+ FreeMB.PendingPrefixIndex = MemGroup.PendingMem.size() - 1;
+ } else {
+ sys::MemoryBlock &PendingMB =
+ MemGroup.PendingMem[FreeMB.PendingPrefixIndex];
+ PendingMB = sys::MemoryBlock(PendingMB.base(),
+ Addr + Size - (uintptr_t)PendingMB.base());
+ }
+
+ // Remember how much free space is now left in this block
+ FreeMB.Free =
+ sys::MemoryBlock((void *)(Addr + Size), EndOfBlock - Addr - Size);
+ return (uint8_t *)Addr;
+ }
+ }
+
+ // No pre-allocated free block was large enough. Allocate a new memory region.
+ // Note that all sections get allocated as read-write. The permissions will
+ // be updated later based on memory group.
+ //
+ // FIXME: It would be useful to define a default allocation size (or add
+ // it as a constructor parameter) to minimize the number of allocations.
+ //
+ // FIXME: Initialize the Near member for each memory group to avoid
+ // interleaving.
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ Purpose, RequiredSize, &MemGroup.Near,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ // FIXME: Add error propagation to the interface.
+ return nullptr;
+ }
+
+ // Save this address as the basis for our next request
+ MemGroup.Near = MB;
+
+ // Copy the address to all the other groups, if they have not
+ // been initialized.
+ if (CodeMem.Near.base() == nullptr)
+ CodeMem.Near = MB;
+ if (RODataMem.Near.base() == nullptr)
+ RODataMem.Near = MB;
+ if (RWDataMem.Near.base() == nullptr)
+ RWDataMem.Near = MB;
+
+ // Remember that we allocated this memory
+ MemGroup.AllocatedMem.push_back(MB);
+ Addr = (uintptr_t)MB.base();
+ uintptr_t EndOfBlock = Addr + MB.allocatedSize();
+
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // The allocateMappedMemory may allocate much more memory than we need. In
+ // this case, we store the unused memory as a free memory block.
+ unsigned FreeSize = EndOfBlock - Addr - Size;
+ if (FreeSize > 16) {
+ FreeMemBlock FreeMB;
+ FreeMB.Free = sys::MemoryBlock((void *)(Addr + Size), FreeSize);
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ MemGroup.FreeMem.push_back(FreeMB);
+ }
+
+ // Return aligned address
+ return (uint8_t *)Addr;
+}
+
+bool SectionMemoryManager::finalizeMemory(std::string *ErrMsg) {
+ // FIXME: Should in-progress permissions be reverted if an error occurs?
+ std::error_code ec;
+
+ // Make code memory executable.
+ ec = applyMemoryGroupPermissions(CodeMem,
+ sys::Memory::MF_READ | sys::Memory::MF_EXEC);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Make read-only data memory read-only.
+ ec = applyMemoryGroupPermissions(RODataMem, sys::Memory::MF_READ);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Read-write data memory already has the correct permissions
+
+ // Some platforms with separate data cache and instruction cache require
+ // explicit cache flush, otherwise JIT code manipulations (like resolved
+ // relocations) will get to the data cache but not to the instruction cache.
+ invalidateInstructionCache();
+
+ return false;
+}
+
+static sys::MemoryBlock trimBlockToPageSize(sys::MemoryBlock M) {
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ size_t StartOverlap =
+ (PageSize - ((uintptr_t)M.base() % PageSize)) % PageSize;
+
+ size_t TrimmedSize = M.allocatedSize();
+ TrimmedSize -= StartOverlap;
+ TrimmedSize -= TrimmedSize % PageSize;
+
+ sys::MemoryBlock Trimmed((void *)((uintptr_t)M.base() + StartOverlap),
+ TrimmedSize);
+
+ assert(((uintptr_t)Trimmed.base() % PageSize) == 0);
+ assert((Trimmed.allocatedSize() % PageSize) == 0);
+ assert(M.base() <= Trimmed.base() &&
+ Trimmed.allocatedSize() <= M.allocatedSize());
+
+ return Trimmed;
+}
+
+std::error_code
+SectionMemoryManager::applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions) {
+ for (sys::MemoryBlock &MB : MemGroup.PendingMem)
+ if (std::error_code EC = MMapper->protectMappedMemory(MB, Permissions))
+ return EC;
+
+ MemGroup.PendingMem.clear();
+
+ // Now go through free blocks and trim any of them that don't span the entire
+ // page because one of the pending blocks may have overlapped it.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ FreeMB.Free = trimBlockToPageSize(FreeMB.Free);
+ // We cleared the PendingMem list, so all these pointers are now invalid
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ }
+
+ // Remove all blocks which are now empty
+ erase_if(MemGroup.FreeMem, [](FreeMemBlock &FreeMB) {
+ return FreeMB.Free.allocatedSize() == 0;
+ });
+
+ return std::error_code();
+}
+
+void SectionMemoryManager::invalidateInstructionCache() {
+ for (sys::MemoryBlock &Block : CodeMem.PendingMem)
+ sys::Memory::InvalidateInstructionCache(Block.base(),
+ Block.allocatedSize());
+}
+
+SectionMemoryManager::~SectionMemoryManager() {
+ for (MemoryGroup *Group : {&CodeMem, &RWDataMem, &RODataMem}) {
+ for (sys::MemoryBlock &Block : Group->AllocatedMem)
+ MMapper->releaseMappedMemory(Block);
+ }
+}
+
+SectionMemoryManager::MemoryMapper::~MemoryMapper() = default;
+
+void SectionMemoryManager::anchor() {}
+
+namespace {
+// Trivial implementation of SectionMemoryManager::MemoryMapper that just calls
+// into sys::Memory.
+class DefaultMMapper final : public SectionMemoryManager::MemoryMapper {
+public:
+ sys::MemoryBlock
+ allocateMappedMemory(SectionMemoryManager::AllocationPurpose Purpose,
+ size_t NumBytes, const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) override {
+ return sys::Memory::allocateMappedMemory(NumBytes, NearBlock, Flags, EC);
+ }
+
+ std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) override {
+ return sys::Memory::protectMappedMemory(Block, Flags);
+ }
+
+ std::error_code releaseMappedMemory(sys::MemoryBlock &M) override {
+ return sys::Memory::releaseMappedMemory(M);
+ }
+};
+} // namespace
+
+SectionMemoryManager::SectionMemoryManager(MemoryMapper *UnownedMM,
+ bool ReserveAlloc)
+ : MMapper(UnownedMM), OwnedMMapper(nullptr),
+ ReserveAllocation(ReserveAlloc) {
+ if (!MMapper) {
+ OwnedMMapper = std::make_unique<DefaultMMapper>();
+ MMapper = OwnedMMapper.get();
+ }
+}
+
+} // namespace backport
+} // namespace llvm
+
+#endif
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 0f6cec52496..a4ea47bcfb6 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -1278,8 +1278,14 @@ llvm_log_jit_error(void *ctx, LLVMErrorRef error)
static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES, const char *Triple)
{
+#if defined(__aarch64__)
+ LLVMOrcObjectLayerRef objlayer =
+ LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
+#else
LLVMOrcObjectLayerRef objlayer =
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
+#endif
+
#if defined(HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER) && HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
if (jit_debugging_support)
diff --git a/src/backend/jit/llvm/llvmjit_wrap.cpp b/src/backend/jit/llvm/llvmjit_wrap.cpp
index 641c8841ca3..27f610a8bc8 100644
--- a/src/backend/jit/llvm/llvmjit_wrap.cpp
+++ b/src/backend/jit/llvm/llvmjit_wrap.cpp
@@ -17,14 +17,24 @@ extern "C"
}
#include <llvm-c/Core.h>
+#if LLVM_VERSION_MAJOR > 11
+#include <llvm-c/OrcEE.h>
+#endif
/* Avoid macro clash with LLVM's C++ headers */
#undef Min
+#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
+#include <llvm/ExecutionEngine/SectionMemoryManager.h>
#include <llvm/IR/Function.h>
+#include <llvm/Support/CBindingWrapping.h>
#include "jit/llvmjit.h"
+#if LLVM_VERSION_MAJOR > 11
+#include "jit/SectionMemoryManager.h"
+#endif
+
/*
* C-API extensions.
@@ -41,3 +51,15 @@ LLVMGetFunctionType(LLVMValueRef r)
{
return llvm::wrap(llvm::unwrap<llvm::Function>(r)->getFunctionType());
}
+
+#if LLVM_VERSION_MAJOR > 11
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef);
+
+LLVMOrcObjectLayerRef
+LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES)
+{
+ return wrap(new llvm::orc::RTDyldObjectLinkingLayer(
+ *unwrap(ES), [] { return std::make_unique<llvm::backport::SectionMemoryManager>(nullptr, true); }));
+}
+#endif
diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build
index 4a4232661ba..b76e2e70547 100644
--- a/src/backend/jit/llvm/meson.build
+++ b/src/backend/jit/llvm/meson.build
@@ -14,6 +14,7 @@ llvmjit_sources += files(
'llvmjit_error.cpp',
'llvmjit_inline.cpp',
'llvmjit_wrap.cpp',
+ 'SectionMemoryManager.cpp',
)
# Code generation
diff --git a/src/include/jit/SectionMemoryManager.h b/src/include/jit/SectionMemoryManager.h
new file mode 100644
index 00000000000..bcdf7d691fe
--- /dev/null
+++ b/src/include/jit/SectionMemoryManager.h
@@ -0,0 +1,231 @@
+/*
+ * This file is taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to llvm::backport::SectionMemoryManager, so we can support
+ * the ARM memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.h - Memory manager for MCJIT/RtDyld -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declaration of a section-based memory manager used by
+// the MCJIT execution engine and RuntimeDyld.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+#define LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
+#include "llvm/Support/Alignment.h"
+#include "llvm/Support/Memory.h"
+#include <cstdint>
+#include <string>
+#include <system_error>
+
+#if LLVM_VERSION_MAJOR > 11
+
+namespace llvm {
+namespace backport {
+
+/// This is a simple memory manager which implements the methods called by
+/// the RuntimeDyld class to allocate memory for section-based loading of
+/// objects, usually those generated by the MCJIT execution engine.
+///
+/// This memory manager allocates all section memory as read-write. The
+/// RuntimeDyld will copy JITed section memory into these allocated blocks
+/// and perform any necessary linking and relocations.
+///
+/// Any client using this memory manager MUST ensure that section-specific
+/// page permissions have been applied before attempting to execute functions
+/// in the JITed object. Permissions can be applied either by calling
+/// MCJIT::finalizeObject or by calling SectionMemoryManager::finalizeMemory
+/// directly. Clients of MCJIT should call MCJIT::finalizeObject.
+class SectionMemoryManager : public RTDyldMemoryManager {
+public:
+ /// This enum describes the various reasons to allocate pages from
+ /// allocateMappedMemory.
+ enum class AllocationPurpose {
+ Code,
+ ROData,
+ RWData,
+ };
+
+ /// Implementations of this interface are used by SectionMemoryManager to
+ /// request pages from the operating system.
+ class MemoryMapper {
+ public:
+ /// This method attempts to allocate \p NumBytes bytes of virtual memory for
+ /// \p Purpose. \p NearBlock may point to an existing allocation, in which
+ /// case an attempt is made to allocate more memory near the existing block.
+ /// The actual allocated address is not guaranteed to be near the requested
+ /// address. \p Flags is used to set the initial protection flags for the
+ /// block of the memory. \p EC [out] returns an object describing any error
+ /// that occurs.
+ ///
+ /// This method may allocate more than the number of bytes requested. The
+ /// actual number of bytes allocated is indicated in the returned
+ /// MemoryBlock.
+ ///
+ /// The start of the allocated block must be aligned with the system
+ /// allocation granularity (64K on Windows, page size on Linux). If the
+ /// address following \p NearBlock is not so aligned, it will be rounded up
+ /// to the next allocation granularity boundary.
+ ///
+ /// \r a non-null MemoryBlock if the function was successful, otherwise a
+ /// null MemoryBlock with \p EC describing the error.
+ virtual sys::MemoryBlock
+ allocateMappedMemory(AllocationPurpose Purpose, size_t NumBytes,
+ const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) = 0;
+
+ /// This method sets the protection flags for a block of memory to the state
+ /// specified by \p Flags. The behavior is not specified if the memory was
+ /// not allocated using the allocateMappedMemory method.
+ /// \p Block describes the memory block to be protected.
+ /// \p Flags specifies the new protection state to be assigned to the block.
+ ///
+ /// If \p Flags is MF_WRITE, the actual behavior varies with the operating
+ /// system (i.e. MF_READ | MF_WRITE on Windows) and the target architecture
+ /// (i.e. MF_WRITE -> MF_READ | MF_WRITE on i386).
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) = 0;
+
+ /// This method releases a block of memory that was allocated with the
+ /// allocateMappedMemory method. It should not be used to release any memory
+ /// block allocated any other way.
+ /// \p Block describes the memory to be released.
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code releaseMappedMemory(sys::MemoryBlock &M) = 0;
+
+ virtual ~MemoryMapper();
+ };
+
+ /// Creates a SectionMemoryManager instance with \p MM as the associated
+ /// memory mapper. If \p MM is nullptr then a default memory mapper is used
+ /// that directly calls into the operating system.
+ ///
+ /// If \p ReserveAlloc is true all memory will be pre-allocated, and any
+ /// attempts to allocate beyond pre-allocated memory will fail.
+ SectionMemoryManager(MemoryMapper *MM = nullptr, bool ReserveAlloc = false);
+ SectionMemoryManager(const SectionMemoryManager &) = delete;
+ void operator=(const SectionMemoryManager &) = delete;
+ ~SectionMemoryManager() override;
+
+ /// Enable reserveAllocationSpace when requested.
+ bool needsToReserveAllocationSpace() override { return ReserveAllocation; }
+
+ /// Implements allocating all memory in a single block. This is required to
+ /// limit memory offsets to fit the ARM ABI; large memory systems may
+ /// otherwise allocate separate sections too far apart.
+#if LLVM_VERSION_MAJOR < 16
+ virtual void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign) override;
+#else
+ void reserveAllocationSpace(uintptr_t CodeSize, Align CodeAlign,
+ uintptr_t RODataSize, Align RODataAlign,
+ uintptr_t RWDataSize, Align RWDataAlign) override;
+#endif
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// executable code.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateCodeSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) override;
+
+  /// Allocates a memory block of (at least) the given size suitable for
+  /// data (read-only if \p isReadOnly is true).
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateDataSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID, StringRef SectionName,
+ bool isReadOnly) override;
+
+ /// Update section-specific memory permissions and other attributes.
+ ///
+ /// This method is called when object loading is complete and section page
+ /// permissions can be applied. It is up to the memory manager implementation
+ /// to decide whether or not to act on this method. The memory manager will
+ /// typically allocate all sections as read-write and then apply specific
+ /// permissions when this method is called. Code sections cannot be executed
+ /// until this function has been called. In addition, any cache coherency
+ /// operations needed to reliably use the memory are also performed.
+ ///
+ /// \returns true if an error occurred, false otherwise.
+ bool finalizeMemory(std::string *ErrMsg = nullptr) override;
+
+ /// Invalidate instruction cache for code sections.
+ ///
+ /// Some platforms with separate data cache and instruction cache require
+ /// explicit cache flush, otherwise JIT code manipulations (like resolved
+ /// relocations) will get to the data cache but not to the instruction cache.
+ ///
+ /// This method is called from finalizeMemory.
+ virtual void invalidateInstructionCache();
+
+private:
+ struct FreeMemBlock {
+ // The actual block of free memory
+ sys::MemoryBlock Free;
+ // If there is a pending allocation from the same reservation right before
+    // this block, store its index in PendingMem, to be able to update the
+ // pending region if part of this block is allocated, rather than having to
+ // create a new one
+ unsigned PendingPrefixIndex;
+ };
+
+ struct MemoryGroup {
+ // PendingMem contains all blocks of memory (subblocks of AllocatedMem)
+ // which have not yet had their permissions applied, but have been given
+    // out to the user. FreeMem contains all blocks of memory which have
+    // neither had their permissions applied nor been given out to the user.
+ SmallVector<sys::MemoryBlock, 16> PendingMem;
+ SmallVector<FreeMemBlock, 16> FreeMem;
+
+ // All memory blocks that have been requested from the system
+ SmallVector<sys::MemoryBlock, 16> AllocatedMem;
+
+ sys::MemoryBlock Near;
+ };
+
+ uint8_t *allocateSection(AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment);
+
+ std::error_code applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions);
+
+ bool hasSpace(const MemoryGroup &MemGroup, uintptr_t Size) const;
+
+ void anchor() override;
+
+ MemoryGroup CodeMem;
+ MemoryGroup RWDataMem;
+ MemoryGroup RODataMem;
+ MemoryMapper *MMapper;
+ std::unique_ptr<MemoryMapper> OwnedMMapper;
+ bool ReserveAllocation;
+};
+
+} // end namespace backport
+} // end namespace llvm
+
+#endif
+
+#endif // LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 420775b1899..fc10cd7543c 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -17,6 +17,9 @@
*/
#ifdef USE_LLVM
+#if LLVM_VERSION_MAJOR > 11
+#include <llvm-c/OrcEE.h>
+#endif
#include <llvm-c/Types.h>
@@ -135,6 +138,9 @@ extern LLVMValueRef slot_compile_deform(struct LLVMJitContext *context, TupleDes
*/
extern LLVMTypeRef LLVMGetFunctionReturnType(LLVMValueRef r);
extern LLVMTypeRef LLVMGetFunctionType(LLVMValueRef r);
+#if LLVM_VERSION_MAJOR > 11
+extern LLVMOrcObjectLayerRef LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES);
+#endif
#ifdef __cplusplus
} /* extern "C" */
--
2.39.3 (Apple Git-146)
Here is a slightly better version, which wraps the conditional code in #ifdef
USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER.
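For context on why the backported manager reserves one contiguous mapping:
AArch64 PC-relative instructions have limited reach (ADRP covers roughly
+/-4 GiB in 4 KiB pages, B/BL only +/-128 MiB), so sections placed too far
apart produce relocations whose offsets cannot be encoded. A rough sketch of
the reachability math (helper names are mine; the encoding widths are from
the AArch64 ISA):

```python
def fits_signed(value, bits):
    """True if `value` fits in a `bits`-wide signed immediate field."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def adrp_reachable(pc, target):
    # ADRP encodes a 21-bit signed *page* (4 KiB) offset => about +/-4 GiB.
    page_delta = (target >> 12) - (pc >> 12)
    return fits_signed(page_delta, 21)

def bl_reachable(pc, target):
    # B/BL encode a 26-bit signed *word* (4-byte) offset => +/-128 MiB.
    delta = target - pc
    return delta % 4 == 0 and fits_signed(delta >> 2, 26)

# Sections 1 GiB apart are fine for ADRP, 8 GiB apart are not:
print(adrp_reachable(0x10000, 0x10000 + (1 << 30)))   # True
print(adrp_reachable(0x10000, 0x10000 + (1 << 33)))   # False
```

When RuntimeDyld places code and data beyond such a limit, the resulting
out-of-range relocations are consistent with the zeroed jump-table contents
seen in the deform_0_1 disassembly above.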
Attachment: v5-0001-Backport-of-LLVM-code-to-fix-ARM-relocation-bug.patch
From e92904acb50b5aa6daed6e33f5ad81062fa34c38 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 08:58:48 +1200
Subject: [PATCH v5] Backport of LLVM code to fix ARM relocation bug.
Supply a new memory manager for RuntimeDyld that avoids putting ARM code
too far apart. This is the code from
https://github.com/llvm/llvm-project/pull/71968, copied into our tree
and moved into a new namespace llvm::backport, with minor adjustments to
work on LLVM 12-18.
This should fix the spate of crashes we've been receiving lately from
users on ARM systems.
XXX Ideally the LLVM project will commit this, and then we can resync
with the code in the LLVM 19.x stable branch, instead of using the code
from their PR, before we ship it!
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/CAO6_Xqr63qj%3DSx7HY6ZiiQ6R_JbX%2B-p6sTPwDYwTWZjUmjsYBg%40mail.gmail.com
---
src/backend/jit/llvm/Makefile | 3 +-
src/backend/jit/llvm/SectionMemoryManager.cpp | 394 ++++++++++++++++++
src/backend/jit/llvm/llvmjit.c | 7 +
src/backend/jit/llvm/llvmjit_wrap.cpp | 20 +
src/backend/jit/llvm/meson.build | 1 +
src/include/jit/SectionMemoryManager.h | 227 ++++++++++
src/include/jit/llvmjit.h | 8 +
src/include/jit/llvmjit_backport.h | 20 +
8 files changed, 679 insertions(+), 1 deletion(-)
create mode 100644 src/backend/jit/llvm/SectionMemoryManager.cpp
create mode 100644 src/include/jit/SectionMemoryManager.h
create mode 100644 src/include/jit/llvmjit_backport.h
diff --git a/src/backend/jit/llvm/Makefile b/src/backend/jit/llvm/Makefile
index bfe5c207a2b..e8c12060b93 100644
--- a/src/backend/jit/llvm/Makefile
+++ b/src/backend/jit/llvm/Makefile
@@ -47,7 +47,8 @@ OBJS += \
llvmjit.o \
llvmjit_error.o \
llvmjit_inline.o \
- llvmjit_wrap.o
+ llvmjit_wrap.o \
+ SectionMemoryManager.o
# Code generation
OBJS += \
diff --git a/src/backend/jit/llvm/SectionMemoryManager.cpp b/src/backend/jit/llvm/SectionMemoryManager.cpp
new file mode 100644
index 00000000000..f88ca3412e2
--- /dev/null
+++ b/src/backend/jit/llvm/SectionMemoryManager.cpp
@@ -0,0 +1,394 @@
+/*
+ * This file is taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to llvm::backport::SectionMemoryManager, so we can support
+ * the ARM memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.cpp - Memory manager for MCJIT/RtDyld *- C++ -*-==//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements the section-based memory manager used by the MCJIT
+// execution engine and RuntimeDyld
+//
+//===----------------------------------------------------------------------===//
+
+#include "jit/llvmjit_backport.h"
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+
+#include "jit/SectionMemoryManager.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Process.h"
+
+namespace llvm {
+namespace backport {
+
+bool SectionMemoryManager::hasSpace(const MemoryGroup &MemGroup,
+ uintptr_t Size) const {
+ for (const FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= Size)
+ return true;
+ }
+ return false;
+}
+
+#if LLVM_VERSION_MAJOR < 16
+void SectionMemoryManager::reserveAllocationSpace(uintptr_t CodeSize,
+ uint32_t CodeAlign_i,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign_i,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign_i) {
+ Align CodeAlign(CodeAlign_i);
+ Align RODataAlign(RODataAlign_i);
+ Align RWDataAlign(RWDataAlign_i);
+#else
+void SectionMemoryManager::reserveAllocationSpace(
+ uintptr_t CodeSize, Align CodeAlign, uintptr_t RODataSize,
+ Align RODataAlign, uintptr_t RWDataSize, Align RWDataAlign) {
+#endif
+ if (CodeSize == 0 && RODataSize == 0 && RWDataSize == 0)
+ return;
+
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ // Code alignment needs to be at least the stub alignment - however, we
+ // don't have an easy way to get that here so as a workaround, we assume
+ // it's 8, which is the largest value I observed across all platforms.
+ constexpr uint64_t StubAlign = 8;
+ CodeAlign = Align(std::max(CodeAlign.value(), StubAlign));
+ RODataAlign = Align(std::max(RODataAlign.value(), StubAlign));
+ RWDataAlign = Align(std::max(RWDataAlign.value(), StubAlign));
+
+ // Get space required for each section. Use the same calculation as
+ // allocateSection because we need to be able to satisfy it.
+ uint64_t RequiredCodeSize = alignTo(CodeSize, CodeAlign) + CodeAlign.value();
+ uint64_t RequiredRODataSize =
+ alignTo(RODataSize, RODataAlign) + RODataAlign.value();
+ uint64_t RequiredRWDataSize =
+ alignTo(RWDataSize, RWDataAlign) + RWDataAlign.value();
+
+ if (hasSpace(CodeMem, RequiredCodeSize) &&
+ hasSpace(RODataMem, RequiredRODataSize) &&
+ hasSpace(RWDataMem, RequiredRWDataSize)) {
+ // Sufficient space in contiguous block already available.
+ return;
+ }
+
+ // MemoryManager does not have functions for releasing memory after it's
+ // allocated. Normally it tries to use any excess blocks that were allocated
+ // due to page alignment, but if we have insufficient free memory for the
+ // request this can lead to allocating disparate memory that can violate the
+ // ARM ABI. Clear free memory so only the new allocations are used, but do
+ // not release allocated memory as it may still be in-use.
+ CodeMem.FreeMem.clear();
+ RODataMem.FreeMem.clear();
+ RWDataMem.FreeMem.clear();
+
+ // Round up to the nearest page size. Blocks must be page-aligned.
+ RequiredCodeSize = alignTo(RequiredCodeSize, PageSize);
+ RequiredRODataSize = alignTo(RequiredRODataSize, PageSize);
+ RequiredRWDataSize = alignTo(RequiredRWDataSize, PageSize);
+ uint64_t RequiredSize =
+ RequiredCodeSize + RequiredRODataSize + RequiredRWDataSize;
+
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ AllocationPurpose::RWData, RequiredSize, nullptr,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ return;
+ }
+ // CodeMem will arbitrarily own this MemoryBlock to handle cleanup.
+ CodeMem.AllocatedMem.push_back(MB);
+ uintptr_t Addr = (uintptr_t)MB.base();
+ FreeMemBlock FreeMB;
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+
+ if (CodeSize > 0) {
+ assert(isAddrAligned(CodeAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredCodeSize);
+ CodeMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredCodeSize;
+ }
+
+ if (RODataSize > 0) {
+ assert(isAddrAligned(RODataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRODataSize);
+ RODataMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredRODataSize;
+ }
+
+ if (RWDataSize > 0) {
+ assert(isAddrAligned(RWDataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRWDataSize);
+ RWDataMem.FreeMem.push_back(FreeMB);
+ }
+}
+
+uint8_t *SectionMemoryManager::allocateDataSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName,
+ bool IsReadOnly) {
+ if (IsReadOnly)
+ return allocateSection(SectionMemoryManager::AllocationPurpose::ROData,
+ Size, Alignment);
+ return allocateSection(SectionMemoryManager::AllocationPurpose::RWData, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateCodeSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) {
+ return allocateSection(SectionMemoryManager::AllocationPurpose::Code, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateSection(
+ SectionMemoryManager::AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment) {
+ if (!Alignment)
+ Alignment = 16;
+
+ assert(!(Alignment & (Alignment - 1)) && "Alignment must be a power of two.");
+
+ uintptr_t RequiredSize = Alignment * ((Size + Alignment - 1) / Alignment + 1);
+ uintptr_t Addr = 0;
+
+ MemoryGroup &MemGroup = [&]() -> MemoryGroup & {
+ switch (Purpose) {
+ case AllocationPurpose::Code:
+ return CodeMem;
+ case AllocationPurpose::ROData:
+ return RODataMem;
+ case AllocationPurpose::RWData:
+ return RWDataMem;
+ }
+ llvm_unreachable("Unknown SectionMemoryManager::AllocationPurpose");
+ }();
+
+ // Look in the list of free memory regions and use a block there if one
+ // is available.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= RequiredSize) {
+ Addr = (uintptr_t)FreeMB.Free.base();
+ uintptr_t EndOfBlock = Addr + FreeMB.Free.allocatedSize();
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ if (FreeMB.PendingPrefixIndex == (unsigned)-1) {
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // Remember this pending block, such that future allocations can just
+ // modify it rather than creating a new one
+ FreeMB.PendingPrefixIndex = MemGroup.PendingMem.size() - 1;
+ } else {
+ sys::MemoryBlock &PendingMB =
+ MemGroup.PendingMem[FreeMB.PendingPrefixIndex];
+ PendingMB = sys::MemoryBlock(PendingMB.base(),
+ Addr + Size - (uintptr_t)PendingMB.base());
+ }
+
+ // Remember how much free space is now left in this block
+ FreeMB.Free =
+ sys::MemoryBlock((void *)(Addr + Size), EndOfBlock - Addr - Size);
+ return (uint8_t *)Addr;
+ }
+ }
+
+ // No pre-allocated free block was large enough. Allocate a new memory region.
+ // Note that all sections get allocated as read-write. The permissions will
+ // be updated later based on memory group.
+ //
+ // FIXME: It would be useful to define a default allocation size (or add
+ // it as a constructor parameter) to minimize the number of allocations.
+ //
+ // FIXME: Initialize the Near member for each memory group to avoid
+ // interleaving.
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ Purpose, RequiredSize, &MemGroup.Near,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ // FIXME: Add error propagation to the interface.
+ return nullptr;
+ }
+
+ // Save this address as the basis for our next request
+ MemGroup.Near = MB;
+
+ // Copy the address to all the other groups, if they have not
+ // been initialized.
+ if (CodeMem.Near.base() == nullptr)
+ CodeMem.Near = MB;
+ if (RODataMem.Near.base() == nullptr)
+ RODataMem.Near = MB;
+ if (RWDataMem.Near.base() == nullptr)
+ RWDataMem.Near = MB;
+
+ // Remember that we allocated this memory
+ MemGroup.AllocatedMem.push_back(MB);
+ Addr = (uintptr_t)MB.base();
+ uintptr_t EndOfBlock = Addr + MB.allocatedSize();
+
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // The allocateMappedMemory may allocate much more memory than we need. In
+ // this case, we store the unused memory as a free memory block.
+ unsigned FreeSize = EndOfBlock - Addr - Size;
+ if (FreeSize > 16) {
+ FreeMemBlock FreeMB;
+ FreeMB.Free = sys::MemoryBlock((void *)(Addr + Size), FreeSize);
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ MemGroup.FreeMem.push_back(FreeMB);
+ }
+
+ // Return aligned address
+ return (uint8_t *)Addr;
+}
+
+bool SectionMemoryManager::finalizeMemory(std::string *ErrMsg) {
+ // FIXME: Should in-progress permissions be reverted if an error occurs?
+ std::error_code ec;
+
+ // Make code memory executable.
+ ec = applyMemoryGroupPermissions(CodeMem,
+ sys::Memory::MF_READ | sys::Memory::MF_EXEC);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Make read-only data memory read-only.
+ ec = applyMemoryGroupPermissions(RODataMem, sys::Memory::MF_READ);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Read-write data memory already has the correct permissions
+
+ // Some platforms with separate data cache and instruction cache require
+ // explicit cache flush, otherwise JIT code manipulations (like resolved
+ // relocations) will get to the data cache but not to the instruction cache.
+ invalidateInstructionCache();
+
+ return false;
+}
+
+static sys::MemoryBlock trimBlockToPageSize(sys::MemoryBlock M) {
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ size_t StartOverlap =
+ (PageSize - ((uintptr_t)M.base() % PageSize)) % PageSize;
+
+ size_t TrimmedSize = M.allocatedSize();
+ TrimmedSize -= StartOverlap;
+ TrimmedSize -= TrimmedSize % PageSize;
+
+ sys::MemoryBlock Trimmed((void *)((uintptr_t)M.base() + StartOverlap),
+ TrimmedSize);
+
+ assert(((uintptr_t)Trimmed.base() % PageSize) == 0);
+ assert((Trimmed.allocatedSize() % PageSize) == 0);
+ assert(M.base() <= Trimmed.base() &&
+ Trimmed.allocatedSize() <= M.allocatedSize());
+
+ return Trimmed;
+}
+
+std::error_code
+SectionMemoryManager::applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions) {
+ for (sys::MemoryBlock &MB : MemGroup.PendingMem)
+ if (std::error_code EC = MMapper->protectMappedMemory(MB, Permissions))
+ return EC;
+
+ MemGroup.PendingMem.clear();
+
+ // Now go through free blocks and trim any of them that don't span the entire
+ // page because one of the pending blocks may have overlapped it.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ FreeMB.Free = trimBlockToPageSize(FreeMB.Free);
+ // We cleared the PendingMem list, so all these pointers are now invalid
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ }
+
+ // Remove all blocks which are now empty
+ erase_if(MemGroup.FreeMem, [](FreeMemBlock &FreeMB) {
+ return FreeMB.Free.allocatedSize() == 0;
+ });
+
+ return std::error_code();
+}
+
+void SectionMemoryManager::invalidateInstructionCache() {
+ for (sys::MemoryBlock &Block : CodeMem.PendingMem)
+ sys::Memory::InvalidateInstructionCache(Block.base(),
+ Block.allocatedSize());
+}
+
+SectionMemoryManager::~SectionMemoryManager() {
+ for (MemoryGroup *Group : {&CodeMem, &RWDataMem, &RODataMem}) {
+ for (sys::MemoryBlock &Block : Group->AllocatedMem)
+ MMapper->releaseMappedMemory(Block);
+ }
+}
+
+SectionMemoryManager::MemoryMapper::~MemoryMapper() = default;
+
+void SectionMemoryManager::anchor() {}
+
+namespace {
+// Trivial implementation of SectionMemoryManager::MemoryMapper that just calls
+// into sys::Memory.
+class DefaultMMapper final : public SectionMemoryManager::MemoryMapper {
+public:
+ sys::MemoryBlock
+ allocateMappedMemory(SectionMemoryManager::AllocationPurpose Purpose,
+ size_t NumBytes, const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) override {
+ return sys::Memory::allocateMappedMemory(NumBytes, NearBlock, Flags, EC);
+ }
+
+ std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) override {
+ return sys::Memory::protectMappedMemory(Block, Flags);
+ }
+
+ std::error_code releaseMappedMemory(sys::MemoryBlock &M) override {
+ return sys::Memory::releaseMappedMemory(M);
+ }
+};
+} // namespace
+
+SectionMemoryManager::SectionMemoryManager(MemoryMapper *UnownedMM,
+ bool ReserveAlloc)
+ : MMapper(UnownedMM), OwnedMMapper(nullptr),
+ ReserveAllocation(ReserveAlloc) {
+ if (!MMapper) {
+ OwnedMMapper = std::make_unique<DefaultMMapper>();
+ MMapper = OwnedMMapper.get();
+ }
+}
+
+} // namespace backport
+} // namespace llvm
+
+#endif
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 0f6cec52496..92e6e16063d 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -38,6 +38,7 @@
#endif
#include "jit/llvmjit.h"
+#include "jit/llvmjit_backport.h"
#include "jit/llvmjit_emit.h"
#include "miscadmin.h"
#include "portability/instr_time.h"
@@ -1278,8 +1279,14 @@ llvm_log_jit_error(void *ctx, LLVMErrorRef error)
static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES, const char *Triple)
{
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+ LLVMOrcObjectLayerRef objlayer =
+ LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
+#else
LLVMOrcObjectLayerRef objlayer =
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
+#endif
+
#if defined(HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER) && HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
if (jit_debugging_support)
diff --git a/src/backend/jit/llvm/llvmjit_wrap.cpp b/src/backend/jit/llvm/llvmjit_wrap.cpp
index 641c8841ca3..1d061cfaa84 100644
--- a/src/backend/jit/llvm/llvmjit_wrap.cpp
+++ b/src/backend/jit/llvm/llvmjit_wrap.cpp
@@ -24,6 +24,14 @@ extern "C"
#include <llvm/IR/Function.h>
#include "jit/llvmjit.h"
+#include "jit/llvmjit_backport.h"
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
+#include <llvm/ExecutionEngine/SectionMemoryManager.h>
+#include "jit/SectionMemoryManager.h"
+#include <llvm/Support/CBindingWrapping.h>
+#endif
/*
@@ -41,3 +49,15 @@ LLVMGetFunctionType(LLVMValueRef r)
{
return llvm::wrap(llvm::unwrap<llvm::Function>(r)->getFunctionType());
}
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef);
+
+LLVMOrcObjectLayerRef
+LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES)
+{
+ return wrap(new llvm::orc::RTDyldObjectLinkingLayer(
+ *unwrap(ES), [] { return std::make_unique<llvm::backport::SectionMemoryManager>(nullptr, true); }));
+}
+#endif
diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build
index 4a4232661ba..b76e2e70547 100644
--- a/src/backend/jit/llvm/meson.build
+++ b/src/backend/jit/llvm/meson.build
@@ -14,6 +14,7 @@ llvmjit_sources += files(
'llvmjit_error.cpp',
'llvmjit_inline.cpp',
'llvmjit_wrap.cpp',
+ 'SectionMemoryManager.cpp',
)
# Code generation
diff --git a/src/include/jit/SectionMemoryManager.h b/src/include/jit/SectionMemoryManager.h
new file mode 100644
index 00000000000..d644ff921dc
--- /dev/null
+++ b/src/include/jit/SectionMemoryManager.h
@@ -0,0 +1,227 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to llvm::backport::SectionMemoryManager, so we can support
+ * the ARM memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.h - Memory manager for MCJIT/RtDyld -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declaration of a section-based memory manager used by
+// the MCJIT execution engine and RuntimeDyld.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+#define LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
+#include "llvm/Support/Alignment.h"
+#include "llvm/Support/Memory.h"
+#include <cstdint>
+#include <string>
+#include <system_error>
+
+namespace llvm {
+namespace backport {
+
+/// This is a simple memory manager which implements the methods called by
+/// the RuntimeDyld class to allocate memory for section-based loading of
+/// objects, usually those generated by the MCJIT execution engine.
+///
+/// This memory manager allocates all section memory as read-write. The
+/// RuntimeDyld will copy JITed section memory into these allocated blocks
+/// and perform any necessary linking and relocations.
+///
+/// Any client using this memory manager MUST ensure that section-specific
+/// page permissions have been applied before attempting to execute functions
+/// in the JITed object. Permissions can be applied either by calling
+/// MCJIT::finalizeObject or by calling SectionMemoryManager::finalizeMemory
+/// directly. Clients of MCJIT should call MCJIT::finalizeObject.
+class SectionMemoryManager : public RTDyldMemoryManager {
+public:
+ /// This enum describes the various reasons to allocate pages from
+ /// allocateMappedMemory.
+ enum class AllocationPurpose {
+ Code,
+ ROData,
+ RWData,
+ };
+
+ /// Implementations of this interface are used by SectionMemoryManager to
+ /// request pages from the operating system.
+ class MemoryMapper {
+ public:
+ /// This method attempts to allocate \p NumBytes bytes of virtual memory for
+ /// \p Purpose. \p NearBlock may point to an existing allocation, in which
+ /// case an attempt is made to allocate more memory near the existing block.
+ /// The actual allocated address is not guaranteed to be near the requested
+ /// address. \p Flags is used to set the initial protection flags for the
+ /// block of the memory. \p EC [out] returns an object describing any error
+ /// that occurs.
+ ///
+ /// This method may allocate more than the number of bytes requested. The
+ /// actual number of bytes allocated is indicated in the returned
+ /// MemoryBlock.
+ ///
+ /// The start of the allocated block must be aligned with the system
+ /// allocation granularity (64K on Windows, page size on Linux). If the
+ /// address following \p NearBlock is not so aligned, it will be rounded up
+ /// to the next allocation granularity boundary.
+ ///
+ /// \r a non-null MemoryBlock if the function was successful, otherwise a
+ /// null MemoryBlock with \p EC describing the error.
+ virtual sys::MemoryBlock
+ allocateMappedMemory(AllocationPurpose Purpose, size_t NumBytes,
+ const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) = 0;
+
+ /// This method sets the protection flags for a block of memory to the state
+ /// specified by \p Flags. The behavior is not specified if the memory was
+ /// not allocated using the allocateMappedMemory method.
+ /// \p Block describes the memory block to be protected.
+ /// \p Flags specifies the new protection state to be assigned to the block.
+ ///
+ /// If \p Flags is MF_WRITE, the actual behavior varies with the operating
+ /// system (i.e. MF_READ | MF_WRITE on Windows) and the target architecture
+ /// (i.e. MF_WRITE -> MF_READ | MF_WRITE on i386).
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) = 0;
+
+ /// This method releases a block of memory that was allocated with the
+ /// allocateMappedMemory method. It should not be used to release any memory
+ /// block allocated any other way.
+ /// \p Block describes the memory to be released.
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code releaseMappedMemory(sys::MemoryBlock &M) = 0;
+
+ virtual ~MemoryMapper();
+ };
+
+ /// Creates a SectionMemoryManager instance with \p MM as the associated
+ /// memory mapper. If \p MM is nullptr then a default memory mapper is used
+ /// that directly calls into the operating system.
+ ///
+ /// If \p ReserveAlloc is true all memory will be pre-allocated, and any
+ /// attempts to allocate beyond pre-allocated memory will fail.
+ SectionMemoryManager(MemoryMapper *MM = nullptr, bool ReserveAlloc = false);
+ SectionMemoryManager(const SectionMemoryManager &) = delete;
+ void operator=(const SectionMemoryManager &) = delete;
+ ~SectionMemoryManager() override;
+
+ /// Enable reserveAllocationSpace when requested.
+ bool needsToReserveAllocationSpace() override { return ReserveAllocation; }
+
+ /// Implements allocating all memory in a single block. This is required to
+ /// limit memory offsets to fit the ARM ABI; large memory systems may
+ /// otherwise allocate separate sections too far apart.
+#if LLVM_VERSION_MAJOR < 16
+ virtual void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign) override;
+#else
+ void reserveAllocationSpace(uintptr_t CodeSize, Align CodeAlign,
+ uintptr_t RODataSize, Align RODataAlign,
+ uintptr_t RWDataSize, Align RWDataAlign) override;
+#endif
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// executable code.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateCodeSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) override;
+
+  /// Allocates a memory block of (at least) the given size suitable for
+  /// data.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateDataSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID, StringRef SectionName,
+ bool isReadOnly) override;
+
+ /// Update section-specific memory permissions and other attributes.
+ ///
+ /// This method is called when object loading is complete and section page
+ /// permissions can be applied. It is up to the memory manager implementation
+ /// to decide whether or not to act on this method. The memory manager will
+ /// typically allocate all sections as read-write and then apply specific
+ /// permissions when this method is called. Code sections cannot be executed
+ /// until this function has been called. In addition, any cache coherency
+ /// operations needed to reliably use the memory are also performed.
+ ///
+ /// \returns true if an error occurred, false otherwise.
+ bool finalizeMemory(std::string *ErrMsg = nullptr) override;
+
+ /// Invalidate instruction cache for code sections.
+ ///
+ /// Some platforms with separate data cache and instruction cache require
+ /// explicit cache flush, otherwise JIT code manipulations (like resolved
+ /// relocations) will get to the data cache but not to the instruction cache.
+ ///
+ /// This method is called from finalizeMemory.
+ virtual void invalidateInstructionCache();
+
+private:
+ struct FreeMemBlock {
+ // The actual block of free memory
+ sys::MemoryBlock Free;
+ // If there is a pending allocation from the same reservation right before
+ // this block, store it's index in PendingMem, to be able to update the
+ // pending region if part of this block is allocated, rather than having to
+ // create a new one
+ unsigned PendingPrefixIndex;
+ };
+
+ struct MemoryGroup {
+ // PendingMem contains all blocks of memory (subblocks of AllocatedMem)
+ // which have not yet had their permissions applied, but have been given
+ // out to the user. FreeMem contains all block of memory, which have
+ // neither had their permissions applied, nor been given out to the user.
+ SmallVector<sys::MemoryBlock, 16> PendingMem;
+ SmallVector<FreeMemBlock, 16> FreeMem;
+
+ // All memory blocks that have been requested from the system
+ SmallVector<sys::MemoryBlock, 16> AllocatedMem;
+
+ sys::MemoryBlock Near;
+ };
+
+ uint8_t *allocateSection(AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment);
+
+ std::error_code applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions);
+
+ bool hasSpace(const MemoryGroup &MemGroup, uintptr_t Size) const;
+
+ void anchor() override;
+
+ MemoryGroup CodeMem;
+ MemoryGroup RWDataMem;
+ MemoryGroup RODataMem;
+ MemoryMapper *MMapper;
+ std::unique_ptr<MemoryMapper> OwnedMMapper;
+ bool ReserveAllocation;
+};
+
+} // end namespace backport
+} // end namespace llvm
+
+#endif // LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 420775b1899..dc0d86f4c1d 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -17,7 +17,12 @@
*/
#ifdef USE_LLVM
+#include "jit/llvmjit_backport.h"
+
#include <llvm-c/Types.h>
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#include <llvm-c/OrcEE.h>
+#endif
/*
@@ -135,6 +140,9 @@ extern LLVMValueRef slot_compile_deform(struct LLVMJitContext *context, TupleDes
*/
extern LLVMTypeRef LLVMGetFunctionReturnType(LLVMValueRef r);
extern LLVMTypeRef LLVMGetFunctionType(LLVMValueRef r);
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+extern LLVMOrcObjectLayerRef LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES);
+#endif
#ifdef __cplusplus
} /* extern "C" */
diff --git a/src/include/jit/llvmjit_backport.h b/src/include/jit/llvmjit_backport.h
new file mode 100644
index 00000000000..10b7b5cae87
--- /dev/null
+++ b/src/include/jit/llvmjit_backport.h
@@ -0,0 +1,20 @@
+/*
+ * A small header that can be included by backported LLVM code or PostgreSQL
+ * code, to control conditional compilation.
+ */
+#ifndef LLVMJIT_BACKPORT_H
+#define LLVMJIT_BACKPORT_H
+
+#include <llvm/Config/llvm-config.h>
+
+/*
+ * LLVM RuntimeDyld can produce code that crashes on larger memory ARM systems,
+ * due to memory placement. As a workaround, we supply an alternative memory
+ * manager class from https://github.com/llvm/llvm-project/pull/71968, but
+ * we've only backported it as far as LLVM 12.
+ */
+#if defined(__aarch64__) && LLVM_VERSION_MAJOR > 11
+#define USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#endif
+
+#endif
--
2.39.3 (Apple Git-146)
On Wed, Aug 28, 2024 at 12:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
2. I tested against LLVM 10-18, and found that 10 and 11 lack some
needed symbols. So I just hid this code from them. Even though our
stable branches support those and even older versions, I am not sure
if it's worth trying to do something about that for EOL'd distros that
no one has ever complained about. I am willing to try harder if
someone thinks that's important...
I would also assume that people using arm64 are more likely to use
recent versions than not.
I've done some additional tests on different LLVM versions with both
the unpatched version (to make sure the crash was triggered) and the
patched version. I'm attaching the test scripts I used as reference.
They target a Kubernetes pod since it was the easiest way for me to
get a test Ubuntu Jammy environment:
- setup_pod.sh: installs the necessary packages, fetches multiple LLVM
versions, and compiles both master and the patched version of
PostgreSQL against each LLVM version
- run_test.sh: iterates over all LLVM versions for both the unpatched
and patched PostgreSQL builds, running test_script.sh for each
- test_script.sh: runs inside the pod to set up the database with the
necessary tables and check whether the crash happens
This generated the following output:
Test unpatched version on LLVM 19, : Crash triggered
Test unpatched version on LLVM 18, libLLVM-18.so.18.1: Crash triggered
Test unpatched version on LLVM 17, libLLVM-17.so.1: Crash triggered
Test unpatched version on LLVM 16, libLLVM-16.so.1: Crash triggered
Test unpatched version on LLVM 15, libLLVM-15.so.1: Crash triggered
Test unpatched version on LLVM 14, libLLVM-14.so.1: Crash triggered
Test unpatched version on LLVM 13, libLLVM-13.so.1: Crash triggered
Test patched version on LLVM 19, : Query ran successfully
Test patched version on LLVM 18, libLLVM-18.so.18.1: Query ran successfully
Test patched version on LLVM 17, libLLVM-17.so.1: Query ran successfully
Test patched version on LLVM 16, libLLVM-16.so.1: Query ran successfully
Test patched version on LLVM 15, libLLVM-15.so.1: Query ran successfully
Test patched version on LLVM 14, libLLVM-14.so.1: Query ran successfully
Test patched version on LLVM 13, libLLVM-13.so.1: Query ran successfully
I print the libLLVM library linked to llvmjit.so in the output to
double-check that I'm testing the correct version. The LLVM 19 package
only provides static libraries (probably because it's still a release
candidate?), so it shows as empty in the output. No LLVM 12 package was
available through the llvm.sh script, so I couldn't test that version.
As for the results, all unpatched builds crashed as expected, while the
patched version ran the query successfully on every LLVM version.
Next, I think we should wait to see if the LLVM project commits that
PR, this so that we can sync with their 19.x stable branch, instead of
using code from a PR. Our next minor release is in November, so we
have some time. If they don't commit it, we can consider it anyway: I
mean, it's crashing all over the place in production, and we see that
other projects are shipping this code already.
The PR (https://github.com/llvm/llvm-project/pull/71968) just received
an approval, and it sounds like they are OK with eventually merging it.
I created a commitfest entry [1] to have the CI test the patch. There
was a failure in headerscheck and cpluspluscheck when the inclusion of
SectionMemoryManager.h is checked [2]:
In file included from /usr/include/llvm/ADT/SmallVector.h:18,
from /tmp/cirrus-ci-build/src/include/jit/SectionMemoryManager.h:23,
from /tmp/headerscheck.4b1i5C/test.c:2:
/usr/include/llvm/Support/type_traits.h:17:10: fatal error:
type_traits: No such file or directory
17 | #include <type_traits>
Since SmallVector.h includes type_traits, this file can't be compiled
with a C compiler, so I've excluded it from headerscheck.
Loosely related to headerscheck: running it locally failed because it
couldn't find the <llvm-c/Core.h> file. This is because headerscheck
expects the LLVM include files to be in /usr/include and doesn't rely
on llvm-config. I created a second patch that passes LLVM_CPPFLAGS as
extra flags when testing the src/include/jit/* files.
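The idea behind that second patch can be sketched like this (a
hypothetical helper, not the actual headerscheck code; it assumes
LLVM_CPPFLAGS carries the output of llvm-config --cppflags):

```shell
# Hypothetical helper: return the extra preprocessor flags to use when
# test-compiling a given header.  Only the JIT headers need the LLVM
# include path, so everything else gets no extra flags.
extra_flags_for() {
    case "$1" in
        src/include/jit/*) printf '%s\n' "$LLVM_CPPFLAGS" ;;
        *) printf '\n' ;;
    esac
}
```

Headers under src/include/jit/ would then be compiled with, e.g.,
-I/usr/lib/llvm-15/include in addition to the usual flags.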
Lastly, I've used www.github.com links instead of github.com to stop
spamming the llvm-project PR with references to the commit every time
it is pushed somewhere (which seems to be the unofficial hack for
this [3]).
[1]: https://commitfest.postgresql.org/49/5220/
[2]: https://cirrus-ci.com/task/4646639124611072?logs=headers_headerscheck#L42-L46
[3]: https://github.com/orgs/community/discussions/23123#discussioncomment-3239240
Attachment:
v6-0001-Backport-of-LLVM-code-to-fix-ARM-relocation-bug.patch
(application/octet-stream)
From 9c2b7632fae1eb1bb562ae99b609c2d3be41f1e4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 08:58:48 +1200
Subject: Backport of LLVM code to fix ARM relocation bug.
Supply a new memory manager for RuntimeDyld that avoids putting ARM code
too far apart. This is the code from
https://www.github.com/llvm/llvm-project/pull/71968, copied into our tree
and moved into a new namespace llvm::backport, with minor adjustments to
work on LLVM 12-18.
This should fix the spate of crashes we've been receiving lately from
users on ARM systems.
XXX Ideally the LLVM project will commit this, and then we can resync
with the code in the LLVM 19.x stable branch, instead of using the code
from their PR, before we ship it!
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/CAO6_Xqr63qj%3DSx7HY6ZiiQ6R_JbX%2B-p6sTPwDYwTWZjUmjsYBg%40mail.gmail.com
Signed-off-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
---
src/backend/jit/llvm/Makefile | 3 +-
src/backend/jit/llvm/SectionMemoryManager.cpp | 394 ++++++++++++++++++
src/backend/jit/llvm/llvmjit.c | 7 +
src/backend/jit/llvm/llvmjit_wrap.cpp | 20 +
src/backend/jit/llvm/meson.build | 1 +
src/include/jit/SectionMemoryManager.h | 227 ++++++++++
src/include/jit/llvmjit.h | 8 +
src/include/jit/llvmjit_backport.h | 20 +
src/tools/pginclude/headerscheck | 5 +
9 files changed, 684 insertions(+), 1 deletion(-)
create mode 100644 src/backend/jit/llvm/SectionMemoryManager.cpp
create mode 100644 src/include/jit/SectionMemoryManager.h
create mode 100644 src/include/jit/llvmjit_backport.h
diff --git a/src/backend/jit/llvm/Makefile b/src/backend/jit/llvm/Makefile
index bfe5c207a2b..e8c12060b93 100644
--- a/src/backend/jit/llvm/Makefile
+++ b/src/backend/jit/llvm/Makefile
@@ -47,7 +47,8 @@ OBJS += \
llvmjit.o \
llvmjit_error.o \
llvmjit_inline.o \
- llvmjit_wrap.o
+ llvmjit_wrap.o \
+ SectionMemoryManager.o
# Code generation
OBJS += \
diff --git a/src/backend/jit/llvm/SectionMemoryManager.cpp b/src/backend/jit/llvm/SectionMemoryManager.cpp
new file mode 100644
index 00000000000..f88ca3412e2
--- /dev/null
+++ b/src/backend/jit/llvm/SectionMemoryManager.cpp
@@ -0,0 +1,394 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to llvm::backport::SectionMemoryManager, so we can support
+ * the ARM memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.cpp - Memory manager for MCJIT/RtDyld *- C++ -*-==//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements the section-based memory manager used by the MCJIT
+// execution engine and RuntimeDyld
+//
+//===----------------------------------------------------------------------===//
+
+#include "jit/llvmjit_backport.h"
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+
+#include "jit/SectionMemoryManager.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Process.h"
+
+namespace llvm {
+namespace backport {
+
+bool SectionMemoryManager::hasSpace(const MemoryGroup &MemGroup,
+ uintptr_t Size) const {
+ for (const FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= Size)
+ return true;
+ }
+ return false;
+}
+
+#if LLVM_VERSION_MAJOR < 16
+void SectionMemoryManager::reserveAllocationSpace(uintptr_t CodeSize,
+ uint32_t CodeAlign_i,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign_i,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign_i) {
+ Align CodeAlign(CodeAlign_i);
+ Align RODataAlign(RODataAlign_i);
+ Align RWDataAlign(RWDataAlign_i);
+#else
+void SectionMemoryManager::reserveAllocationSpace(
+ uintptr_t CodeSize, Align CodeAlign, uintptr_t RODataSize,
+ Align RODataAlign, uintptr_t RWDataSize, Align RWDataAlign) {
+#endif
+ if (CodeSize == 0 && RODataSize == 0 && RWDataSize == 0)
+ return;
+
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ // Code alignment needs to be at least the stub alignment - however, we
+ // don't have an easy way to get that here so as a workaround, we assume
+ // it's 8, which is the largest value I observed across all platforms.
+ constexpr uint64_t StubAlign = 8;
+ CodeAlign = Align(std::max(CodeAlign.value(), StubAlign));
+ RODataAlign = Align(std::max(RODataAlign.value(), StubAlign));
+ RWDataAlign = Align(std::max(RWDataAlign.value(), StubAlign));
+
+ // Get space required for each section. Use the same calculation as
+ // allocateSection because we need to be able to satisfy it.
+ uint64_t RequiredCodeSize = alignTo(CodeSize, CodeAlign) + CodeAlign.value();
+ uint64_t RequiredRODataSize =
+ alignTo(RODataSize, RODataAlign) + RODataAlign.value();
+ uint64_t RequiredRWDataSize =
+ alignTo(RWDataSize, RWDataAlign) + RWDataAlign.value();
+
+ if (hasSpace(CodeMem, RequiredCodeSize) &&
+ hasSpace(RODataMem, RequiredRODataSize) &&
+ hasSpace(RWDataMem, RequiredRWDataSize)) {
+ // Sufficient space in contiguous block already available.
+ return;
+ }
+
+ // MemoryManager does not have functions for releasing memory after it's
+ // allocated. Normally it tries to use any excess blocks that were allocated
+ // due to page alignment, but if we have insufficient free memory for the
+ // request this can lead to allocating disparate memory that can violate the
+ // ARM ABI. Clear free memory so only the new allocations are used, but do
+ // not release allocated memory as it may still be in-use.
+ CodeMem.FreeMem.clear();
+ RODataMem.FreeMem.clear();
+ RWDataMem.FreeMem.clear();
+
+ // Round up to the nearest page size. Blocks must be page-aligned.
+ RequiredCodeSize = alignTo(RequiredCodeSize, PageSize);
+ RequiredRODataSize = alignTo(RequiredRODataSize, PageSize);
+ RequiredRWDataSize = alignTo(RequiredRWDataSize, PageSize);
+ uint64_t RequiredSize =
+ RequiredCodeSize + RequiredRODataSize + RequiredRWDataSize;
+
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ AllocationPurpose::RWData, RequiredSize, nullptr,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ return;
+ }
+ // CodeMem will arbitrarily own this MemoryBlock to handle cleanup.
+ CodeMem.AllocatedMem.push_back(MB);
+ uintptr_t Addr = (uintptr_t)MB.base();
+ FreeMemBlock FreeMB;
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+
+ if (CodeSize > 0) {
+ assert(isAddrAligned(CodeAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredCodeSize);
+ CodeMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredCodeSize;
+ }
+
+ if (RODataSize > 0) {
+ assert(isAddrAligned(RODataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRODataSize);
+ RODataMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredRODataSize;
+ }
+
+ if (RWDataSize > 0) {
+ assert(isAddrAligned(RWDataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRWDataSize);
+ RWDataMem.FreeMem.push_back(FreeMB);
+ }
+}
+
+uint8_t *SectionMemoryManager::allocateDataSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName,
+ bool IsReadOnly) {
+ if (IsReadOnly)
+ return allocateSection(SectionMemoryManager::AllocationPurpose::ROData,
+ Size, Alignment);
+ return allocateSection(SectionMemoryManager::AllocationPurpose::RWData, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateCodeSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) {
+ return allocateSection(SectionMemoryManager::AllocationPurpose::Code, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateSection(
+ SectionMemoryManager::AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment) {
+ if (!Alignment)
+ Alignment = 16;
+
+ assert(!(Alignment & (Alignment - 1)) && "Alignment must be a power of two.");
+
+ uintptr_t RequiredSize = Alignment * ((Size + Alignment - 1) / Alignment + 1);
+ uintptr_t Addr = 0;
+
+ MemoryGroup &MemGroup = [&]() -> MemoryGroup & {
+ switch (Purpose) {
+ case AllocationPurpose::Code:
+ return CodeMem;
+ case AllocationPurpose::ROData:
+ return RODataMem;
+ case AllocationPurpose::RWData:
+ return RWDataMem;
+ }
+ llvm_unreachable("Unknown SectionMemoryManager::AllocationPurpose");
+ }();
+
+ // Look in the list of free memory regions and use a block there if one
+ // is available.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= RequiredSize) {
+ Addr = (uintptr_t)FreeMB.Free.base();
+ uintptr_t EndOfBlock = Addr + FreeMB.Free.allocatedSize();
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ if (FreeMB.PendingPrefixIndex == (unsigned)-1) {
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // Remember this pending block, such that future allocations can just
+ // modify it rather than creating a new one
+ FreeMB.PendingPrefixIndex = MemGroup.PendingMem.size() - 1;
+ } else {
+ sys::MemoryBlock &PendingMB =
+ MemGroup.PendingMem[FreeMB.PendingPrefixIndex];
+ PendingMB = sys::MemoryBlock(PendingMB.base(),
+ Addr + Size - (uintptr_t)PendingMB.base());
+ }
+
+ // Remember how much free space is now left in this block
+ FreeMB.Free =
+ sys::MemoryBlock((void *)(Addr + Size), EndOfBlock - Addr - Size);
+ return (uint8_t *)Addr;
+ }
+ }
+
+ // No pre-allocated free block was large enough. Allocate a new memory region.
+ // Note that all sections get allocated as read-write. The permissions will
+ // be updated later based on memory group.
+ //
+ // FIXME: It would be useful to define a default allocation size (or add
+ // it as a constructor parameter) to minimize the number of allocations.
+ //
+ // FIXME: Initialize the Near member for each memory group to avoid
+ // interleaving.
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ Purpose, RequiredSize, &MemGroup.Near,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ // FIXME: Add error propagation to the interface.
+ return nullptr;
+ }
+
+ // Save this address as the basis for our next request
+ MemGroup.Near = MB;
+
+ // Copy the address to all the other groups, if they have not
+ // been initialized.
+ if (CodeMem.Near.base() == nullptr)
+ CodeMem.Near = MB;
+ if (RODataMem.Near.base() == nullptr)
+ RODataMem.Near = MB;
+ if (RWDataMem.Near.base() == nullptr)
+ RWDataMem.Near = MB;
+
+ // Remember that we allocated this memory
+ MemGroup.AllocatedMem.push_back(MB);
+ Addr = (uintptr_t)MB.base();
+ uintptr_t EndOfBlock = Addr + MB.allocatedSize();
+
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // The allocateMappedMemory may allocate much more memory than we need. In
+ // this case, we store the unused memory as a free memory block.
+ unsigned FreeSize = EndOfBlock - Addr - Size;
+ if (FreeSize > 16) {
+ FreeMemBlock FreeMB;
+ FreeMB.Free = sys::MemoryBlock((void *)(Addr + Size), FreeSize);
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ MemGroup.FreeMem.push_back(FreeMB);
+ }
+
+ // Return aligned address
+ return (uint8_t *)Addr;
+}
+
+bool SectionMemoryManager::finalizeMemory(std::string *ErrMsg) {
+ // FIXME: Should in-progress permissions be reverted if an error occurs?
+ std::error_code ec;
+
+ // Make code memory executable.
+ ec = applyMemoryGroupPermissions(CodeMem,
+ sys::Memory::MF_READ | sys::Memory::MF_EXEC);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Make read-only data memory read-only.
+ ec = applyMemoryGroupPermissions(RODataMem, sys::Memory::MF_READ);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Read-write data memory already has the correct permissions
+
+ // Some platforms with separate data cache and instruction cache require
+ // explicit cache flush, otherwise JIT code manipulations (like resolved
+ // relocations) will get to the data cache but not to the instruction cache.
+ invalidateInstructionCache();
+
+ return false;
+}
+
+static sys::MemoryBlock trimBlockToPageSize(sys::MemoryBlock M) {
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ size_t StartOverlap =
+ (PageSize - ((uintptr_t)M.base() % PageSize)) % PageSize;
+
+ size_t TrimmedSize = M.allocatedSize();
+ TrimmedSize -= StartOverlap;
+ TrimmedSize -= TrimmedSize % PageSize;
+
+ sys::MemoryBlock Trimmed((void *)((uintptr_t)M.base() + StartOverlap),
+ TrimmedSize);
+
+ assert(((uintptr_t)Trimmed.base() % PageSize) == 0);
+ assert((Trimmed.allocatedSize() % PageSize) == 0);
+ assert(M.base() <= Trimmed.base() &&
+ Trimmed.allocatedSize() <= M.allocatedSize());
+
+ return Trimmed;
+}
+
+std::error_code
+SectionMemoryManager::applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions) {
+ for (sys::MemoryBlock &MB : MemGroup.PendingMem)
+ if (std::error_code EC = MMapper->protectMappedMemory(MB, Permissions))
+ return EC;
+
+ MemGroup.PendingMem.clear();
+
+ // Now go through free blocks and trim any of them that don't span the entire
+ // page because one of the pending blocks may have overlapped it.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ FreeMB.Free = trimBlockToPageSize(FreeMB.Free);
+ // We cleared the PendingMem list, so all these pointers are now invalid
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ }
+
+ // Remove all blocks which are now empty
+ erase_if(MemGroup.FreeMem, [](FreeMemBlock &FreeMB) {
+ return FreeMB.Free.allocatedSize() == 0;
+ });
+
+ return std::error_code();
+}
+
+void SectionMemoryManager::invalidateInstructionCache() {
+ for (sys::MemoryBlock &Block : CodeMem.PendingMem)
+ sys::Memory::InvalidateInstructionCache(Block.base(),
+ Block.allocatedSize());
+}
+
+SectionMemoryManager::~SectionMemoryManager() {
+ for (MemoryGroup *Group : {&CodeMem, &RWDataMem, &RODataMem}) {
+ for (sys::MemoryBlock &Block : Group->AllocatedMem)
+ MMapper->releaseMappedMemory(Block);
+ }
+}
+
+SectionMemoryManager::MemoryMapper::~MemoryMapper() = default;
+
+void SectionMemoryManager::anchor() {}
+
+namespace {
+// Trivial implementation of SectionMemoryManager::MemoryMapper that just calls
+// into sys::Memory.
+class DefaultMMapper final : public SectionMemoryManager::MemoryMapper {
+public:
+ sys::MemoryBlock
+ allocateMappedMemory(SectionMemoryManager::AllocationPurpose Purpose,
+ size_t NumBytes, const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) override {
+ return sys::Memory::allocateMappedMemory(NumBytes, NearBlock, Flags, EC);
+ }
+
+ std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) override {
+ return sys::Memory::protectMappedMemory(Block, Flags);
+ }
+
+ std::error_code releaseMappedMemory(sys::MemoryBlock &M) override {
+ return sys::Memory::releaseMappedMemory(M);
+ }
+};
+} // namespace
+
+SectionMemoryManager::SectionMemoryManager(MemoryMapper *UnownedMM,
+ bool ReserveAlloc)
+ : MMapper(UnownedMM), OwnedMMapper(nullptr),
+ ReserveAllocation(ReserveAlloc) {
+ if (!MMapper) {
+ OwnedMMapper = std::make_unique<DefaultMMapper>();
+ MMapper = OwnedMMapper.get();
+ }
+}
+
+} // namespace backport
+} // namespace llvm
+
+#endif
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 0f6cec52496..92e6e16063d 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -38,6 +38,7 @@
#endif
#include "jit/llvmjit.h"
+#include "jit/llvmjit_backport.h"
#include "jit/llvmjit_emit.h"
#include "miscadmin.h"
#include "portability/instr_time.h"
@@ -1278,8 +1279,14 @@ llvm_log_jit_error(void *ctx, LLVMErrorRef error)
static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES, const char *Triple)
{
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+ LLVMOrcObjectLayerRef objlayer =
+ LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
+#else
LLVMOrcObjectLayerRef objlayer =
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
+#endif
+
#if defined(HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER) && HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
if (jit_debugging_support)
diff --git a/src/backend/jit/llvm/llvmjit_wrap.cpp b/src/backend/jit/llvm/llvmjit_wrap.cpp
index 641c8841ca3..1d061cfaa84 100644
--- a/src/backend/jit/llvm/llvmjit_wrap.cpp
+++ b/src/backend/jit/llvm/llvmjit_wrap.cpp
@@ -24,6 +24,14 @@ extern "C"
#include <llvm/IR/Function.h>
#include "jit/llvmjit.h"
+#include "jit/llvmjit_backport.h"
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
+#include <llvm/ExecutionEngine/SectionMemoryManager.h>
+#include "jit/SectionMemoryManager.h"
+#include <llvm/Support/CBindingWrapping.h>
+#endif
/*
@@ -41,3 +49,15 @@ LLVMGetFunctionType(LLVMValueRef r)
{
return llvm::wrap(llvm::unwrap<llvm::Function>(r)->getFunctionType());
}
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef);
+
+LLVMOrcObjectLayerRef
+LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES)
+{
+ return wrap(new llvm::orc::RTDyldObjectLinkingLayer(
+ *unwrap(ES), [] { return std::make_unique<llvm::backport::SectionMemoryManager>(nullptr, true); }));
+}
+#endif
diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build
index 4a4232661ba..b76e2e70547 100644
--- a/src/backend/jit/llvm/meson.build
+++ b/src/backend/jit/llvm/meson.build
@@ -14,6 +14,7 @@ llvmjit_sources += files(
'llvmjit_error.cpp',
'llvmjit_inline.cpp',
'llvmjit_wrap.cpp',
+ 'SectionMemoryManager.cpp',
)
# Code generation
diff --git a/src/include/jit/SectionMemoryManager.h b/src/include/jit/SectionMemoryManager.h
new file mode 100644
index 00000000000..d644ff921dc
--- /dev/null
+++ b/src/include/jit/SectionMemoryManager.h
@@ -0,0 +1,227 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to llvm::backport::SectionMemoryManager, so we can support
+ * the ARM memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.h - Memory manager for MCJIT/RtDyld -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declaration of a section-based memory manager used by
+// the MCJIT execution engine and RuntimeDyld.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+#define LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
+#include "llvm/Support/Alignment.h"
+#include "llvm/Support/Memory.h"
+#include <cstdint>
+#include <string>
+#include <system_error>
+
+namespace llvm {
+namespace backport {
+
+/// This is a simple memory manager which implements the methods called by
+/// the RuntimeDyld class to allocate memory for section-based loading of
+/// objects, usually those generated by the MCJIT execution engine.
+///
+/// This memory manager allocates all section memory as read-write. The
+/// RuntimeDyld will copy JITed section memory into these allocated blocks
+/// and perform any necessary linking and relocations.
+///
+/// Any client using this memory manager MUST ensure that section-specific
+/// page permissions have been applied before attempting to execute functions
+/// in the JITed object. Permissions can be applied either by calling
+/// MCJIT::finalizeObject or by calling SectionMemoryManager::finalizeMemory
+/// directly. Clients of MCJIT should call MCJIT::finalizeObject.
+class SectionMemoryManager : public RTDyldMemoryManager {
+public:
+ /// This enum describes the various reasons to allocate pages from
+ /// allocateMappedMemory.
+ enum class AllocationPurpose {
+ Code,
+ ROData,
+ RWData,
+ };
+
+ /// Implementations of this interface are used by SectionMemoryManager to
+ /// request pages from the operating system.
+ class MemoryMapper {
+ public:
+ /// This method attempts to allocate \p NumBytes bytes of virtual memory for
+ /// \p Purpose. \p NearBlock may point to an existing allocation, in which
+ /// case an attempt is made to allocate more memory near the existing block.
+ /// The actual allocated address is not guaranteed to be near the requested
+ /// address. \p Flags is used to set the initial protection flags for the
+ /// block of the memory. \p EC [out] returns an object describing any error
+ /// that occurs.
+ ///
+ /// This method may allocate more than the number of bytes requested. The
+ /// actual number of bytes allocated is indicated in the returned
+ /// MemoryBlock.
+ ///
+ /// The start of the allocated block must be aligned with the system
+ /// allocation granularity (64K on Windows, page size on Linux). If the
+ /// address following \p NearBlock is not so aligned, it will be rounded up
+ /// to the next allocation granularity boundary.
+ ///
+ /// \r a non-null MemoryBlock if the function was successful, otherwise a
+ /// null MemoryBlock with \p EC describing the error.
+ virtual sys::MemoryBlock
+ allocateMappedMemory(AllocationPurpose Purpose, size_t NumBytes,
+ const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) = 0;
+
+ /// This method sets the protection flags for a block of memory to the state
+ /// specified by \p Flags. The behavior is not specified if the memory was
+ /// not allocated using the allocateMappedMemory method.
+ /// \p Block describes the memory block to be protected.
+ /// \p Flags specifies the new protection state to be assigned to the block.
+ ///
+ /// If \p Flags is MF_WRITE, the actual behavior varies with the operating
+ /// system (i.e. MF_READ | MF_WRITE on Windows) and the target architecture
+ /// (i.e. MF_WRITE -> MF_READ | MF_WRITE on i386).
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) = 0;
+
+ /// This method releases a block of memory that was allocated with the
+ /// allocateMappedMemory method. It should not be used to release any memory
+ /// block allocated any other way.
+ /// \p Block describes the memory to be released.
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code releaseMappedMemory(sys::MemoryBlock &M) = 0;
+
+ virtual ~MemoryMapper();
+ };
+
+ /// Creates a SectionMemoryManager instance with \p MM as the associated
+ /// memory mapper. If \p MM is nullptr then a default memory mapper is used
+ /// that directly calls into the operating system.
+ ///
+ /// If \p ReserveAlloc is true all memory will be pre-allocated, and any
+ /// attempts to allocate beyond pre-allocated memory will fail.
+ SectionMemoryManager(MemoryMapper *MM = nullptr, bool ReserveAlloc = false);
+ SectionMemoryManager(const SectionMemoryManager &) = delete;
+ void operator=(const SectionMemoryManager &) = delete;
+ ~SectionMemoryManager() override;
+
+ /// Enable reserveAllocationSpace when requested.
+ bool needsToReserveAllocationSpace() override { return ReserveAllocation; }
+
+ /// Implements allocating all memory in a single block. This is required to
+ /// limit memory offsets to fit the ARM ABI; large memory systems may
+ /// otherwise allocate separate sections too far apart.
+#if LLVM_VERSION_MAJOR < 16
+ virtual void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign) override;
+#else
+ void reserveAllocationSpace(uintptr_t CodeSize, Align CodeAlign,
+ uintptr_t RODataSize, Align RODataAlign,
+ uintptr_t RWDataSize, Align RWDataAlign) override;
+#endif
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// executable code.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateCodeSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) override;
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// data.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateDataSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID, StringRef SectionName,
+ bool isReadOnly) override;
+
+ /// Update section-specific memory permissions and other attributes.
+ ///
+ /// This method is called when object loading is complete and section page
+ /// permissions can be applied. It is up to the memory manager implementation
+ /// to decide whether or not to act on this method. The memory manager will
+ /// typically allocate all sections as read-write and then apply specific
+ /// permissions when this method is called. Code sections cannot be executed
+ /// until this function has been called. In addition, any cache coherency
+ /// operations needed to reliably use the memory are also performed.
+ ///
+ /// \returns true if an error occurred, false otherwise.
+ bool finalizeMemory(std::string *ErrMsg = nullptr) override;
+
+ /// Invalidate instruction cache for code sections.
+ ///
+ /// Some platforms with separate data cache and instruction cache require
+ /// explicit cache flush, otherwise JIT code manipulations (like resolved
+ /// relocations) will get to the data cache but not to the instruction cache.
+ ///
+ /// This method is called from finalizeMemory.
+ virtual void invalidateInstructionCache();
+
+private:
+ struct FreeMemBlock {
+ // The actual block of free memory
+ sys::MemoryBlock Free;
+ // If there is a pending allocation from the same reservation right before
+ // this block, store its index in PendingMem, to be able to update the
+ // pending region if part of this block is allocated, rather than having to
+ // create a new one
+ unsigned PendingPrefixIndex;
+ };
+
+ struct MemoryGroup {
+ // PendingMem contains all blocks of memory (subblocks of AllocatedMem)
+ // which have not yet had their permissions applied, but have been given
+ // out to the user. FreeMem contains all blocks of memory which have
+ // neither had their permissions applied, nor been given out to the user.
+ SmallVector<sys::MemoryBlock, 16> PendingMem;
+ SmallVector<FreeMemBlock, 16> FreeMem;
+
+ // All memory blocks that have been requested from the system
+ SmallVector<sys::MemoryBlock, 16> AllocatedMem;
+
+ sys::MemoryBlock Near;
+ };
+
+ uint8_t *allocateSection(AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment);
+
+ std::error_code applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions);
+
+ bool hasSpace(const MemoryGroup &MemGroup, uintptr_t Size) const;
+
+ void anchor() override;
+
+ MemoryGroup CodeMem;
+ MemoryGroup RWDataMem;
+ MemoryGroup RODataMem;
+ MemoryMapper *MMapper;
+ std::unique_ptr<MemoryMapper> OwnedMMapper;
+ bool ReserveAllocation;
+};
+
+} // end namespace backport
+} // end namespace llvm
+
+#endif // LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 420775b1899..dc0d86f4c1d 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -17,7 +17,12 @@
*/
#ifdef USE_LLVM
+#include "jit/llvmjit_backport.h"
+
#include <llvm-c/Types.h>
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#include <llvm-c/OrcEE.h>
+#endif
/*
@@ -135,6 +140,9 @@ extern LLVMValueRef slot_compile_deform(struct LLVMJitContext *context, TupleDes
*/
extern LLVMTypeRef LLVMGetFunctionReturnType(LLVMValueRef r);
extern LLVMTypeRef LLVMGetFunctionType(LLVMValueRef r);
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+extern LLVMOrcObjectLayerRef LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES);
+#endif
#ifdef __cplusplus
} /* extern "C" */
diff --git a/src/include/jit/llvmjit_backport.h b/src/include/jit/llvmjit_backport.h
new file mode 100644
index 00000000000..10b7b5cae87
--- /dev/null
+++ b/src/include/jit/llvmjit_backport.h
@@ -0,0 +1,20 @@
+/*
+ * A small header that can be included by backported LLVM code or PostgreSQL
+ * code, to control conditional compilation.
+ */
+#ifndef LLVMJIT_BACKPORT_H
+#define LLVMJIT_BACKPORT_H
+
+#include <llvm/Config/llvm-config.h>
+
+/*
+ * LLVM RuntimeDyld can produce code that crashes on larger memory ARM systems,
+ * due to memory placement. As a workaround, we supply an alternative memory
+ * manager class from https://github.com/llvm/llvm-project/pull/71968, but
+ * we've only backported it as far as LLVM 12.
+ */
+#if defined(__aarch64__) && LLVM_VERSION_MAJOR > 11
+#define USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#endif
+
+#endif
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 436e2b92a33..a78e16888a6 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -162,6 +162,11 @@ do
# This produces a "no previous prototype" warning.
! $cplusplus && test "$f" = src/include/storage/checksum_impl.h && continue
+ # SectionMemoryManager.h is C++ code that includes <type_traits> via
+ # llvm/ADT/SmallVector.h, so it compiles neither with a C compiler nor
+ # with a C++ compiler under extern "C" linkage.
+ test "$f" = src/include/jit/SectionMemoryManager.h && continue
+
# ppport.h is not under our control, so we can't make it standalone.
test "$f" = src/pl/plperl/ppport.h && continue
--
2.39.3 (Apple Git-146)
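To restate the gate in llvmjit_backport.h above: the backported memory manager is used only on aarch64 with LLVM major version greater than 11. A minimal shell emulation of that logic (the real check is done by the C preprocessor on __aarch64__ and LLVM_VERSION_MAJOR from llvm-config.h; the architecture and version values below are assumptions for illustration):

```shell
# Emulate the #if in llvmjit_backport.h. In the real header these come
# from the compiler target and llvm-config.h; here they are assumed values.
arch="aarch64"      # stands in for the __aarch64__ macro
llvm_major=15       # stands in for LLVM_VERSION_MAJOR

use_backport=no
if [ "$arch" = "aarch64" ] && [ "$llvm_major" -gt 11 ]; then
    use_backport=yes
fi
echo "USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER: $use_backport"
```

With LLVM 11 or older, or on any other architecture, the stock SectionMemoryManager is used instead.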
Attachment: v6-0002-Add-LLVM_CPPFLAGS-in-headerscheck-to-llvm-jit-fil.patch
From c5dd8e23416e09aa63dea2ab574a42b7f30b7567 Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Date: Fri, 30 Aug 2024 14:45:30 +0200
Subject: Add LLVM_CPPFLAGS in headerscheck to llvm jit files
headerscheck on the CI currently works because the required llvm include
files are located in /usr/include (like /usr/include/llvm-c/Core.h).
However, it is very likely that those files will be located in a
version-specific directory, in which case llvm-config should be used to
get the correct cppflags via "$LLVM_CONFIG --cppflags".
Those flags are available in configure as LLVM_CPPFLAGS so we can add
them as extra flags for the headerscheck test command. This will allow
running headerscheck on systems that don't have LLVM include files in
/usr/include.
---
src/tools/pginclude/headerscheck | 3 +++
1 file changed, 3 insertions(+)
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index a78e16888a6..59fa4363ab7 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -43,6 +43,7 @@ CXXFLAGS=${CXXFLAGS:- -fsyntax-only -Wall}
# Pull some info from configure's results.
MGLOB="$builddir/src/Makefile.global"
CPPFLAGS=`sed -n 's/^CPPFLAGS[ ]*=[ ]*//p' "$MGLOB"`
+LLVM_CPPFLAGS=`sed -n 's/^LLVM_CPPFLAGS[ ]*=[ ]*//p' "$MGLOB"`
CFLAGS=`sed -n 's/^CFLAGS[ ]*=[ ]*//p' "$MGLOB"`
CC=`sed -n 's/^CC[ ]*=[ ]*//p' "$MGLOB"`
CXX=`sed -n 's/^CXX[ ]*=[ ]*//p' "$MGLOB"`
@@ -236,6 +237,8 @@ do
EXTRAINCLUDES="$python_includespec" ;;
src/interfaces/ecpg/*)
EXTRAINCLUDES="-I $builddir/src/interfaces/ecpg/include -I $srcdir/src/interfaces/ecpg/include" ;;
+ src/include/jit/*)
+ EXTRAINCLUDES="$LLVM_CPPFLAGS" ;;
src/backend/parser/*)
EXTRAINCLUDES="-I $builddir/src/backend/parser/" ;;
src/backend/utils/adt/*)
--
2.39.3 (Apple Git-146)
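The sed extraction used by headerscheck above can be sketched standalone. A hedged example (the Makefile.global contents here are made up; real values come from configure):

```shell
# Reproduce how headerscheck pulls a variable out of Makefile.global:
# sed prints only lines starting with LLVM_CPPFLAGS, stripping the
# "NAME = " prefix. The file below is a fabricated stand-in.
MGLOB=$(mktemp)
cat > "$MGLOB" <<'EOF'
CPPFLAGS = -D_GNU_SOURCE
LLVM_CPPFLAGS = -I/usr/lib/llvm-15/include -D__STDC_LIMIT_MACROS
EOF
LLVM_CPPFLAGS=`sed -n 's/^LLVM_CPPFLAGS[ ]*=[ ]*//p' "$MGLOB"`
echo "extracted: $LLVM_CPPFLAGS"
rm -f "$MGLOB"
```

The same pattern is already used for CPPFLAGS, CFLAGS, CC and CXX in the script; the patch just adds one more variable.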
I've run some additional tests, mostly pgbench with
options=-cjit_above_cost=0 for an extended duration on an instance
that was impacted. I haven't seen any issues nor performance
regressions compared to the unpatched version.
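A stress run along those lines might look like the following sketch. Only the options=-cjit_above_cost=0 connection parameter comes from the description above; the duration, client count, and database name are assumptions, and the command is merely echoed here since running it needs a live server:

```shell
# jit_above_cost=0 forces the planner to JIT-compile every expression,
# exercising the patched memory manager on each query.
# Duration/clients/dbname below are hypothetical.
conninfo='options=-cjit_above_cost=0 dbname=postgres'
echo "pgbench -T 3600 -c 8 \"$conninfo\""
```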
I will switch the commitfest entry to Ready for Committer if there's
no objection.
On Thu, Oct 17, 2024 at 10:36 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
I've run some additional tests, mostly pgbench with
options=-cjit_above_cost=0 for an extended duration on an instance
that was impacted. I haven't seen any issues nor performance
regressions compared to the unpatched version.
I will switch the commitfest entry to Ready for Committer if there's
no objection.
Thanks! I'm going to go ahead and commit this. I asked Andres if he
wanted to object to this plan (as author of our LLVM stuff) and he did
not. I tried, for several frustrating days, to figure out how to
solve our problem using JITLink and stay on a more "supported" path,
but it all seems a bit unready, as various things don't work or aren't
available in various versions in our support range. At least I now
have the bones of a patch to prepare for JITLink in LLVM 20 or
whenever they force our hand... I'll write about that in a new thread
soon.
On Thu, Oct 17, 2024 at 10:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Thanks! I'm going to go ahead and commit this.
(Sorry for the delay, I got distracted by pgconf.eu.)
Today I set out to commit this patch, and wrote a proper commit
message to explain the code provenance, circumstances that led to it,
and the future conditions that will allow us to delete it in a few
years. Please see attached. In the process I struck a potential
snag:
https://llvm.org/LICENSE.txt
https://en.wikipedia.org/wiki/Apache_License
There are a couple of cases of dual-licensed code in our tree where we
explicitly used the Boost alternative instead of Apache 2. I plead
complete ignorance of this topic and defer to those who know about
such things: can we actually do this? I guess at a minimum a copy of
the licence would need to appear somewhere -- perhaps under
src/backend/jit/llvm? 4d says that if you modified the code you have
to say so prominently, but I did that at the top (and the changes are
completely trivial, just some #ifdef swizzling to massage some
function prototypes to suit older LLVMs). Otherwise I understand it
to be generally "BSD-like" (sans advert clause) but there is also some
stuff about patents, which surely aren't relevant to this in
practice... but I know that some projects object to it on principle
and because it smells like contract law, or something.... not an area
I am well informed about. Who should I be asking? (Naively, I
wondered: could there be some kind of fair use concept for
back-patching fixes to broken libraries that you're merely a user of
where you can be excused from the burdens of a distributor? Yeah
wishful thinking I'm sure.)
Attachments:
v7-0001-Monkey-patch-LLVM-code-to-fix-ARM-relocation-bug.patch
From 5d9ef5f83a79f56ddd6230c7e9428525bab0d82c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 08:58:48 +1200
Subject: [PATCH v7] Monkey-patch LLVM code to fix ARM relocation bug.
Supply a new memory manager for RuntimeDyld that avoids overflowing the
jump distance in linked code on large memory machines. This is the same
Apache v2-licensed code that is normally used inside the LLVM library,
except this copy has been patched by Michael Smith in his proposed fix
https://www.github.com/llvm/llvm-project/pull/71968.
We hereby slurp it into our own source tree, after moving into a new
namespace llvm::backport and making some minor adjustments so that it
can be compiled with older LLVM versions as far back as 12. It's harder
to make it work on even older LLVM versions, but it doesn't seem likely
that people are really using them so that is not investigated for now.
(We were probably remiss in not disclaiming support for ancient LLVM
releases sooner; starting from PostgreSQL 16 we've been more aggressive
about trimming version support to match the trailing end of common
software distributions).
The problem could also have been addressed by switching to JITLink, and
that is the LLVM project's recommended solution as RuntimeDyld is about
to be deprecated. We'll have to do that soon enough anyway, and then
when the LLVM version support window advances far enough in a few years
we'll be able to delete this code. Unfortunately that wouldn't work
today for some relevant releases of LLVM where JITLink is missing or
incomplete.
Several other projects have already taken the approach of back-porting
this fix into their fork of LLVM, which is a vote of confidence despite
the lack of commit into LLVM as of today. We don't have our own copy of
LLVM so we can't do exactly what they've done. We just pick up the
whole patched class and inject an instance of it into the regular LLVM
code.
The changes that we've had to make to our copy can be seen by diffing
our SectionMemoryManager.{h,cpp} files against the ones in the tree of
the pull request.
The LLVM project hasn't chosen to commit the fix yet, and even if it
did, it wouldn't be back-ported into the releases of LLVM that most of
our users care about, so there is not much point in waiting any longer
for that. If they make further changes and commit it to LLVM 19 or 20,
we'll still need this for older versions, but we may want to
resynchronize our copy.
This should fix the spate of crash reports we've been receiving lately
from users on large memory ARM systems.
Back-patch to all supported releases.
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Co-authored-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/CAO6_Xqr63qj%3DSx7HY6ZiiQ6R_JbX%2B-p6sTPwDYwTWZjUmjsYBg%40mail.gmail.com
---
src/backend/jit/llvm/Makefile | 3 +-
src/backend/jit/llvm/SectionMemoryManager.cpp | 394 ++++++++++++++++++
src/backend/jit/llvm/llvmjit.c | 7 +
src/backend/jit/llvm/llvmjit_wrap.cpp | 20 +
src/backend/jit/llvm/meson.build | 1 +
src/include/jit/SectionMemoryManager.h | 227 ++++++++++
src/include/jit/llvmjit.h | 8 +
src/include/jit/llvmjit_backport.h | 20 +
src/tools/pginclude/headerscheck | 3 +
9 files changed, 682 insertions(+), 1 deletion(-)
create mode 100644 src/backend/jit/llvm/SectionMemoryManager.cpp
create mode 100644 src/include/jit/SectionMemoryManager.h
create mode 100644 src/include/jit/llvmjit_backport.h
diff --git a/src/backend/jit/llvm/Makefile b/src/backend/jit/llvm/Makefile
index bfe5c207a2b..e8c12060b93 100644
--- a/src/backend/jit/llvm/Makefile
+++ b/src/backend/jit/llvm/Makefile
@@ -47,7 +47,8 @@ OBJS += \
llvmjit.o \
llvmjit_error.o \
llvmjit_inline.o \
- llvmjit_wrap.o
+ llvmjit_wrap.o \
+ SectionMemoryManager.o
# Code generation
OBJS += \
diff --git a/src/backend/jit/llvm/SectionMemoryManager.cpp b/src/backend/jit/llvm/SectionMemoryManager.cpp
new file mode 100644
index 00000000000..f88ca3412e2
--- /dev/null
+++ b/src/backend/jit/llvm/SectionMemoryManager.cpp
@@ -0,0 +1,394 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to llvm::backport::SectionMemoryManager, so we can support
+ * the ARM memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.cpp - Memory manager for MCJIT/RtDyld *- C++ -*-==//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements the section-based memory manager used by the MCJIT
+// execution engine and RuntimeDyld
+//
+//===----------------------------------------------------------------------===//
+
+#include "jit/llvmjit_backport.h"
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+
+#include "jit/SectionMemoryManager.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Process.h"
+
+namespace llvm {
+namespace backport {
+
+bool SectionMemoryManager::hasSpace(const MemoryGroup &MemGroup,
+ uintptr_t Size) const {
+ for (const FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= Size)
+ return true;
+ }
+ return false;
+}
+
+#if LLVM_VERSION_MAJOR < 16
+void SectionMemoryManager::reserveAllocationSpace(uintptr_t CodeSize,
+ uint32_t CodeAlign_i,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign_i,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign_i) {
+ Align CodeAlign(CodeAlign_i);
+ Align RODataAlign(RODataAlign_i);
+ Align RWDataAlign(RWDataAlign_i);
+#else
+void SectionMemoryManager::reserveAllocationSpace(
+ uintptr_t CodeSize, Align CodeAlign, uintptr_t RODataSize,
+ Align RODataAlign, uintptr_t RWDataSize, Align RWDataAlign) {
+#endif
+ if (CodeSize == 0 && RODataSize == 0 && RWDataSize == 0)
+ return;
+
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ // Code alignment needs to be at least the stub alignment - however, we
+ // don't have an easy way to get that here so as a workaround, we assume
+ // it's 8, which is the largest value I observed across all platforms.
+ constexpr uint64_t StubAlign = 8;
+ CodeAlign = Align(std::max(CodeAlign.value(), StubAlign));
+ RODataAlign = Align(std::max(RODataAlign.value(), StubAlign));
+ RWDataAlign = Align(std::max(RWDataAlign.value(), StubAlign));
+
+ // Get space required for each section. Use the same calculation as
+ // allocateSection because we need to be able to satisfy it.
+ uint64_t RequiredCodeSize = alignTo(CodeSize, CodeAlign) + CodeAlign.value();
+ uint64_t RequiredRODataSize =
+ alignTo(RODataSize, RODataAlign) + RODataAlign.value();
+ uint64_t RequiredRWDataSize =
+ alignTo(RWDataSize, RWDataAlign) + RWDataAlign.value();
+
+ if (hasSpace(CodeMem, RequiredCodeSize) &&
+ hasSpace(RODataMem, RequiredRODataSize) &&
+ hasSpace(RWDataMem, RequiredRWDataSize)) {
+ // Sufficient space in contiguous block already available.
+ return;
+ }
+
+ // MemoryManager does not have functions for releasing memory after it's
+ // allocated. Normally it tries to use any excess blocks that were allocated
+ // due to page alignment, but if we have insufficient free memory for the
+ // request this can lead to allocating disparate memory that can violate the
+ // ARM ABI. Clear free memory so only the new allocations are used, but do
+ // not release allocated memory as it may still be in-use.
+ CodeMem.FreeMem.clear();
+ RODataMem.FreeMem.clear();
+ RWDataMem.FreeMem.clear();
+
+ // Round up to the nearest page size. Blocks must be page-aligned.
+ RequiredCodeSize = alignTo(RequiredCodeSize, PageSize);
+ RequiredRODataSize = alignTo(RequiredRODataSize, PageSize);
+ RequiredRWDataSize = alignTo(RequiredRWDataSize, PageSize);
+ uint64_t RequiredSize =
+ RequiredCodeSize + RequiredRODataSize + RequiredRWDataSize;
+
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ AllocationPurpose::RWData, RequiredSize, nullptr,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ return;
+ }
+ // CodeMem will arbitrarily own this MemoryBlock to handle cleanup.
+ CodeMem.AllocatedMem.push_back(MB);
+ uintptr_t Addr = (uintptr_t)MB.base();
+ FreeMemBlock FreeMB;
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+
+ if (CodeSize > 0) {
+ assert(isAddrAligned(CodeAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredCodeSize);
+ CodeMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredCodeSize;
+ }
+
+ if (RODataSize > 0) {
+ assert(isAddrAligned(RODataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRODataSize);
+ RODataMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredRODataSize;
+ }
+
+ if (RWDataSize > 0) {
+ assert(isAddrAligned(RWDataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRWDataSize);
+ RWDataMem.FreeMem.push_back(FreeMB);
+ }
+}
+
+uint8_t *SectionMemoryManager::allocateDataSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName,
+ bool IsReadOnly) {
+ if (IsReadOnly)
+ return allocateSection(SectionMemoryManager::AllocationPurpose::ROData,
+ Size, Alignment);
+ return allocateSection(SectionMemoryManager::AllocationPurpose::RWData, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateCodeSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) {
+ return allocateSection(SectionMemoryManager::AllocationPurpose::Code, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateSection(
+ SectionMemoryManager::AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment) {
+ if (!Alignment)
+ Alignment = 16;
+
+ assert(!(Alignment & (Alignment - 1)) && "Alignment must be a power of two.");
+
+ uintptr_t RequiredSize = Alignment * ((Size + Alignment - 1) / Alignment + 1);
+ uintptr_t Addr = 0;
+
+ MemoryGroup &MemGroup = [&]() -> MemoryGroup & {
+ switch (Purpose) {
+ case AllocationPurpose::Code:
+ return CodeMem;
+ case AllocationPurpose::ROData:
+ return RODataMem;
+ case AllocationPurpose::RWData:
+ return RWDataMem;
+ }
+ llvm_unreachable("Unknown SectionMemoryManager::AllocationPurpose");
+ }();
+
+ // Look in the list of free memory regions and use a block there if one
+ // is available.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= RequiredSize) {
+ Addr = (uintptr_t)FreeMB.Free.base();
+ uintptr_t EndOfBlock = Addr + FreeMB.Free.allocatedSize();
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ if (FreeMB.PendingPrefixIndex == (unsigned)-1) {
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // Remember this pending block, such that future allocations can just
+ // modify it rather than creating a new one
+ FreeMB.PendingPrefixIndex = MemGroup.PendingMem.size() - 1;
+ } else {
+ sys::MemoryBlock &PendingMB =
+ MemGroup.PendingMem[FreeMB.PendingPrefixIndex];
+ PendingMB = sys::MemoryBlock(PendingMB.base(),
+ Addr + Size - (uintptr_t)PendingMB.base());
+ }
+
+ // Remember how much free space is now left in this block
+ FreeMB.Free =
+ sys::MemoryBlock((void *)(Addr + Size), EndOfBlock - Addr - Size);
+ return (uint8_t *)Addr;
+ }
+ }
+
+ // No pre-allocated free block was large enough. Allocate a new memory region.
+ // Note that all sections get allocated as read-write. The permissions will
+ // be updated later based on memory group.
+ //
+ // FIXME: It would be useful to define a default allocation size (or add
+ // it as a constructor parameter) to minimize the number of allocations.
+ //
+ // FIXME: Initialize the Near member for each memory group to avoid
+ // interleaving.
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ Purpose, RequiredSize, &MemGroup.Near,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ // FIXME: Add error propagation to the interface.
+ return nullptr;
+ }
+
+ // Save this address as the basis for our next request
+ MemGroup.Near = MB;
+
+ // Copy the address to all the other groups, if they have not
+ // been initialized.
+ if (CodeMem.Near.base() == nullptr)
+ CodeMem.Near = MB;
+ if (RODataMem.Near.base() == nullptr)
+ RODataMem.Near = MB;
+ if (RWDataMem.Near.base() == nullptr)
+ RWDataMem.Near = MB;
+
+ // Remember that we allocated this memory
+ MemGroup.AllocatedMem.push_back(MB);
+ Addr = (uintptr_t)MB.base();
+ uintptr_t EndOfBlock = Addr + MB.allocatedSize();
+
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // The allocateMappedMemory may allocate much more memory than we need. In
+ // this case, we store the unused memory as a free memory block.
+ unsigned FreeSize = EndOfBlock - Addr - Size;
+ if (FreeSize > 16) {
+ FreeMemBlock FreeMB;
+ FreeMB.Free = sys::MemoryBlock((void *)(Addr + Size), FreeSize);
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ MemGroup.FreeMem.push_back(FreeMB);
+ }
+
+ // Return aligned address
+ return (uint8_t *)Addr;
+}
+
+bool SectionMemoryManager::finalizeMemory(std::string *ErrMsg) {
+ // FIXME: Should in-progress permissions be reverted if an error occurs?
+ std::error_code ec;
+
+ // Make code memory executable.
+ ec = applyMemoryGroupPermissions(CodeMem,
+ sys::Memory::MF_READ | sys::Memory::MF_EXEC);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Make read-only data memory read-only.
+ ec = applyMemoryGroupPermissions(RODataMem, sys::Memory::MF_READ);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Read-write data memory already has the correct permissions
+
+ // Some platforms with separate data cache and instruction cache require
+ // explicit cache flush, otherwise JIT code manipulations (like resolved
+ // relocations) will get to the data cache but not to the instruction cache.
+ invalidateInstructionCache();
+
+ return false;
+}
+
+static sys::MemoryBlock trimBlockToPageSize(sys::MemoryBlock M) {
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ size_t StartOverlap =
+ (PageSize - ((uintptr_t)M.base() % PageSize)) % PageSize;
+
+ size_t TrimmedSize = M.allocatedSize();
+ TrimmedSize -= StartOverlap;
+ TrimmedSize -= TrimmedSize % PageSize;
+
+ sys::MemoryBlock Trimmed((void *)((uintptr_t)M.base() + StartOverlap),
+ TrimmedSize);
+
+ assert(((uintptr_t)Trimmed.base() % PageSize) == 0);
+ assert((Trimmed.allocatedSize() % PageSize) == 0);
+ assert(M.base() <= Trimmed.base() &&
+ Trimmed.allocatedSize() <= M.allocatedSize());
+
+ return Trimmed;
+}
+
+std::error_code
+SectionMemoryManager::applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions) {
+ for (sys::MemoryBlock &MB : MemGroup.PendingMem)
+ if (std::error_code EC = MMapper->protectMappedMemory(MB, Permissions))
+ return EC;
+
+ MemGroup.PendingMem.clear();
+
+ // Now go through free blocks and trim any of them that don't span the entire
+ // page because one of the pending blocks may have overlapped it.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ FreeMB.Free = trimBlockToPageSize(FreeMB.Free);
+ // We cleared the PendingMem list, so all these pointers are now invalid
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ }
+
+ // Remove all blocks which are now empty
+ erase_if(MemGroup.FreeMem, [](FreeMemBlock &FreeMB) {
+ return FreeMB.Free.allocatedSize() == 0;
+ });
+
+ return std::error_code();
+}
+
+void SectionMemoryManager::invalidateInstructionCache() {
+ for (sys::MemoryBlock &Block : CodeMem.PendingMem)
+ sys::Memory::InvalidateInstructionCache(Block.base(),
+ Block.allocatedSize());
+}
+
+SectionMemoryManager::~SectionMemoryManager() {
+ for (MemoryGroup *Group : {&CodeMem, &RWDataMem, &RODataMem}) {
+ for (sys::MemoryBlock &Block : Group->AllocatedMem)
+ MMapper->releaseMappedMemory(Block);
+ }
+}
+
+SectionMemoryManager::MemoryMapper::~MemoryMapper() = default;
+
+void SectionMemoryManager::anchor() {}
+
+namespace {
+// Trivial implementation of SectionMemoryManager::MemoryMapper that just calls
+// into sys::Memory.
+class DefaultMMapper final : public SectionMemoryManager::MemoryMapper {
+public:
+ sys::MemoryBlock
+ allocateMappedMemory(SectionMemoryManager::AllocationPurpose Purpose,
+ size_t NumBytes, const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) override {
+ return sys::Memory::allocateMappedMemory(NumBytes, NearBlock, Flags, EC);
+ }
+
+ std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) override {
+ return sys::Memory::protectMappedMemory(Block, Flags);
+ }
+
+ std::error_code releaseMappedMemory(sys::MemoryBlock &M) override {
+ return sys::Memory::releaseMappedMemory(M);
+ }
+};
+} // namespace
+
+SectionMemoryManager::SectionMemoryManager(MemoryMapper *UnownedMM,
+ bool ReserveAlloc)
+ : MMapper(UnownedMM), OwnedMMapper(nullptr),
+ ReserveAllocation(ReserveAlloc) {
+ if (!MMapper) {
+ OwnedMMapper = std::make_unique<DefaultMMapper>();
+ MMapper = OwnedMMapper.get();
+ }
+}
+
+} // namespace backport
+} // namespace llvm
+
+#endif
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index f346af9c755..72411d0f805 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -34,6 +34,7 @@
#endif
#include "jit/llvmjit.h"
+#include "jit/llvmjit_backport.h"
#include "jit/llvmjit_emit.h"
#include "miscadmin.h"
#include "portability/instr_time.h"
@@ -1166,8 +1167,14 @@ llvm_log_jit_error(void *ctx, LLVMErrorRef error)
static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES, const char *Triple)
{
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+ LLVMOrcObjectLayerRef objlayer =
+ LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
+#else
LLVMOrcObjectLayerRef objlayer =
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
+#endif
+
#if defined(HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER) && HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
if (jit_debugging_support)
diff --git a/src/backend/jit/llvm/llvmjit_wrap.cpp b/src/backend/jit/llvm/llvmjit_wrap.cpp
index 7f7623dac64..9e8d63f0122 100644
--- a/src/backend/jit/llvm/llvmjit_wrap.cpp
+++ b/src/backend/jit/llvm/llvmjit_wrap.cpp
@@ -20,6 +20,14 @@ extern "C"
#include <llvm/IR/Function.h>
#include "jit/llvmjit.h"
+#include "jit/llvmjit_backport.h"
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
+#include <llvm/ExecutionEngine/SectionMemoryManager.h>
+#include "jit/SectionMemoryManager.h"
+#include <llvm/Support/CBindingWrapping.h>
+#endif
/*
@@ -37,3 +45,15 @@ LLVMGetFunctionType(LLVMValueRef r)
{
return llvm::wrap(llvm::unwrap<llvm::Function>(r)->getFunctionType());
}
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef);
+
+LLVMOrcObjectLayerRef
+LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES)
+{
+ return wrap(new llvm::orc::RTDyldObjectLinkingLayer(
+ *unwrap(ES), [] { return std::make_unique<llvm::backport::SectionMemoryManager>(nullptr, true); }));
+}
+#endif
diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build
index c1b6a9363eb..ce49043f1d0 100644
--- a/src/backend/jit/llvm/meson.build
+++ b/src/backend/jit/llvm/meson.build
@@ -14,6 +14,7 @@ llvmjit_sources += files(
'llvmjit_error.cpp',
'llvmjit_inline.cpp',
'llvmjit_wrap.cpp',
+ 'SectionMemoryManager.cpp',
)
# Code generation
diff --git a/src/include/jit/SectionMemoryManager.h b/src/include/jit/SectionMemoryManager.h
new file mode 100644
index 00000000000..d644ff921dc
--- /dev/null
+++ b/src/include/jit/SectionMemoryManager.h
@@ -0,0 +1,227 @@
+/*
+ * This file taken from https://github.com/llvm/llvm-project/pull/71968, with
+ * the name changed to llvm::backport::SectionMemoryManager, so we can support
+ * the ARM memory model on broken LLVM versions.
+ */
+
+//===- SectionMemoryManager.h - Memory manager for MCJIT/RtDyld -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declaration of a section-based memory manager used by
+// the MCJIT execution engine and RuntimeDyld.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+#define LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
+#include "llvm/Support/Alignment.h"
+#include "llvm/Support/Memory.h"
+#include <cstdint>
+#include <string>
+#include <system_error>
+
+namespace llvm {
+namespace backport {
+
+/// This is a simple memory manager which implements the methods called by
+/// the RuntimeDyld class to allocate memory for section-based loading of
+/// objects, usually those generated by the MCJIT execution engine.
+///
+/// This memory manager allocates all section memory as read-write. The
+/// RuntimeDyld will copy JITed section memory into these allocated blocks
+/// and perform any necessary linking and relocations.
+///
+/// Any client using this memory manager MUST ensure that section-specific
+/// page permissions have been applied before attempting to execute functions
+/// in the JITed object. Permissions can be applied either by calling
+/// MCJIT::finalizeObject or by calling SectionMemoryManager::finalizeMemory
+/// directly. Clients of MCJIT should call MCJIT::finalizeObject.
+class SectionMemoryManager : public RTDyldMemoryManager {
+public:
+ /// This enum describes the various reasons to allocate pages from
+ /// allocateMappedMemory.
+ enum class AllocationPurpose {
+ Code,
+ ROData,
+ RWData,
+ };
+
+ /// Implementations of this interface are used by SectionMemoryManager to
+ /// request pages from the operating system.
+ class MemoryMapper {
+ public:
+ /// This method attempts to allocate \p NumBytes bytes of virtual memory for
+ /// \p Purpose. \p NearBlock may point to an existing allocation, in which
+ /// case an attempt is made to allocate more memory near the existing block.
+ /// The actual allocated address is not guaranteed to be near the requested
+ /// address. \p Flags is used to set the initial protection flags for the
+ /// block of the memory. \p EC [out] returns an object describing any error
+ /// that occurs.
+ ///
+ /// This method may allocate more than the number of bytes requested. The
+ /// actual number of bytes allocated is indicated in the returned
+ /// MemoryBlock.
+ ///
+ /// The start of the allocated block must be aligned with the system
+ /// allocation granularity (64K on Windows, page size on Linux). If the
+ /// address following \p NearBlock is not so aligned, it will be rounded up
+ /// to the next allocation granularity boundary.
+ ///
+ /// \r a non-null MemoryBlock if the function was successful, otherwise a
+ /// null MemoryBlock with \p EC describing the error.
+ virtual sys::MemoryBlock
+ allocateMappedMemory(AllocationPurpose Purpose, size_t NumBytes,
+ const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) = 0;
+
+ /// This method sets the protection flags for a block of memory to the state
+ /// specified by \p Flags. The behavior is not specified if the memory was
+ /// not allocated using the allocateMappedMemory method.
+ /// \p Block describes the memory block to be protected.
+ /// \p Flags specifies the new protection state to be assigned to the block.
+ ///
+ /// If \p Flags is MF_WRITE, the actual behavior varies with the operating
+ /// system (i.e. MF_READ | MF_WRITE on Windows) and the target architecture
+ /// (i.e. MF_WRITE -> MF_READ | MF_WRITE on i386).
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) = 0;
+
+ /// This method releases a block of memory that was allocated with the
+ /// allocateMappedMemory method. It should not be used to release any memory
+ /// block allocated any other way.
+ /// \p Block describes the memory to be released.
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code releaseMappedMemory(sys::MemoryBlock &M) = 0;
+
+ virtual ~MemoryMapper();
+ };
+
+ /// Creates a SectionMemoryManager instance with \p MM as the associated
+ /// memory mapper. If \p MM is nullptr then a default memory mapper is used
+ /// that directly calls into the operating system.
+ ///
+ /// If \p ReserveAlloc is true all memory will be pre-allocated, and any
+ /// attempts to allocate beyond pre-allocated memory will fail.
+ SectionMemoryManager(MemoryMapper *MM = nullptr, bool ReserveAlloc = false);
+ SectionMemoryManager(const SectionMemoryManager &) = delete;
+ void operator=(const SectionMemoryManager &) = delete;
+ ~SectionMemoryManager() override;
+
+ /// Enable reserveAllocationSpace when requested.
+ bool needsToReserveAllocationSpace() override { return ReserveAllocation; }
+
+ /// Implements allocating all memory in a single block. This is required to
+ /// limit memory offsets to fit the ARM ABI; large memory systems may
+ /// otherwise allocate separate sections too far apart.
+#if LLVM_VERSION_MAJOR < 16
+ virtual void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign) override;
+#else
+ void reserveAllocationSpace(uintptr_t CodeSize, Align CodeAlign,
+ uintptr_t RODataSize, Align RODataAlign,
+ uintptr_t RWDataSize, Align RWDataAlign) override;
+#endif
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// executable code.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateCodeSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) override;
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// data.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateDataSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID, StringRef SectionName,
+ bool isReadOnly) override;
+
+ /// Update section-specific memory permissions and other attributes.
+ ///
+ /// This method is called when object loading is complete and section page
+ /// permissions can be applied. It is up to the memory manager implementation
+ /// to decide whether or not to act on this method. The memory manager will
+ /// typically allocate all sections as read-write and then apply specific
+ /// permissions when this method is called. Code sections cannot be executed
+ /// until this function has been called. In addition, any cache coherency
+ /// operations needed to reliably use the memory are also performed.
+ ///
+ /// \returns true if an error occurred, false otherwise.
+ bool finalizeMemory(std::string *ErrMsg = nullptr) override;
+
+ /// Invalidate instruction cache for code sections.
+ ///
+ /// Some platforms with separate data cache and instruction cache require
+ /// explicit cache flush, otherwise JIT code manipulations (like resolved
+ /// relocations) will get to the data cache but not to the instruction cache.
+ ///
+ /// This method is called from finalizeMemory.
+ virtual void invalidateInstructionCache();
+
+private:
+ struct FreeMemBlock {
+ // The actual block of free memory
+ sys::MemoryBlock Free;
+ // If there is a pending allocation from the same reservation right before
+ // this block, store its index in PendingMem, to be able to update the
+ // pending region if part of this block is allocated, rather than having to
+ // create a new one
+ unsigned PendingPrefixIndex;
+ };
+
+ struct MemoryGroup {
+ // PendingMem contains all blocks of memory (subblocks of AllocatedMem)
+ // which have not yet had their permissions applied, but have been given
+ // out to the user. FreeMem contains all blocks of memory which have
+ // neither had their permissions applied, nor been given out to the user.
+ SmallVector<sys::MemoryBlock, 16> PendingMem;
+ SmallVector<FreeMemBlock, 16> FreeMem;
+
+ // All memory blocks that have been requested from the system
+ SmallVector<sys::MemoryBlock, 16> AllocatedMem;
+
+ sys::MemoryBlock Near;
+ };
+
+ uint8_t *allocateSection(AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment);
+
+ std::error_code applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions);
+
+ bool hasSpace(const MemoryGroup &MemGroup, uintptr_t Size) const;
+
+ void anchor() override;
+
+ MemoryGroup CodeMem;
+ MemoryGroup RWDataMem;
+ MemoryGroup RODataMem;
+ MemoryMapper *MMapper;
+ std::unique_ptr<MemoryMapper> OwnedMMapper;
+ bool ReserveAllocation;
+};
+
+} // end namespace backport
+} // end namespace llvm
+
+#endif // LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 420775b1899..dc0d86f4c1d 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -17,7 +17,12 @@
*/
#ifdef USE_LLVM
+#include "jit/llvmjit_backport.h"
+
#include <llvm-c/Types.h>
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#include <llvm-c/OrcEE.h>
+#endif
/*
@@ -135,6 +140,9 @@ extern LLVMValueRef slot_compile_deform(struct LLVMJitContext *context, TupleDes
*/
extern LLVMTypeRef LLVMGetFunctionReturnType(LLVMValueRef r);
extern LLVMTypeRef LLVMGetFunctionType(LLVMValueRef r);
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+extern LLVMOrcObjectLayerRef LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES);
+#endif
#ifdef __cplusplus
} /* extern "C" */
diff --git a/src/include/jit/llvmjit_backport.h b/src/include/jit/llvmjit_backport.h
new file mode 100644
index 00000000000..10b7b5cae87
--- /dev/null
+++ b/src/include/jit/llvmjit_backport.h
@@ -0,0 +1,20 @@
+/*
+ * A small header that can be included by backported LLVM code or PostgreSQL
+ * code, to control conditional compilation.
+ */
+#ifndef LLVMJIT_BACKPORT_H
+#define LLVMJIT_BACKPORT_H
+
+#include <llvm/Config/llvm-config.h>
+
+/*
+ * LLVM RuntimeDyld can produce code that crashes on larger memory ARM systems,
+ * due to memory placement. As a workaround, we supply an alternative memory
+ * manager class from https://github.com/llvm/llvm-project/pull/71968, but
+ * we've only backported it as far as LLVM 12.
+ */
+#if defined(__aarch64__) && LLVM_VERSION_MAJOR > 11
+#define USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#endif
+
+#endif
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 3fc737d2cc1..fb91b5760c4 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -168,6 +168,9 @@ do
# This produces a "no previous prototype" warning.
! $cplusplus && test "$f" = src/include/storage/checksum_impl.h && continue
+ # SectionMemoryManager.h is C++
+ test "$f" = src/include/jit/SectionMemoryManager.h && continue
+
# ppport.h is not under our control, so we can't make it standalone.
test "$f" = src/pl/plperl/ppport.h && continue
--
2.46.2
On Thu, Oct 31, 2024 at 6:49 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> There are a couple of cases of dual-licensed code in our tree where we
> explicitly used the Boost alternative instead of Apache 2. I plead
> complete ignorance of this topic and defer to those who know about
> such things: can we actually do this? I guess at a minimum a copy of
> the licence would need to appear somewhere -- perhaps under
> src/backend/jit/llvm?
I'm also not super knowledgeable about the licensing intricacies but I
read it the same way - a license file has to be provided due to the 4a
clause. llvmlite did this when they added the patched memory
manager [1].

[1] https://github.com/numba/llvmlite/pull/1009/files#diff-80b149f35cebd583e21dfc49c0007a7fab89c3c6d07c028e4a87de0848aa2ed8
> 4d says that if you modified the code you have
> to say so prominently, but I did that at the top (and the changes are
> completely trivial, just some #ifdef swizzling to massage some
> function prototypes to suit older LLVMs). Otherwise I understand it
> to be generally "BSD-like" (sans advert clause) but there is also some
> stuff about patents, which surely aren't relevant to this in
> practice... but I know that some projects object to it on principle
> and because it smells like contract law, or something... not an area
> I am well informed about. Who should I be asking? (Naively, I
> wondered: could there be some kind of fair use concept for
> back-patching fixes to broken libraries that you're merely a user of
> where you can be excused from the burdens of a distributor? Yeah
> wishful thinking I'm sure.)
You mean 4b, right? LLVM doesn't seem to have any NOTICE files so the
4d clause shouldn't apply. The top comment looks fine to notify the
source of the modified file and how it was changed. But again, I don't
have much experience in this so I can't be sure.
On 31 Oct 2024, at 06:48, Thomas Munro <thomas.munro@gmail.com> wrote:
> I guess at a minimum a copy of the licence would need to appear somewhere
That's my interpretation of it as well.
> perhaps under src/backend/jit/llvm?
Since SectionMemoryManager.h is in src/backend/jit I wonder if it should be
placed in a section in src/backend/jit/README with an overview of the what and
why (or maybe a new src/backend/jit/llvm/README would be even better). The
license doesn't have to be in a separate file AFAICT and including a (version
of) your excellent summary in the commit message along with it would probably
help readers.
--
Daniel Gustafsson
On Thu, Oct 31, 2024 at 9:49 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 31 Oct 2024, at 06:48, Thomas Munro <thomas.munro@gmail.com> wrote:
>> I guess at a minimum a copy of the licence would need to appear somewhere
> That's my interpretation of it as well.
>> perhaps under src/backend/jit/llvm?
> Since SectionMemoryManager.h is in src/backend/jit I wonder if it should be
> placed in a section in src/backend/jit/README with an overview of the what and
> why (or maybe a new src/backend/jit/llvm/README would be even better). The
> license doesn't have to be in a separate file AFAICT and including a (version
> of) your excellent summary in the commit message along with it would probably
> help readers.
Thank you. I figured that
src/backend/jit/llvm/SectionMemoryManager.LICENSE would be a good
place, to make it clear which code it covers and to remember to delete
it when the time comes. If we ever have to do more of this sort of
thing we might want to rename it to something more general, but I hope
not! I also updated the comments with a briefer summary of the points
from the commit message, at the top of the .cpp file. I added a
pgindent exclusion for the new alien header, or else it gets
scrambled.
I was worried for a while about the C++14 code in here (eg
std::make_unique), something we've avoided using in the past, but
after a bit of 3D versionography I'm pretty sure there is no issue and
we don't have to contort the code to avoid it.
Reasoning: Old LLVM required C++11. LLVM 9 switched to C++14. LLVM
16 switched to C++17. Pretty soon they'll flip to C++20 or C++23; they
don't mess around. The corresponding -std=c++XX flag finishes up in
our compile lines, because llvm-config --cxxflags spits it out, to
match the features they're using in headers that we include (easy to
spot examples being std::make_unique (C++14) and std::string_view
(C++17)), so you might say that PostgreSQL indirectly chases C++
standards much faster than it chases C standards. This particular
code is a special case because it's guarded for LLVM 12+ only, so it's
OK to use C++14 in that limited context even in back branches. We
have to be careful that it doesn't contain C++17 code since it came
from recent LLVM, but it doesn't seem to by inspection, and you can
check on a machine with CXX=g++ and LLVM 14 on Linux, which uses
-std=c++14 and fails if you add a use of <string_view> and
std::string_view. (Warning: the system C++ standard library on Macs
and other Clang-based systems doesn't have enough version guards so it
won't complain, but GCC and its standard library will explicitly tell
you not to use C++17 features in a C++14 program.)
If there are no further comments, I'm going to push this to all
branches tomorrow morning. For master only, I will remove the #if
condition and comment about LLVM 12+, as we now require 14+.
Attachments:
v8-0001-Monkey-patch-LLVM-code-to-fix-ARM-relocation-bug.patch (application/x-patch)
From f81eb027042a35a8c93778526104a912a642c919 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 08:58:48 +1200
Subject: [PATCH v8] Monkey-patch LLVM code to fix ARM relocation bug.
Supply a new memory manager for RuntimeDyld, to avoid crashes in
generated code caused by memory placement that can overflow a 32 bit
data type. This is a drop-in replacement for the
llvm::SectionMemoryManager class in the LLVM library, with patches from
Michael Smith's proposed fix at
https://www.github.com/llvm/llvm-project/pull/71968.
We hereby slurp it into our own source tree, after moving it into a new
namespace llvm::backport and making some minor adjustments so that it
can be compiled with older LLVM versions as far back as 12. It's harder
to make it work on even older LLVM versions, but it doesn't seem likely
that people are really using them so that is not investigated for now.
The problem could also be addressed by switching to JITLink instead of
RuntimeDyld, and that is the LLVM project's recommended solution as
the latter is about to be deprecated. We'll have to do that soon enough
anyway, and then when the LLVM version support window advances far
enough in a few years we'll be able to delete this code. Unfortunately
that wouldn't be enough for PostgreSQL today: in most relevant versions
of LLVM, JITLink is missing or incomplete.
Several other projects have already back-ported this fix into their fork
of LLVM, which is a vote of confidence despite the lack of commit into
LLVM as of today. We don't have our own copy of LLVM so we can't do
exactly what they've done; instead we have a copy of the whole patched
class so we can pass an instance of it to RuntimeDyld.
The LLVM project hasn't chosen to commit the fix yet, and even if it
did, it wouldn't be back-ported into the releases of LLVM that most of
our users care about, so there is not much point in waiting any longer
for that. If they make further changes and commit it to LLVM 19 or 20,
we'll still need this for older versions, but we may want to
resynchronize our copy and update some comments.
The changes that we've had to make to our copy can be seen by diffing
our SectionMemoryManager.{h,cpp} files against the ones in the tree of
the pull request. Per the LLVM project's license requirements, a copy
is in SectionMemoryManager.LICENSE.
This should fix the spate of crash reports we've been receiving lately
from users on large memory ARM systems.
Back-patch to all supported releases.
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Co-authored-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se> (license aspects)
Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/CAO6_Xqr63qj%3DSx7HY6ZiiQ6R_JbX%2B-p6sTPwDYwTWZjUmjsYBg%40mail.gmail.com
---
src/backend/jit/llvm/Makefile | 3 +-
.../jit/llvm/SectionMemoryManager.LICENSE | 279 ++++++++++++
src/backend/jit/llvm/SectionMemoryManager.cpp | 412 ++++++++++++++++++
src/backend/jit/llvm/llvmjit.c | 7 +
src/backend/jit/llvm/llvmjit_wrap.cpp | 20 +
src/backend/jit/llvm/meson.build | 1 +
src/include/jit/SectionMemoryManager.h | 226 ++++++++++
src/include/jit/llvmjit.h | 8 +
src/include/jit/llvmjit_backport.h | 25 ++
src/tools/pginclude/headerscheck | 3 +
src/tools/pgindent/exclude_file_patterns | 3 +-
11 files changed, 985 insertions(+), 2 deletions(-)
create mode 100644 src/backend/jit/llvm/SectionMemoryManager.LICENSE
create mode 100644 src/backend/jit/llvm/SectionMemoryManager.cpp
create mode 100644 src/include/jit/SectionMemoryManager.h
create mode 100644 src/include/jit/llvmjit_backport.h
diff --git a/src/backend/jit/llvm/Makefile b/src/backend/jit/llvm/Makefile
index bfe5c207a2b..e8c12060b93 100644
--- a/src/backend/jit/llvm/Makefile
+++ b/src/backend/jit/llvm/Makefile
@@ -47,7 +47,8 @@ OBJS += \
llvmjit.o \
llvmjit_error.o \
llvmjit_inline.o \
- llvmjit_wrap.o
+ llvmjit_wrap.o \
+ SectionMemoryManager.o
# Code generation
OBJS += \
diff --git a/src/backend/jit/llvm/SectionMemoryManager.LICENSE b/src/backend/jit/llvm/SectionMemoryManager.LICENSE
new file mode 100644
index 00000000000..fa6ac540007
--- /dev/null
+++ b/src/backend/jit/llvm/SectionMemoryManager.LICENSE
@@ -0,0 +1,279 @@
+==============================================================================
+The LLVM Project is under the Apache License v2.0 with LLVM Exceptions:
+==============================================================================
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+
+---- LLVM Exceptions to the Apache 2.0 License ----
+
+As an exception, if, as a result of your compiling your source code, portions
+of this Software are embedded into an Object form of such source code, you
+may redistribute such embedded portions in such Object form without complying
+with the conditions of Sections 4(a), 4(b) and 4(d) of the License.
+
+In addition, if you combine or link compiled forms of this Software with
+software that is licensed under the GPLv2 ("Combined Software") and if a
+court of competent jurisdiction determines that the patent provision (Section
+3), the indemnity provision (Section 9) or other Section of the License
+conflicts with the conditions of the GPLv2, you may retroactively and
+prospectively choose to deem waived or otherwise exclude such Section(s) of
+the License, but only in their entirety and only with respect to the Combined
+Software.
+
+==============================================================================
+Software from third parties included in the LLVM Project:
+==============================================================================
+The LLVM Project contains third party software which is under different license
+terms. All such code will be identified clearly using at least one of two
+mechanisms:
+1) It will be in a separate directory tree with its own `LICENSE.txt` or
+ `LICENSE` file at the top containing the specific license and restrictions
+ which apply to that software, or
+2) It will contain specific license and restriction terms at the top of every
+ file.
+
+==============================================================================
+Legacy LLVM License (https://llvm.org/docs/DeveloperPolicy.html#legacy):
+==============================================================================
+University of Illinois/NCSA
+Open Source License
+
+Copyright (c) 2003-2019 University of Illinois at Urbana-Champaign.
+All rights reserved.
+
+Developed by:
+
+ LLVM Team
+
+ University of Illinois at Urbana-Champaign
+
+ http://llvm.org
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal with
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
+of the Software, and to permit persons to whom the Software is furnished to do
+so, subject to the following conditions:
+
+ * Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimers.
+
+ * Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimers in the
+ documentation and/or other materials provided with the distribution.
+
+ * Neither the names of the LLVM Team, University of Illinois at
+ Urbana-Champaign, nor the names of its contributors may be used to
+ endorse or promote products derived from this Software without specific
+ prior written permission.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE
+SOFTWARE.
+
diff --git a/src/backend/jit/llvm/SectionMemoryManager.cpp b/src/backend/jit/llvm/SectionMemoryManager.cpp
new file mode 100644
index 00000000000..70a2dba10a9
--- /dev/null
+++ b/src/backend/jit/llvm/SectionMemoryManager.cpp
@@ -0,0 +1,412 @@
+/*
+ * This file is from https://github.com/llvm/llvm-project/pull/71968
+ * with minor modifications to avoid name clash and work with older
+ * LLVM versions. The llvm::backport::SectionMemoryManager class is a
+ * drop-in replacement for llvm::SectionMemoryManager, for use with
+ * llvm::RuntimeDyld. It fixes a memory layout bug on large memory
+ * ARM systems (see pull request for details). If the LLVM project
+ * eventually commits the change, we may need to resynchronize our
+ * copy with any further modifications, but they would be unlikely to
+ * backport it into the LLVM versions that we target, so we would
+ * still need this copy.
+ *
+ * In the future we will switch to using JITLink instead of
+ * RuntimeDyld where possible, and later remove this code (.cpp, .h,
+ * .LICENSE) after all LLVM versions that we target allow it.
+ *
+ * This file is a modified copy of a part of the LLVM source code that
+ * we would normally access from the LLVM library. It is therefore
+ * covered by the license at https://llvm.org/LICENSE.txt, reproduced
+ * verbatim in SectionMemoryManager.LICENSE in fulfillment of clause
+ * 4a. The bugfix changes from the pull request are also covered, per
+ * clause 5.
+ */
+
+//===- SectionMemoryManager.cpp - Memory manager for MCJIT/RtDyld *- C++ -*-==//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements the section-based memory manager used by the MCJIT
+// execution engine and RuntimeDyld
+//
+//===----------------------------------------------------------------------===//
+
+#include "jit/llvmjit_backport.h"
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+
+#include "jit/SectionMemoryManager.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Process.h"
+
+namespace llvm {
+namespace backport {
+
+bool SectionMemoryManager::hasSpace(const MemoryGroup &MemGroup,
+ uintptr_t Size) const {
+ for (const FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= Size)
+ return true;
+ }
+ return false;
+}
+
+#if LLVM_VERSION_MAJOR < 16
+void SectionMemoryManager::reserveAllocationSpace(uintptr_t CodeSize,
+ uint32_t CodeAlign_i,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign_i,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign_i) {
+ Align CodeAlign(CodeAlign_i);
+ Align RODataAlign(RODataAlign_i);
+ Align RWDataAlign(RWDataAlign_i);
+#else
+void SectionMemoryManager::reserveAllocationSpace(
+ uintptr_t CodeSize, Align CodeAlign, uintptr_t RODataSize,
+ Align RODataAlign, uintptr_t RWDataSize, Align RWDataAlign) {
+#endif
+ if (CodeSize == 0 && RODataSize == 0 && RWDataSize == 0)
+ return;
+
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ // Code alignment needs to be at least the stub alignment - however, we
+ // don't have an easy way to get that here so as a workaround, we assume
+ // it's 8, which is the largest value I observed across all platforms.
+ constexpr uint64_t StubAlign = 8;
+ CodeAlign = Align(std::max(CodeAlign.value(), StubAlign));
+ RODataAlign = Align(std::max(RODataAlign.value(), StubAlign));
+ RWDataAlign = Align(std::max(RWDataAlign.value(), StubAlign));
+
+ // Get space required for each section. Use the same calculation as
+ // allocateSection because we need to be able to satisfy it.
+ uint64_t RequiredCodeSize = alignTo(CodeSize, CodeAlign) + CodeAlign.value();
+ uint64_t RequiredRODataSize =
+ alignTo(RODataSize, RODataAlign) + RODataAlign.value();
+ uint64_t RequiredRWDataSize =
+ alignTo(RWDataSize, RWDataAlign) + RWDataAlign.value();
+
+ if (hasSpace(CodeMem, RequiredCodeSize) &&
+ hasSpace(RODataMem, RequiredRODataSize) &&
+ hasSpace(RWDataMem, RequiredRWDataSize)) {
+ // Sufficient space in contiguous block already available.
+ return;
+ }
+
+ // MemoryManager does not have functions for releasing memory after it's
+ // allocated. Normally it tries to use any excess blocks that were allocated
+ // due to page alignment, but if we have insufficient free memory for the
+ // request this can lead to allocating disparate memory that can violate the
+ // ARM ABI. Clear free memory so only the new allocations are used, but do
+ // not release allocated memory as it may still be in-use.
+ CodeMem.FreeMem.clear();
+ RODataMem.FreeMem.clear();
+ RWDataMem.FreeMem.clear();
+
+ // Round up to the nearest page size. Blocks must be page-aligned.
+ RequiredCodeSize = alignTo(RequiredCodeSize, PageSize);
+ RequiredRODataSize = alignTo(RequiredRODataSize, PageSize);
+ RequiredRWDataSize = alignTo(RequiredRWDataSize, PageSize);
+ uint64_t RequiredSize =
+ RequiredCodeSize + RequiredRODataSize + RequiredRWDataSize;
+
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ AllocationPurpose::RWData, RequiredSize, nullptr,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ return;
+ }
+ // CodeMem will arbitrarily own this MemoryBlock to handle cleanup.
+ CodeMem.AllocatedMem.push_back(MB);
+ uintptr_t Addr = (uintptr_t)MB.base();
+ FreeMemBlock FreeMB;
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+
+ if (CodeSize > 0) {
+ assert(isAddrAligned(CodeAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredCodeSize);
+ CodeMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredCodeSize;
+ }
+
+ if (RODataSize > 0) {
+ assert(isAddrAligned(RODataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRODataSize);
+ RODataMem.FreeMem.push_back(FreeMB);
+ Addr += RequiredRODataSize;
+ }
+
+ if (RWDataSize > 0) {
+ assert(isAddrAligned(RWDataAlign, (void *)Addr));
+ FreeMB.Free = sys::MemoryBlock((void *)Addr, RequiredRWDataSize);
+ RWDataMem.FreeMem.push_back(FreeMB);
+ }
+}
+
+uint8_t *SectionMemoryManager::allocateDataSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName,
+ bool IsReadOnly) {
+ if (IsReadOnly)
+ return allocateSection(SectionMemoryManager::AllocationPurpose::ROData,
+ Size, Alignment);
+ return allocateSection(SectionMemoryManager::AllocationPurpose::RWData, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateCodeSection(uintptr_t Size,
+ unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) {
+ return allocateSection(SectionMemoryManager::AllocationPurpose::Code, Size,
+ Alignment);
+}
+
+uint8_t *SectionMemoryManager::allocateSection(
+ SectionMemoryManager::AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment) {
+ if (!Alignment)
+ Alignment = 16;
+
+ assert(!(Alignment & (Alignment - 1)) && "Alignment must be a power of two.");
+
+ uintptr_t RequiredSize = Alignment * ((Size + Alignment - 1) / Alignment + 1);
+ uintptr_t Addr = 0;
+
+ MemoryGroup &MemGroup = [&]() -> MemoryGroup & {
+ switch (Purpose) {
+ case AllocationPurpose::Code:
+ return CodeMem;
+ case AllocationPurpose::ROData:
+ return RODataMem;
+ case AllocationPurpose::RWData:
+ return RWDataMem;
+ }
+ llvm_unreachable("Unknown SectionMemoryManager::AllocationPurpose");
+ }();
+
+ // Look in the list of free memory regions and use a block there if one
+ // is available.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ if (FreeMB.Free.allocatedSize() >= RequiredSize) {
+ Addr = (uintptr_t)FreeMB.Free.base();
+ uintptr_t EndOfBlock = Addr + FreeMB.Free.allocatedSize();
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ if (FreeMB.PendingPrefixIndex == (unsigned)-1) {
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // Remember this pending block, such that future allocations can just
+ // modify it rather than creating a new one
+ FreeMB.PendingPrefixIndex = MemGroup.PendingMem.size() - 1;
+ } else {
+ sys::MemoryBlock &PendingMB =
+ MemGroup.PendingMem[FreeMB.PendingPrefixIndex];
+ PendingMB = sys::MemoryBlock(PendingMB.base(),
+ Addr + Size - (uintptr_t)PendingMB.base());
+ }
+
+ // Remember how much free space is now left in this block
+ FreeMB.Free =
+ sys::MemoryBlock((void *)(Addr + Size), EndOfBlock - Addr - Size);
+ return (uint8_t *)Addr;
+ }
+ }
+
+ // No pre-allocated free block was large enough. Allocate a new memory region.
+ // Note that all sections get allocated as read-write. The permissions will
+ // be updated later based on memory group.
+ //
+ // FIXME: It would be useful to define a default allocation size (or add
+ // it as a constructor parameter) to minimize the number of allocations.
+ //
+ // FIXME: Initialize the Near member for each memory group to avoid
+ // interleaving.
+ std::error_code ec;
+ sys::MemoryBlock MB = MMapper->allocateMappedMemory(
+ Purpose, RequiredSize, &MemGroup.Near,
+ sys::Memory::MF_READ | sys::Memory::MF_WRITE, ec);
+ if (ec) {
+ // FIXME: Add error propagation to the interface.
+ return nullptr;
+ }
+
+ // Save this address as the basis for our next request
+ MemGroup.Near = MB;
+
+ // Copy the address to all the other groups, if they have not
+ // been initialized.
+ if (CodeMem.Near.base() == nullptr)
+ CodeMem.Near = MB;
+ if (RODataMem.Near.base() == nullptr)
+ RODataMem.Near = MB;
+ if (RWDataMem.Near.base() == nullptr)
+ RWDataMem.Near = MB;
+
+ // Remember that we allocated this memory
+ MemGroup.AllocatedMem.push_back(MB);
+ Addr = (uintptr_t)MB.base();
+ uintptr_t EndOfBlock = Addr + MB.allocatedSize();
+
+ // Align the address.
+ Addr = (Addr + Alignment - 1) & ~(uintptr_t)(Alignment - 1);
+
+ // The part of the block we're giving out to the user is now pending
+ MemGroup.PendingMem.push_back(sys::MemoryBlock((void *)Addr, Size));
+
+ // The allocateMappedMemory may allocate much more memory than we need. In
+ // this case, we store the unused memory as a free memory block.
+ unsigned FreeSize = EndOfBlock - Addr - Size;
+ if (FreeSize > 16) {
+ FreeMemBlock FreeMB;
+ FreeMB.Free = sys::MemoryBlock((void *)(Addr + Size), FreeSize);
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ MemGroup.FreeMem.push_back(FreeMB);
+ }
+
+ // Return aligned address
+ return (uint8_t *)Addr;
+}
+
+bool SectionMemoryManager::finalizeMemory(std::string *ErrMsg) {
+ // FIXME: Should in-progress permissions be reverted if an error occurs?
+ std::error_code ec;
+
+ // Make code memory executable.
+ ec = applyMemoryGroupPermissions(CodeMem,
+ sys::Memory::MF_READ | sys::Memory::MF_EXEC);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Make read-only data memory read-only.
+ ec = applyMemoryGroupPermissions(RODataMem, sys::Memory::MF_READ);
+ if (ec) {
+ if (ErrMsg) {
+ *ErrMsg = ec.message();
+ }
+ return true;
+ }
+
+ // Read-write data memory already has the correct permissions
+
+ // Some platforms with separate data cache and instruction cache require
+ // explicit cache flush, otherwise JIT code manipulations (like resolved
+ // relocations) will get to the data cache but not to the instruction cache.
+ invalidateInstructionCache();
+
+ return false;
+}
+
+static sys::MemoryBlock trimBlockToPageSize(sys::MemoryBlock M) {
+ static const size_t PageSize = sys::Process::getPageSizeEstimate();
+
+ size_t StartOverlap =
+ (PageSize - ((uintptr_t)M.base() % PageSize)) % PageSize;
+
+ size_t TrimmedSize = M.allocatedSize();
+ TrimmedSize -= StartOverlap;
+ TrimmedSize -= TrimmedSize % PageSize;
+
+ sys::MemoryBlock Trimmed((void *)((uintptr_t)M.base() + StartOverlap),
+ TrimmedSize);
+
+ assert(((uintptr_t)Trimmed.base() % PageSize) == 0);
+ assert((Trimmed.allocatedSize() % PageSize) == 0);
+ assert(M.base() <= Trimmed.base() &&
+ Trimmed.allocatedSize() <= M.allocatedSize());
+
+ return Trimmed;
+}
+
+std::error_code
+SectionMemoryManager::applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions) {
+ for (sys::MemoryBlock &MB : MemGroup.PendingMem)
+ if (std::error_code EC = MMapper->protectMappedMemory(MB, Permissions))
+ return EC;
+
+ MemGroup.PendingMem.clear();
+
+ // Now go through free blocks and trim any of them that don't span the entire
+ // page because one of the pending blocks may have overlapped it.
+ for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
+ FreeMB.Free = trimBlockToPageSize(FreeMB.Free);
+ // We cleared the PendingMem list, so all these pointers are now invalid
+ FreeMB.PendingPrefixIndex = (unsigned)-1;
+ }
+
+ // Remove all blocks which are now empty
+ erase_if(MemGroup.FreeMem, [](FreeMemBlock &FreeMB) {
+ return FreeMB.Free.allocatedSize() == 0;
+ });
+
+ return std::error_code();
+}
+
+void SectionMemoryManager::invalidateInstructionCache() {
+ for (sys::MemoryBlock &Block : CodeMem.PendingMem)
+ sys::Memory::InvalidateInstructionCache(Block.base(),
+ Block.allocatedSize());
+}
+
+SectionMemoryManager::~SectionMemoryManager() {
+ for (MemoryGroup *Group : {&CodeMem, &RWDataMem, &RODataMem}) {
+ for (sys::MemoryBlock &Block : Group->AllocatedMem)
+ MMapper->releaseMappedMemory(Block);
+ }
+}
+
+SectionMemoryManager::MemoryMapper::~MemoryMapper() = default;
+
+void SectionMemoryManager::anchor() {}
+
+namespace {
+// Trivial implementation of SectionMemoryManager::MemoryMapper that just calls
+// into sys::Memory.
+class DefaultMMapper final : public SectionMemoryManager::MemoryMapper {
+public:
+ sys::MemoryBlock
+ allocateMappedMemory(SectionMemoryManager::AllocationPurpose Purpose,
+ size_t NumBytes, const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) override {
+ return sys::Memory::allocateMappedMemory(NumBytes, NearBlock, Flags, EC);
+ }
+
+ std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) override {
+ return sys::Memory::protectMappedMemory(Block, Flags);
+ }
+
+ std::error_code releaseMappedMemory(sys::MemoryBlock &M) override {
+ return sys::Memory::releaseMappedMemory(M);
+ }
+};
+} // namespace
+
+SectionMemoryManager::SectionMemoryManager(MemoryMapper *UnownedMM,
+ bool ReserveAlloc)
+ : MMapper(UnownedMM), OwnedMMapper(nullptr),
+ ReserveAllocation(ReserveAlloc) {
+ if (!MMapper) {
+ OwnedMMapper = std::make_unique<DefaultMMapper>();
+ MMapper = OwnedMMapper.get();
+ }
+}
+
+} // namespace backport
+} // namespace llvm
+
+#endif
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index f346af9c755..72411d0f805 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -34,6 +34,7 @@
#endif
#include "jit/llvmjit.h"
+#include "jit/llvmjit_backport.h"
#include "jit/llvmjit_emit.h"
#include "miscadmin.h"
#include "portability/instr_time.h"
@@ -1166,8 +1167,14 @@ llvm_log_jit_error(void *ctx, LLVMErrorRef error)
static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES, const char *Triple)
{
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+ LLVMOrcObjectLayerRef objlayer =
+ LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
+#else
LLVMOrcObjectLayerRef objlayer =
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
+#endif
+
#if defined(HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER) && HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
if (jit_debugging_support)
diff --git a/src/backend/jit/llvm/llvmjit_wrap.cpp b/src/backend/jit/llvm/llvmjit_wrap.cpp
index 7f7623dac64..9e8d63f0122 100644
--- a/src/backend/jit/llvm/llvmjit_wrap.cpp
+++ b/src/backend/jit/llvm/llvmjit_wrap.cpp
@@ -20,6 +20,14 @@ extern "C"
#include <llvm/IR/Function.h>
#include "jit/llvmjit.h"
+#include "jit/llvmjit_backport.h"
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
+#include <llvm/ExecutionEngine/SectionMemoryManager.h>
+#include "jit/SectionMemoryManager.h"
+#include <llvm/Support/CBindingWrapping.h>
+#endif
/*
@@ -37,3 +45,15 @@ LLVMGetFunctionType(LLVMValueRef r)
{
return llvm::wrap(llvm::unwrap<llvm::Function>(r)->getFunctionType());
}
+
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
+DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef);
+
+LLVMOrcObjectLayerRef
+LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES)
+{
+ return wrap(new llvm::orc::RTDyldObjectLinkingLayer(
+ *unwrap(ES), [] { return std::make_unique<llvm::backport::SectionMemoryManager>(nullptr, true); }));
+}
+#endif
diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build
index c1b6a9363eb..ce49043f1d0 100644
--- a/src/backend/jit/llvm/meson.build
+++ b/src/backend/jit/llvm/meson.build
@@ -14,6 +14,7 @@ llvmjit_sources += files(
'llvmjit_error.cpp',
'llvmjit_inline.cpp',
'llvmjit_wrap.cpp',
+ 'SectionMemoryManager.cpp',
)
# Code generation
diff --git a/src/include/jit/SectionMemoryManager.h b/src/include/jit/SectionMemoryManager.h
new file mode 100644
index 00000000000..93cf9771570
--- /dev/null
+++ b/src/include/jit/SectionMemoryManager.h
@@ -0,0 +1,226 @@
+/*
+ * This is a copy of LLVM source code modified by the PostgreSQL project.
+ * See SectionMemoryManager.cpp for notes on provenance and license.
+ */
+
+//===- SectionMemoryManager.h - Memory manager for MCJIT/RtDyld -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declaration of a section-based memory manager used by
+// the MCJIT execution engine and RuntimeDyld.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+#define LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
+#include "llvm/Support/Alignment.h"
+#include "llvm/Support/Memory.h"
+#include <cstdint>
+#include <string>
+#include <system_error>
+
+namespace llvm {
+namespace backport {
+
+/// This is a simple memory manager which implements the methods called by
+/// the RuntimeDyld class to allocate memory for section-based loading of
+/// objects, usually those generated by the MCJIT execution engine.
+///
+/// This memory manager allocates all section memory as read-write. The
+/// RuntimeDyld will copy JITed section memory into these allocated blocks
+/// and perform any necessary linking and relocations.
+///
+/// Any client using this memory manager MUST ensure that section-specific
+/// page permissions have been applied before attempting to execute functions
+/// in the JITed object. Permissions can be applied either by calling
+/// MCJIT::finalizeObject or by calling SectionMemoryManager::finalizeMemory
+/// directly. Clients of MCJIT should call MCJIT::finalizeObject.
+class SectionMemoryManager : public RTDyldMemoryManager {
+public:
+ /// This enum describes the various reasons to allocate pages from
+ /// allocateMappedMemory.
+ enum class AllocationPurpose {
+ Code,
+ ROData,
+ RWData,
+ };
+
+ /// Implementations of this interface are used by SectionMemoryManager to
+ /// request pages from the operating system.
+ class MemoryMapper {
+ public:
+ /// This method attempts to allocate \p NumBytes bytes of virtual memory for
+ /// \p Purpose. \p NearBlock may point to an existing allocation, in which
+ /// case an attempt is made to allocate more memory near the existing block.
+ /// The actual allocated address is not guaranteed to be near the requested
+ /// address. \p Flags is used to set the initial protection flags for the
+ /// block of the memory. \p EC [out] returns an object describing any error
+ /// that occurs.
+ ///
+ /// This method may allocate more than the number of bytes requested. The
+ /// actual number of bytes allocated is indicated in the returned
+ /// MemoryBlock.
+ ///
+ /// The start of the allocated block must be aligned with the system
+ /// allocation granularity (64K on Windows, page size on Linux). If the
+ /// address following \p NearBlock is not so aligned, it will be rounded up
+ /// to the next allocation granularity boundary.
+ ///
+ /// \r a non-null MemoryBlock if the function was successful, otherwise a
+ /// null MemoryBlock with \p EC describing the error.
+ virtual sys::MemoryBlock
+ allocateMappedMemory(AllocationPurpose Purpose, size_t NumBytes,
+ const sys::MemoryBlock *const NearBlock,
+ unsigned Flags, std::error_code &EC) = 0;
+
+ /// This method sets the protection flags for a block of memory to the state
+ /// specified by \p Flags. The behavior is not specified if the memory was
+ /// not allocated using the allocateMappedMemory method.
+ /// \p Block describes the memory block to be protected.
+ /// \p Flags specifies the new protection state to be assigned to the block.
+ ///
+ /// If \p Flags is MF_WRITE, the actual behavior varies with the operating
+ /// system (i.e. MF_READ | MF_WRITE on Windows) and the target architecture
+ /// (i.e. MF_WRITE -> MF_READ | MF_WRITE on i386).
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code protectMappedMemory(const sys::MemoryBlock &Block,
+ unsigned Flags) = 0;
+
+ /// This method releases a block of memory that was allocated with the
+ /// allocateMappedMemory method. It should not be used to release any memory
+ /// block allocated any other way.
+ /// \p Block describes the memory to be released.
+ ///
+ /// \r error_success if the function was successful, or an error_code
+ /// describing the failure if an error occurred.
+ virtual std::error_code releaseMappedMemory(sys::MemoryBlock &M) = 0;
+
+ virtual ~MemoryMapper();
+ };
+
+ /// Creates a SectionMemoryManager instance with \p MM as the associated
+ /// memory mapper. If \p MM is nullptr then a default memory mapper is used
+ /// that directly calls into the operating system.
+ ///
+ /// If \p ReserveAlloc is true all memory will be pre-allocated, and any
+ /// attempts to allocate beyond pre-allocated memory will fail.
+ SectionMemoryManager(MemoryMapper *MM = nullptr, bool ReserveAlloc = false);
+ SectionMemoryManager(const SectionMemoryManager &) = delete;
+ void operator=(const SectionMemoryManager &) = delete;
+ ~SectionMemoryManager() override;
+
+ /// Enable reserveAllocationSpace when requested.
+ bool needsToReserveAllocationSpace() override { return ReserveAllocation; }
+
+ /// Implements allocating all memory in a single block. This is required to
+ /// limit memory offsets to fit the ARM ABI; large memory systems may
+ /// otherwise allocate separate sections too far apart.
+#if LLVM_VERSION_MAJOR < 16
+ virtual void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
+ uintptr_t RODataSize,
+ uint32_t RODataAlign,
+ uintptr_t RWDataSize,
+ uint32_t RWDataAlign) override;
+#else
+ void reserveAllocationSpace(uintptr_t CodeSize, Align CodeAlign,
+ uintptr_t RODataSize, Align RODataAlign,
+ uintptr_t RWDataSize, Align RWDataAlign) override;
+#endif
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// executable code.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateCodeSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID,
+ StringRef SectionName) override;
+
+ /// Allocates a memory block of (at least) the given size suitable for
+ /// data, read-only or read-write per \p isReadOnly.
+ ///
+ /// The value of \p Alignment must be a power of two. If \p Alignment is zero
+ /// a default alignment of 16 will be used.
+ uint8_t *allocateDataSection(uintptr_t Size, unsigned Alignment,
+ unsigned SectionID, StringRef SectionName,
+ bool isReadOnly) override;
+
+ /// Update section-specific memory permissions and other attributes.
+ ///
+ /// This method is called when object loading is complete and section page
+ /// permissions can be applied. It is up to the memory manager implementation
+ /// to decide whether or not to act on this method. The memory manager will
+ /// typically allocate all sections as read-write and then apply specific
+ /// permissions when this method is called. Code sections cannot be executed
+ /// until this function has been called. In addition, any cache coherency
+ /// operations needed to reliably use the memory are also performed.
+ ///
+ /// \returns true if an error occurred, false otherwise.
+ bool finalizeMemory(std::string *ErrMsg = nullptr) override;
+
+ /// Invalidate instruction cache for code sections.
+ ///
+ /// Some platforms with separate data cache and instruction cache require
+ /// explicit cache flush, otherwise JIT code manipulations (like resolved
+ /// relocations) will get to the data cache but not to the instruction cache.
+ ///
+ /// This method is called from finalizeMemory.
+ virtual void invalidateInstructionCache();
+
+private:
+ struct FreeMemBlock {
+ // The actual block of free memory
+ sys::MemoryBlock Free;
+ // If there is a pending allocation from the same reservation right before
+ // this block, store its index in PendingMem, to be able to update the
+ // pending region if part of this block is allocated, rather than having to
+ // create a new one
+ unsigned PendingPrefixIndex;
+ };
+
+ struct MemoryGroup {
+ // PendingMem contains all blocks of memory (subblocks of AllocatedMem)
+ // which have not yet had their permissions applied, but have been given
+ // out to the user. FreeMem contains all blocks of memory that have
+ // neither had their permissions applied, nor been given out to the user.
+ SmallVector<sys::MemoryBlock, 16> PendingMem;
+ SmallVector<FreeMemBlock, 16> FreeMem;
+
+ // All memory blocks that have been requested from the system
+ SmallVector<sys::MemoryBlock, 16> AllocatedMem;
+
+ sys::MemoryBlock Near;
+ };
+
+ uint8_t *allocateSection(AllocationPurpose Purpose, uintptr_t Size,
+ unsigned Alignment);
+
+ std::error_code applyMemoryGroupPermissions(MemoryGroup &MemGroup,
+ unsigned Permissions);
+
+ bool hasSpace(const MemoryGroup &MemGroup, uintptr_t Size) const;
+
+ void anchor() override;
+
+ MemoryGroup CodeMem;
+ MemoryGroup RWDataMem;
+ MemoryGroup RODataMem;
+ MemoryMapper *MMapper;
+ std::unique_ptr<MemoryMapper> OwnedMMapper;
+ bool ReserveAllocation;
+};
+
+} // end namespace backport
+} // end namespace llvm
+
+#endif // LLVM_EXECUTIONENGINE_BACKPORT_SECTIONMEMORYMANAGER_H
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 420775b1899..dc0d86f4c1d 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -17,7 +17,12 @@
*/
#ifdef USE_LLVM
+#include "jit/llvmjit_backport.h"
+
#include <llvm-c/Types.h>
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#include <llvm-c/OrcEE.h>
+#endif
/*
@@ -135,6 +140,9 @@ extern LLVMValueRef slot_compile_deform(struct LLVMJitContext *context, TupleDes
*/
extern LLVMTypeRef LLVMGetFunctionReturnType(LLVMValueRef r);
extern LLVMTypeRef LLVMGetFunctionType(LLVMValueRef r);
+#ifdef USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+extern LLVMOrcObjectLayerRef LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(LLVMOrcExecutionSessionRef ES);
+#endif
#ifdef __cplusplus
} /* extern "C" */
diff --git a/src/include/jit/llvmjit_backport.h b/src/include/jit/llvmjit_backport.h
new file mode 100644
index 00000000000..92874f7998c
--- /dev/null
+++ b/src/include/jit/llvmjit_backport.h
@@ -0,0 +1,25 @@
+/*
+ * A small header that can be included by backported LLVM code or PostgreSQL
+ * code, to control conditional compilation.
+ */
+#ifndef LLVMJIT_BACKPORT_H
+#define LLVMJIT_BACKPORT_H
+
+#include <llvm/Config/llvm-config.h>
+
+/*
+ * LLVM's RuntimeDyld can produce code that crashes on larger memory ARM
+ * systems, because llvm::SectionMemoryManager allocates multiple pieces of
+ * memory that can be placed too far apart for the generated code. See
+ * src/backend/jit/llvm/SectionMemoryManager.cpp for the patched replacement
+ * class llvm::backport::SectionMemoryManager that we use as a workaround.
+ * This header controls whether we use it.
+ *
+ * We have adjusted it to compile against a range of LLVM versions, but not
+ * further back than 12 for now.
+ */
+#if defined(__aarch64__) && LLVM_VERSION_MAJOR > 11
+#define USE_LLVM_BACKPORT_SECTION_MEMORY_MANAGER
+#endif
+
+#endif
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 3fc737d2cc1..fb91b5760c4 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -168,6 +168,9 @@ do
# This produces a "no previous prototype" warning.
! $cplusplus && test "$f" = src/include/storage/checksum_impl.h && continue
+ # SectionMemoryManager.h is C++
+ test "$f" = src/include/jit/SectionMemoryManager.h && continue
+
# ppport.h is not under our control, so we can't make it standalone.
test "$f" = src/pl/plperl/ppport.h && continue
diff --git a/src/tools/pgindent/exclude_file_patterns b/src/tools/pgindent/exclude_file_patterns
index 99e57f93bbc..68aae9b97a4 100644
--- a/src/tools/pgindent/exclude_file_patterns
+++ b/src/tools/pgindent/exclude_file_patterns
@@ -4,8 +4,9 @@
src/include/storage/s_lock\.h$
src/include/port/atomics/
#
-# This contains C++ constructs that confuse pgindent.
+# These contain C++ constructs that confuse pgindent.
src/include/jit/llvmjit\.h$
+src/include/jit/SectionMemoryManager\.h$
#
# These are generated files with incomplete code fragments that
# confuse pgindent.
--
2.46.2
On Tue, Nov 5, 2024 at 9:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Reasoning: Old LLVM required C++11. LLVM 9 switched to C++14. LLVM
14 switched to C++17. Pretty soon they'll flip to C++20 or C++23; they
don't mess around. The corresponding -std=c++XX flag finishes up in
our compile lines, because llvm-config --cxxflags spits it out, to
match the features they're using in headers that we include (easy to
spot examples being std::make_unique (C++14) and std::string_view
(C++17)), so you might say that PostgreSQL indirectly chases C++
standards much faster than it chases C standards. This particular
code is a special case because it's guarded for LLVM 12+ only, so it's
OK to use C++14 in that limited context even in back branches. We
have to be careful that it doesn't contain C++17 code since it came
from recent LLVM, but it doesn't seem to by inspection, and you can
check on a machine with CXX=g++ and LLVM 14 on Linux, which uses
-std=c++14 and fails if you add a use of <string_view> and
std::string_view. (Warning: the system C++ standard library on Macs
and other Clang-based systems doesn't have enough version guards so it
won't complain, but GCC and its standard library will explicitly tell
you not to use C++17 features in a C++14 program.)
I think the switch to C++14 happened with LLVM 10[0] while the C++17
switch happened with LLVM 16[1]. Double-checking on an Ubuntu Jammy host
(LLVM versions earlier than 12 can't be installed there):
VERSIONS=(20 19 18 17 16 15 14 13 12)
for version in "${VERSIONS[@]}"; do
llvm-config-$version --cxxflags
done
-I/usr/lib/llvm-20/include -std=c++17 -fno-exceptions
-funwind-tables -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS
-D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-I/usr/lib/llvm-19/include -std=c++17 -fno-exceptions
-funwind-tables -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS
-D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-I/usr/lib/llvm-18/include -std=c++17 -fno-exceptions
-funwind-tables -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS
-D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-I/usr/lib/llvm-17/include -std=c++17 -fno-exceptions
-funwind-tables -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS
-D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-I/usr/lib/llvm-16/include -std=c++17 -fno-exceptions -D_GNU_SOURCE
-D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-I/usr/lib/llvm-15/include -std=c++14 -fno-exceptions -D_GNU_SOURCE
-D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-I/usr/lib/llvm-14/include -std=c++14 -fno-exceptions -D_GNU_SOURCE
-D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-I/usr/lib/llvm-13/include -std=c++14 -fno-exceptions -D_GNU_SOURCE
-D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-I/usr/lib/llvm-12/include -std=c++14 -fno-exceptions -D_GNU_SOURCE
-D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
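The flag itself can be pulled out mechanically; a small sketch operating on the captured output above (the flags_* strings are copies of two of the lines, not a live llvm-config call):

```shell
# Extract the -std= flag from saved llvm-config --cxxflags output.
# The two strings below are copied from the captured output above.
flags_16='-I/usr/lib/llvm-16/include -std=c++17 -fno-exceptions -D_GNU_SOURCE'
flags_15='-I/usr/lib/llvm-15/include -std=c++14 -fno-exceptions -D_GNU_SOURCE'
for f in "$flags_16" "$flags_15"; do
    echo "$f" | tr ' ' '\n' | grep '^-std='
done
```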
Which is still fine since, as you said, the code only applies to
LLVM 12+, which will always have at least C++14. I've tried compiling
and running against all those LLVM versions and the associated libstdc++
versions earlier in the thread[2] and had no issues.
If there are no further comments, I'm going to push this to all
branches tomorrow morning. For master only, I will remove the #if
condition and comment about LLVM 12+, as we now require 14+.
Patch looks good to me.
[0]: https://github.com/llvm/llvm-project/blob/release/10.x/llvm/CMakeLists.txt#L53
[1]: https://github.com/llvm/llvm-project/blob/release/16.x/llvm/CMakeLists.txt#L70
[2]: /messages/by-id/CAO6_XqqxEQ=JY+tYO-KQn3_pKQ3O-mPojcwG54L5eptiu1cSJQ@mail.gmail.com