LLVM JIT: any JIT-compiled query crashes (SIGILL) on a libLLVM 19 + ASAN build

Started by Henson Choi · 9 days ago · 7 messages · hackers

assam258@gmail.com

9 days ago

Hi hackers,

While looking into Andres Freund's note that cfbot is failing with crashes
inside the JIT on the Row Pattern Recognition patch [1], I found that the
crash is not specific to that patch at all: on the CI's AddressSanitizer
build with LLVM 19, any query that is pushed through the LLVM JIT code
generator crashes the backend with SIGILL. It reproduces on plain master
with a trivial aggregate, so I am reporting it as its own issue, separate
from that feature.

Minimal reproduction
--------------------

SET jit = on;
SET jit_above_cost = 0;
SET jit_optimize_above_cost = 0;
SET jit_inline_above_cost = 0;

SELECT count(*)
FROM (SELECT i, i * 2 + 1 AS x
FROM generate_series(1, 100000) i
WHERE i % 3 = 0) t;

Result:

server closed the connection unexpectedly
...
LOG: client backend (PID NNNNN) was terminated by signal 4: Illegal instruction

A postmaster (forked backend) is required to reproduce reliably; single-user
mode does not trip it. With jit = off the same query runs fine.

Environment
-----------

This is the cfbot Linux task environment:

- Debian Trixie, libLLVM 19.1
- CFLAGS = -O2 -ggdb -fno-sanitize-recover=all -fsanitize=address
- LDFLAGS = -fsanitize=address
- meson: -Dcassert=true -Dinjection_points=true --buildtype=debug
  -Dllvm=enabled (auto_features=disabled)

I reproduced this in a container that mirrors the CI configuration, and also
on a from-scratch build of plain upstream master
(89eafad297a9b01ad77cfc1ab93a433e0af894b0, "Fix tuple deforming with virtual
generated columns"), which contains no in-flight feature patches.

Backtrace
---------

The stack is corrupted at the crash, but with libLLVM debug info the top
frames resolve consistently to:

Program terminated with signal SIGILL, Illegal instruction.
#0 getUnsignedFromPrefixEncoding ()
at llvm/include/llvm/Support/Discriminator.h:34
#1 decodeDiscriminator ()
at llvm/lib/IR/DebugInfoMetadata.cpp:283

The crashing rip lands in the middle of a valid instruction
(decodeDiscriminator+48, the immediate byte of "and $0x1f,%r10d"), i.e. the
libLLVM code itself is intact and control flow was transferred into it at a
bad offset. The crash always lands at the same place, for every JIT-compiled
query, which suggests it is systematic rather than random corruption. It
surfaces in libLLVM's debug-info (discriminator) handling, and persists with
JIT inlining and optimization both disabled.

Reproducer patch
----------------

The attached patch adds a small "jit_crash" regression test that forces the
JIT compiler (jit on, all jit_*_above_cost set to 0) using a plain aggregate
over generate_series(). On a working installation it passes; on the broken
LLVM 19 + ASAN environment it crashes as above. I have also registered it in
the commitfest so cfbot exercises it directly.

References
----------

[1]: /messages/by-id/p7r5bekdbl2zcazid7agvfo2nfnq5bim2a5jkckqygld32n325@fctfp6ou6qnb

Thanks,
Henson Choi

Henson Choi

assam258@gmail.com

9 days ago

In reply to: Henson Choi (#1)

Re: LLVM JIT: any JIT-compiled query crashes (SIGILL) on a libLLVM 19 + ASAN build

Hi Andres,

Just to let you know — the CI run for this commitfest entry shows the
same crash independently on master as well, so this may not be an RPR
(Reported-Problem Reproduction) issue specific to the patch.

The identical crash occurs on a standalone test against master.

Thanks,
Henson

Henson Choi

assam258@gmail.com

9 days ago

In reply to: Henson Choi (#2)

Re: LLVM JIT: any JIT-compiled query crashes (SIGILL) on a libLLVM 19 + ASAN build

Correction to my previous message: "RPR" stands for Row Pattern
Recognition, not "Reported-Problem Reproduction".

Matheus Alcantara

matheusssilv97@gmail.com

8 days ago

In reply to: Henson Choi (#1)

Re: LLVM JIT: any JIT-compiled query crashes (SIGILL) on a libLLVM 19 + ASAN build

Hi,

On Tue Jun 9, 2026 at 11:09 PM -03, Henson Choi wrote:

> While looking into Andres Freund's note that cfbot is failing with crashes
> inside the JIT on the Row Pattern Recognition patch [1], I found that the
> crash is not specific to that patch at all: on the CI's AddressSanitizer
> build with LLVM 19, any query that is pushed through the LLVM JIT code
> generator crashes the backend with SIGILL. It reproduces on plain master
> with a trivial aggregate, so I am reporting it as its own issue, separate
> from that feature.

[ ... ]

I investigated this issue a bit and I may have found the root cause.

The meson build passes get_option('c_args') directly to the clang
command that generates llvmjit_types.bc and other bitcode files used for
inlining. When building with -fsanitize=address, the sanitizer
instrumentation may change struct layouts in the generated LLVM IR.

This causes a mismatch between the field indices the JIT code expects
(defined via FIELDNO_* macros, e.g., FIELDNO_EXPRSTATE_PARENT = 11) and
the actual struct layout in the bitcode. The result is an assertion
failure when accessing struct fields:

Assertion failed: (indexValid(N)), function getElementType
LLVMStructGetTypeAtIndex(StructTy=..., i=11)

IIUC the SIGILL crash in decodeDiscriminator() reported initially was
likely a secondary effect of this struct layout corruption.

I think that the fix is to filter out sanitizer flags when generating
bitcode for the JIT code, see the attached diff.
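
As a rough illustration of the idea only (this is not the attached diff,
which changes the meson build files), the filtering could look like the
following Python sketch. The function name strip_sanitizer_flags and the
prefix list are assumptions made for this example; the input is the CFLAGS
value from the cfbot environment listed in the original report.

```python
def strip_sanitizer_flags(c_args):
    """Return c_args without sanitizer-related compiler flags."""
    # Assumption: sanitizer options can be recognised by these prefixes.
    sanitizer_prefixes = ("-fsanitize", "-fno-sanitize")
    return [arg for arg in c_args if not arg.startswith(sanitizer_prefixes)]


# CFLAGS as listed for the cfbot Linux task in the original report.
cfbot_cflags = ["-O2", "-ggdb", "-fno-sanitize-recover=all", "-fsanitize=address"]

# Flags that would be passed to clang when emitting the JIT bitcode.
bitcode_cflags = strip_sanitizer_flags(cfbot_cflags)
print(bitcode_cflags)  # ['-O2', '-ggdb']
```

The backend itself is still compiled with the full flag list; only the
bitcode generation step would see the filtered one.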

With this fix, JIT works correctly under ASAN + LLVM 19 on my machine.
Can you please also test it on your side?

I'm also wondering if this happens only with LLVM 19 or other versions
too.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Henson Choi

assam258@gmail.com

7 days ago

In reply to: Matheus Alcantara (#4)

Re: LLVM JIT: any JIT-compiled query crashes (SIGILL) on a libLLVM 19 + ASAN build

Hi Matheus,

Thanks for digging into this, and for the patch!

> I think that the fix is to filter out sanitizer flags when generating
> bitcode for the JIT code [...]
> With this fix, JIT works correctly under ASAN + LLVM 19 on my machine.

Confirmed here too: with your filter applied the crash is gone and the JIT
runs normally under ASAN. Filtering the sanitizer flags out of the
bitcode is the right fix.

> the sanitizer instrumentation may change struct layouts in the generated
> LLVM IR [...] FIELDNO_EXPRSTATE_PARENT = 11 [...]

One nit: on libLLVM 20.1.8 the bitcode struct layout is identical with and
without -fsanitize=address (e.g. %struct.ExprState, index 11 stays a
pointer), so it isn't a FIELDNO/layout mismatch here. In short, the crash
needs debug info (-ggdb) and sanitizer instrumentation to both land in the
JIT bitcode: the SIGILL is in decodeDiscriminator(), i.e. the instrumented
IR going through the debug-info path. Your fix keeps the debug info but
drops the instrumentation, and that alone stops it -- so the
instrumentation is the trigger. The LLVM 19 assertion is likely the same
cause surfacing differently.

> I'm also wondering if this happens only with LLVM 19 or other versions
> too.

Not LLVM 19 only -- I reproduced the same SIGILL on libLLVM 20.1.8.

v2 series attached, folding in your fix:

0001 Add a "jit" regression test (renamed/minimized from "jit_crash").
     jit is off by default now, so this turns it on to push a trivial
     query through the JIT provider.

0002 Your meson fix, with an added warning() so a sanitizer build knows
     its JIT code won't be instrumented. (Author: Matheus Alcantara.)

0003 Same for autoconf: filter sanitizer flags from
     BITCODE_CFLAGS/CXXFLAGS with a configure warning, plus -g under
     --enable-debug so the bitcode keeps debug info. The -g part is a
     judgment call -- autoconf just rebuilds BITCODE_CFLAGS from a
     whitelist that doesn't include -g -- so feel free to keep or drop
     it.

Tested on both build systems with an ASAN backend: the jit test crashes
before and passes after, JIT stays functional (pg_jit_available() = t,
EXPLAIN ANALYZE shows functions compiled), and the warning fires.

Thanks,
Henson

Matheus Alcantara

matheusssilv97@gmail.com

4 days ago

In reply to: Henson Choi (#5)

Re: LLVM JIT: any JIT-compiled query crashes (SIGILL) on a libLLVM 19 + ASAN build

On Thu Jun 11, 2026 at 10:48 PM -03, Henson Choi wrote:

>> I think that the fix is to filter out sanitizer flags when generating
>> bitcode for the JIT code [...]
>> With this fix, JIT works correctly under ASAN + LLVM 19 on my machine.
>
> Confirmed here too: with your filter applied the crash is gone and the JIT
> runs normally under ASAN. Filtering the sanitizer flags out of the
> bitcode is the right fix.

Thanks for confirming.

>> the sanitizer instrumentation may change struct layouts in the generated
>> LLVM IR [...] FIELDNO_EXPRSTATE_PARENT = 11 [...]
>
> One nit: on libLLVM 20.1.8 the bitcode struct layout is identical with and
> without -fsanitize=address (e.g. %struct.ExprState, index 11 stays a
> pointer), so it isn't a FIELDNO/layout mismatch here. In short, the crash
> needs debug info (-ggdb) and sanitizer instrumentation to both land in the
> JIT bitcode: the SIGILL is in decodeDiscriminator(), i.e. the instrumented
> IR going through the debug-info path. Your fix keeps the debug info but
> drops the instrumentation, and that alone stops it -- so the
> instrumentation is the trigger. The LLVM 19 assertion is likely the same
> cause surfacing differently.

OK, that makes sense.

>> I'm also wondering if this happens only with LLVM 19 or other versions
>> too.
>
> Not LLVM 19 only -- I reproduced the same SIGILL on libLLVM 20.1.8.
>
> v2 series attached, folding in your fix:
>
> 0001 Add a "jit" regression test (renamed/minimized from "jit_crash").
>      jit is off by default now, so this turns it on to push a trivial
>      query through the JIT provider.
>
> 0002 Your meson fix, with an added warning() so a sanitizer build knows
>      its JIT code won't be instrumented. (Author: Matheus Alcantara.)
>
> 0003 Same for autoconf: filter sanitizer flags from
>      BITCODE_CFLAGS/CXXFLAGS with a configure warning, plus -g under
>      --enable-debug so the bitcode keeps debug info. The -g part is a
>      judgment call -- autoconf just rebuilds BITCODE_CFLAGS from a
>      whitelist that doesn't include -g -- so feel free to keep or drop
>      it.
>
> Tested on both build systems with an ASAN backend: the jit test crashes
> before and passes after, JIT stays functional (pg_jit_available() = t,
> EXPLAIN ANALYZE shows functions compiled), and the warning fires.

Thanks. I think the patches look good, and I also think it's good to
have a JIT test case since JIT is off by default. I'm just wondering
whether the test patch should be 0003 instead of 0001, since it will
break CI if committed before the meson and autoconf changes.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Henson Choi

assam258@gmail.com

4 days ago

In reply to: Matheus Alcantara (#6)

Re: LLVM JIT: any JIT-compiled query crashes (SIGILL) on a libLLVM 19 + ASAN build

Hi Matheus,

> I think the patches look good, and I also think it's good to
> have a JIT test case since JIT is off by default. I'm just wondering
> whether the test patch should be 0003 instead of 0001, since it will
> break CI if committed before the meson and autoconf changes.

Agreed on both counts. The series started out to show that the JIT
crash seen under Row Pattern Recognition was not RPR's fault but a
generic ASAN + JIT bitcode + debug-info problem. Since JIT is off by
default, though, a dedicated test is worth having on its own, and I
think it makes a good seed for growing further coverage.

And you're right about the ordering: if the test goes in first it turns
JIT on while the bitcode is still instrumented, so any commit between the
test and the build-system fixes would crash CI. I've reordered the
series so the fixes land first:

0001 meson: exclude sanitizer flags from the JIT bitcode.
     (Author: Matheus Alcantara.)

0002 autoconf: same filtering for BITCODE_CFLAGS/CXXFLAGS, plus -g
     under --enable-debug so the bitcode keeps its debug info.

0003 Add the "jit" regression test, which turns JIT on to push a
     trivial query through the provider. This is the minimal case
     that crashes the backend with ASAN enabled, and it passes again
     once 0001/0002 are applied.

That way the build-system fixes precede the test at every point in the
history, so CI stays green wherever it is bisected.
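
The ordering argument can be checked with a small model. This is only a
sketch: the commit labels, the first_red_commit helper, and the rule that a
sanitizer CI run fails whenever the test is present without both
build-system fixes are simplifying assumptions for illustration, not output
from cfbot.

```python
def first_red_commit(series):
    """Return the label of the first commit at which CI would fail, or None.

    Each entry is (label, kind). Assumed rule: once the "test" commit is
    applied, CI is red unless both build-system fixes are already applied.
    """
    required_fixes = {"meson-fix", "autoconf-fix"}
    applied = set()
    for label, kind in series:
        applied.add(kind)
        if "test" in applied and not required_fixes <= applied:
            return label
    return None


# Ordering of the v2 series: test first, then the fixes.
v2_order = [
    ("0001 jit test", "test"),
    ("0002 meson fix", "meson-fix"),
    ("0003 autoconf fix", "autoconf-fix"),
]

# Reordered series: fixes first, test last.
v3_order = [
    ("0001 meson fix", "meson-fix"),
    ("0002 autoconf fix", "autoconf-fix"),
    ("0003 jit test", "test"),
]

print(first_red_commit(v2_order))  # 0001 jit test
print(first_red_commit(v3_order))  # None
```

Under that assumed rule, the old ordering is red from its first commit,
while the reordered series has no red commit.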

Thanks!
Henson
