PostgreSQL 17 Segmentation Fault

Started by Cameron Vogt, over 1 year ago · 9 messages · bugs
#1Cameron Vogt
cvogt@automaticcontrols.net

I recently upgraded a database from PostgreSQL 11.22 to 17.0. One of my queries, which ran fine in PostgreSQL 11, is now causing a segmentation fault in 17. The log file contains this message: "server process (PID 7635) was terminated by signal 11: Segmentation fault","Failed process was running: SELECT DISTINCT ON (""tasks"".""gid""). I tried deleting pieces of the query and rerunning it to see if I could determine which part causes the fault. I attached the minimal query I came up with. If I delete any random piece of the minimal query, even just a single line like line #2, it suddenly works again. I have no idea what is going on, but I think it is a bug.

Cameron Vogt | Software Developer

Direct:314-756-2302<tel:314-756-2302> | Cell: 636-388-2050<tel:636-388-2050>
cvogt@automaticcontrols.net

1585 Fencorp Drive | Fenton, Missouri 63026

Attachments:

seg_fault.sql (application/octet-stream)
#2Sean Massey
sean.f.massey@gmail.com
In reply to: Cameron Vogt (#1)
Re: PostgreSQL 17 Segmentation Fault

Can you reproduce this with a small subset of schema and data on another
installation of 17? Could you share the DDL to reproduce this failure case?

Can you capture the core dump?
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
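In case it helps, a minimal sketch of the procedure on a systemd-based Linux box (the service name and gdb commands are illustrative; the wiki page above has the full details):

```shell
# Assumes systemd-coredump is installed; adjust for your distro.
# 1. Make sure core dumps aren't size-limited in the backend's environment:
ulimit -c unlimited
# 2. Run the crashing query, then list the captured dumps:
coredumpctl list postgres
# 3. Load the newest dump into gdb and print a full backtrace:
coredumpctl gdb postgres
# (gdb) bt full
```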

On Thu, 3 Oct 2024 at 04:16, Cameron Vogt <cvogt@automaticcontrols.net>
wrote:

#3Michael Paquier
michael@paquier.xyz
In reply to: Sean Massey (#2)
Re: PostgreSQL 17 Segmentation Fault

On Thu, Oct 03, 2024 at 06:08:14AM +1000, Sean Massey wrote:

Can you reproduce this with a small subset of schema and data on another
installation of 17? Could you share the DDL to reproduce this failure case?

Can you capture the core dump?
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

That may be some of the recent planner changes, but hard to say. An
EXPLAIN output may also offer some hints on top of what Sean is
mentioning.
--
Michael

#4Cameron Vogt
cvogt@automaticcontrols.net
In reply to: Michael Paquier (#3)
Re: PostgreSQL 17 Segmentation Fault

I managed to get a GDB backtrace of the crash by following the instructions in the link you sent. The backtrace is in the attached zip file, along with the output of EXPLAIN on the query and some additional information.

I figured out that the error goes away when I decrease shared_buffers in postgresql.conf from 4GB to 3GB. I don't have access to another Linux server with enough RAM to try increasing shared_buffers like this on a fresh installation. Instead, I dumped my data, deleted the PostgreSQL cluster, created a new cluster, and restored my data. After restoring to a fresh cluster, I still get the same error. As a temporary workaround, I decreased shared_buffers.

Since the data relevant to the crashing query belongs to my company, I would rather not share it here. If necessary, I can try sending a subset of the data directly to someone. Let me know if there's anything else I can do to help debug the issue.

Thank you,

Cameron Vogt | Software Developer
Direct: 314-756-2302 | Cell: 636-388-2050
1585 Fencorp Drive | Fenton, MO 63026
Automatic Controls Equipment Systems, Inc.

Attachments:

bug.zip (application/x-zip-compressed)
#5Cameron Vogt
cvogt@automaticcontrols.net
In reply to: Cameron Vogt (#4)
Re: PostgreSQL 17 Segmentation Fault

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Cameron Vogt (#4)
Re: PostgreSQL 17 Segmentation Fault

Hi,

Thanks for the provided information. Per the backtrace, the failure
happens in the LLVM JIT code in nestloop/seqscan, so it has to be in
this part of the plan:

->  Nested Loop  (cost=0.42..6074.84 rows=117 width=641)
    ->  Parallel Seq Scan on tasks__projects  (cost=0.00..2201.62 rows=745 width=16)
          Filter: (gid = '1138791545416725'::text)
    ->  Index Scan using tasks_pkey on tasks tasks_1  (cost=0.42..5.20 rows=1 width=102)
          Index Cond: (gid = tasks__projects._sdc_source_key_gid)
          Filter: ((NOT completed) AND (name <> ''::text))

It's not clear why this should consume a lot of memory, though. It's
possible the memory is consumed elsewhere, and this query is simply the
straw that breaks the camel's back ...

Presumably it takes a while for the query to consume a lot of memory and
crash - can you attach a debugger to it after it allocates a lot of
memory (but before the crash), and do this:

call MemoryContextStats(TopMemoryContext)

That should write memory context stats to the server log. Perhaps that
will tell us which part of the query allocates memory.

Next, try running the query with jit=off. If that resolves the problem,
maybe it's another JIT issue. But if it completes with lower shared
buffers, that doesn't seem likely.
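For example, JIT can be disabled for a single test session with no restart (the database name here is a placeholder; substitute your own, and run the crashing query in place of the \i):

```shell
# "mydb" and seg_fault.sql are placeholders for your database and query file.
psql -d mydb <<'SQL'
SET jit = off;          -- session-only; no effect on other backends
\i seg_fault.sql
SHOW jit;               -- confirm the setting took effect
SQL
```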

The plan has a bunch of hash joins. I wonder if that might be causing
issues, because the hash tables may be kept until the end of the query,
and each may be up to 64MB (you have work_mem=32MB, but there's also a
2x multiplier since PG13). The row estimates are pretty low, but could
it be that the real row counts are much higher? Did you run ANALYZE
after the upgrade? Maybe try with lower work_mem?

One last thing you should check is memory overcommit. Chances are it's
set just low enough for the query to hit it with shared_buffers=4GB, but
not with 3GB. In that case you may need to tune this a bit. See
/proc/meminfo and /proc/sys/vm/overcommit_*.
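Those settings can be inspected with something like this (Linux-only; the values will of course differ per system):

```shell
# Overcommit policy: 0 = heuristic, 1 = always allow, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory
# Strict-mode limit and currently committed address space:
grep -E '^(CommitLimit|Committed_AS)' /proc/meminfo
```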

regards

--
Tomas Vondra

#7Cameron Vogt
cvogt@automaticcontrols.net
In reply to: Tomas Vondra (#6)
Re: PostgreSQL 17 Segmentation Fault

The query crashes less than a second after I run it, so there isn't much time for it to consume memory or for me to attach GDB mid-query. I tried decreasing work_mem from 32MB to 128kB, but I still get the error. I've also run VACUUM and ANALYZE to no avail. When the query succeeds, it yields only 68 rows, so I don't think the row estimates are too far off. I checked the files you mentioned for memory overcommit:

/proc/sys/vm/overcommit_memory = 0
/proc/sys/vm/overcommit_kbytes = 0
/proc/sys/vm/overcommit_ratio = 50

The free RAM on the system starts at and hangs around 8GB while executing the crashing query.

Only two things have fixed the issue so far: turning JIT off or decreasing shared_buffers. I suppose then that it might be a JIT issue?

Cameron Vogt | Software Developer
Direct: 314-756-2302 | Cell: 636-388-2050
1585 Fencorp Drive | Fenton, MO 63026
Automatic Controls Equipment Systems, Inc.

#8Thomas Munro
thomas.munro@gmail.com
In reply to: Cameron Vogt (#7)
Re: PostgreSQL 17 Segmentation Fault

On Sat, Oct 5, 2024 at 11:30 AM Cameron Vogt
<cvogt@automaticcontrols.net> wrote:

I suppose then that it might be a JIT issue?

I see from your info.txt file that this is aarch64. Could it be an
instance of LLVM's ARM relocation bug[1]? I'm planning to push the
fix taken from the LLVM project soon. I have just been waiting to see
if a more polished version would land in LLVM's main branch first, but
I'm about to give up waiting for that so we get some testing time
in-tree before our next minor release.

[1]: /messages/by-id/CAO6_Xqr63qj=Sx7HY6ZiiQ6R_JbX+-p6sTPwDYwTWZjUmjsYBg@mail.gmail.com

#9Cameron Vogt
cvogt@automaticcontrols.net
In reply to: Thomas Munro (#8)
Re: PostgreSQL 17 Segmentation Fault

On 2024-10-05 01:40:17 Thomas Munro
<thomas.munro@gmail.com> wrote:

Could it be an instance of LLVM's ARM relocation bug?

After reading about the bug, I believe you are likely correct. That would explain the behavior I'm seeing with JIT and shared_buffers. When I migrated PostgreSQL versions, I also moved to a new aarch64 machine. The old machine was not aarch64, so that may explain the timing of the issue as well.

Cameron Vogt | Software Developer
Direct: 314-756-2302 | Cell: 636-388-2050
1585 Fencorp Drive | Fenton, MO 63026
Automatic Controls Equipment Systems, Inc.