BUG #19078: Segfaults in tts_minimal_store_tuple() following pg_upgrade
The following bug has been logged on the website:
Bug reference: 19078
Logged by: Yuri Zamyatin
Email address: yuri@yrz.am
PostgreSQL version: 18.0
Operating system: Debian 13.1
Description:
Hello. We are encountering segfaults from tts_minimal_store_tuple() after an
upgrade. The stack trace is at the end of this message.
Postgresql: PostgreSQL 18.0 (Debian 18.0-1.pgdg13+3) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 14.2.0-19) 14.2.0, 64-bit
Kernel: Linux 6.12.48+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.48-1
(2025-09-20) x86_64 GNU/Linux
OS: Debian 13.1 from deb.debian.org trixie, trixie-updates, trixie-security
(latest)
A PostgreSQL client backend crashes with a segfault (signal 11)
intermittently when executing a SELECT or UPDATE query, under the following
circumstances:
- There is a set of queries that run into the segfault. Notably, they all do
a lookup on a partitioned table with pruning (100+ partitions).
- It occurs across multiple machines (same OS and Postgres version) that
handle many connections and went through pg_upgrade.
- The interval between segfaults varies from dozens of minutes to days
depending on the size/load/configuration of the cluster.
- It happens randomly; most of the time these queries finish successfully, so
we're unable to reproduce the error in a consistent manner.
- Some of the problematic queries run on a fixed schedule, which means each
run is more likely to fail on larger clusters.
The issue appeared after migrating from pg17 (latest in pgdg) to pg18 (pgdg)
via pg_upgradecluster --method link.
Shortly before that, the OS was upgraded from Debian 12 to Debian 13 with the
corresponding change of pgdg apt sources.
The PostgreSQL 17 cluster was shut down during this time.
Right after the cluster upgrade we updated all extensions, ran vacuumdb
--analyze-in-stages, and reindexed all text-based indexes, as expected.
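In commands, that maintenance was roughly the following (a sketch; database,
extension and index names are placeholders):

# Post-upgrade maintenance, sketched; mydb/some_ext/some_text_idx are placeholders
vacuumdb --all --analyze-in-stages
psql -d mydb -c "ALTER EXTENSION some_ext UPDATE;"           # for each installed extension
psql -d mydb -c "REINDEX INDEX CONCURRENTLY some_text_idx;"  # for each text-based index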
Segmentation faults appeared after 1-5 days.
Trying to find a workaround, we did the following (a command sketch follows
the list):
- Disabled huge pages
- Reduced checkpoint_timeout from 60min to 5min, reduced max_wal_size
- Disabled jit
- Set io_method to sync (io_uring was much slower under our workload)
- Ran REINDEX SYSTEM in each database
- Reindexed all databases
- Ran pg_repack on tables (with their children) mentioned in the problematic
queries
- Ran pg_amcheck on each database with default parameters, no corruption was
found
- Disabled enable_hashagg for some queries (just now)
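Most of these can be applied with ALTER SYSTEM, along these lines (a sketch):

# Sketch of the workaround settings above; ALTER SYSTEM persists to postgresql.auto.conf
psql -c "ALTER SYSTEM SET huge_pages = off;"             # takes effect only after a restart
psql -c "ALTER SYSTEM SET checkpoint_timeout = '5min';"
psql -c "ALTER SYSTEM SET max_wal_size = '3GB';"
psql -c "ALTER SYSTEM SET jit = off;"
psql -c "ALTER SYSTEM SET io_method = 'sync';"           # takes effect only after a restart
psql -c "SELECT pg_reload_conf();"                       # applies the reloadable ones
psql -d mydb -c "REINDEX SYSTEM;"                        # repeated in each database; mydb is a placeholder
pg_amcheck --all                                         # default parameters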
Segmentation faults still happen on the same tables but less frequently.
On the cluster with 100+ concurrent connections, 225GB shared_buffers,
max_connections = 2000, and 256 CPUs, the number of crashes decreased from
about 30 to 8 per day.
The interval between segfaults may be related to checkpoint_timeout.
Previously, that server used to crash every 60 minutes; now the crashes come
in series with 5-10 minute gaps between them.
We could not reproduce the crash by invoking CHECKPOINT manually.
Below is the query with the simplest plan among those that crash the database
(when I relaunch that query). Segfaults from it happen rarely, though, and we
don't have a core dump for it yet.
Update on tcv_scenes cs (cost=1760.81..404518.63 rows=343 width=36)
(actual time=2358.546..2386.729 rows=1.00 loops=1)
Buffers: shared hit=19823 read=138644 dirtied=20
-> Nested Loop (cost=1760.81..404518.63 rows=343 width=36) (actual
time=2344.746..2372.927 rows=1.00 loops=1)
Buffers: shared hit=12241 read=138641 dirtied=1
-> Bitmap Heap Scan on tcv_scenes cs (cost=1760.39..209679.79
rows=346 width=38) (actual time=2344.280..2372.423 rows=1.00 loops=1)
Recheck Cond: ((state_id = 7) OR (state_id = 3))
Filter: (((state_id = 7) AND (date_cr < (now() -
'24:00:00'::interval)) AND (date_state_mo > (now() - '00:15:00'::interval)))
OR ((state_id = 3) AND (date_state_mo < (now() - '00:05:00'::interval))))
Rows Removed by Filter: 221134
Heap Blocks: exact=150638
Buffers: shared hit=12237 read=138641 dirtied=1
-> BitmapOr (cost=1760.39..1760.39 rows=210544 width=0)
(actual time=43.601..43.603 rows=0.00 loops=1)
Buffers: shared hit=218 read=22
-> Bitmap Index Scan on icv_scenes__state
(cost=0.00..1755.70 rows=210151 width=0) (actual time=34.418..34.419
rows=221112.00 loops=1)
Index Cond: (state_id = 7)
Index Searches: 1
Buffers: shared hit=194
-> Bitmap Index Scan on icv_scenes__state
(cost=0.00..4.51 rows=393 width=0) (actual time=9.181..9.181 rows=30759.00
loops=1)
Index Cond: (state_id = 3)
Index Searches: 1
Buffers: shared hit=24 read=22
-> Append (cost=0.42..560.73 rows=239 width=658) (actual
time=0.094..0.128 rows=1.00 loops=1)
Buffers: shared hit=4
-> Index Scan using tcv_scene_datas_0_pkey on
tcv_scene_datas_0 cd_1 (cost=0.42..2.32 rows=1 width=50) (never executed)
Index Cond: (cv_scene_id = cs.id)
Filter: (((cs.state_id = 7) AND (cs.date_cr < (now()
- '24:00:00'::interval)) AND (cs.date_state_mo > (now() -
'00:15:00'::interval)) AND ((stitcher_result)::text ~~ '%download%'::text))
OR ((cs.state_id = 3) AND (cs.date_state_mo < (now() -
'00:05:00'::interval))))
Index Searches: 0
...<100+ partitions>...
-> Index Scan using tcv_scene_datas_118500000_pkey on
tcv_scene_datas_118500000 cd_238 (cost=0.42..2.33 rows=1 width=1072)
(actual time=0.079..0.080 rows=1.00 loops=1)
Index Cond: (cv_scene_id = cs.id)
Filter: (((cs.state_id = 7) AND (cs.date_cr < (now()
- '24:00:00'::interval)) AND (cs.date_state_mo > (now() -
'00:15:00'::interval)) AND ((stitcher_result)::text ~~ '%download%'::text))
OR ((cs.state_id = 3) AND (cs.date_state_mo < (now() -
'00:05:00'::interval))))
Index Searches: 1
Buffers: shared hit=4
-> Seq Scan on tcv_scene_datas_119000000 cd_239
(cost=0.00..0.00 rows=1 width=50) (never executed)
Filter: ((cv_scene_id = cs.id) AND (((cs.state_id =
7) AND (cs.date_cr < (now() - '24:00:00'::interval)) AND (cs.date_state_mo >
(now() - '00:15:00'::interval)) AND ((stitcher_result)::text ~~
'%download%'::text)) OR ((cs.state_id = 3) AND (cs.date_state_mo < (now() -
'00:05:00'::interval)))))
Planning:
Buffers: shared hit=15775
Planning Time: 57.800 ms
Trigger for constraint tcv_scenes_new_state_id_fkey: time=0.965 calls=1
Execution Time: 2395.941 ms
(982 rows)
Segfaults occur more frequently from queries with complex plans (many levels
of aggregation, subqueries and window functions); the stack trace below comes
from one such query. We could not find a simple reproduction for it.
Overridden in postgresql.conf for that cluster:
postgresql_effective_cache_size = 560GB
postgresql_shared_buffers = 225GB
temp_buffers = 128MB
work_mem = 2GB
maintenance_work_mem = 512MB
vacuum_buffer_usage_limit = 128MB
max_connections = 2000
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
max_parallel_maintenance_workers = 8
max_locks_per_transaction = 128
huge_pages = off
io_method = sync
file_copy_method = clone
effective_io_concurrency = 512
random_page_cost = 1.0
temp_file_limit = 100GB
wal_level = minimal
max_wal_senders = 0
wal_buffers = 128MB
default_statistics_target = 1000
checkpoint_timeout = 5min
min_wal_size = 3GB
max_wal_size = 3GB
Server log:
2025-10-08 10:36:24 UTC LOG: 00000: client backend (PID 2380761) was
terminated by signal 11: Segmentation fault
2025-10-08 10:36:24 UTC DETAIL: Failed process was running: <query>
2025-10-08 10:36:24 UTC LOCATION: LogChildExit, postmaster.c:2853
2025-10-08 10:36:24 UTC LOG: 00000: terminating any other active
server processes
Dmesg:
[126364.743906] postgres[2380761]: segfault at 1b ip 0000555fe855f1c1 sp
00007ffe304155a0 error 4 in postgres[3531c1,555fe82f0000+5f3000] likely on
CPU 122 (core 58, socket 1)
[126364.743931] Code: c9 31 d2 4c 89 63 48 66 89 4b 34 89 c1 49 83 ec 08
66 89 43 04 83 c9 04 66 89 53 06 c7 43 30 ff ff ff ff c7 43 68 00 00 00 00
<41> 8b 74 24 08 45 84 ed 0f 45 c1 4c 89 63 60 8d 56 08 66 89 43 04
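A full backtrace like the one below can be pulled from the dump with
something like this (the binary path is the Debian pgdg default; the core
file path is a placeholder):

gdb --batch -ex 'bt full' /usr/lib/postgresql/18/bin/postgres /path/to/core.2380761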
Core:
#0 tts_minimal_store_tuple (slot=0x55601c765bb0, mtup=0x1b,
shouldFree=false) at ./build/../src/backend/executor/execTuples.c:697
mslot = 0x55601c765bb0
mslot = <optimized out>
#1 ExecStoreMinimalTuple (mtup=0x1b, slot=slot@entry=0x55601c765bb0,
shouldFree=shouldFree@entry=false) at
./build/../src/backend/executor/execTuples.c:1648
__func__ = "ExecStoreMinimalTuple"
__errno_location = <optimized out>
#2 0x0000555fe8566ec2 in agg_retrieve_hash_table_in_memory
(aggstate=aggstate@entry=0x55601c7567d0) at
./build/../src/include/executor/executor.h:176
hashslot = 0x55601c765bb0
hashtable = 0x55601c182ac8
i = <optimized out>
econtext = 0x55601c756f00
peragg = 0x55601c765198
pergroup = <optimized out>
entry = 0x55601c182e48
firstSlot = 0x55601c763e48
result = <optimized out>
perhash = 0x55601c764e50
#3 0x0000555fe8567ac8 in agg_retrieve_hash_table (aggstate=<optimized
out>) at ./build/../src/backend/executor/nodeAgg.c:2841
result = 0x0
result = <optimized out>
#4 ExecAgg (pstate=0x55601c7567d0) at
./build/../src/backend/executor/nodeAgg.c:2261
node = 0x55601c7567d0
result = 0x0
#5 0x0000555fe858c959 in ExecProcNode (node=0x55601c7567d0) at
./build/../src/include/executor/executor.h:315
No locals.
#6 spool_tuples (winstate=winstate@entry=0x55601c7561b8,
pos=pos@entry=57) at ./build/../src/backend/executor/nodeWindowAgg.c:1326
node = 0x55601eb9add8
outerPlan = 0x55601c7567d0
outerslot = <optimized out>
oldcontext = 0x55601b1a5fb0
#7 0x0000555fe858cb20 in window_gettupleslot
(winobj=winobj@entry=0x55601c76b028, pos=57, slot=slot@entry=0x55601c766a20)
at ./build/../src/backend/executor/nodeWindowAgg.c:3145
winstate = 0x55601c7561b8
oldcontext = <optimized out>
__func__ = "window_gettupleslot"
#8 0x0000555fe858ec94 in eval_windowaggregates (winstate=0x55601c7561b8)
at ./build/../src/backend/executor/nodeWindowAgg.c:936
ret = <optimized out>
aggregatedupto_nonrestarted = 0
econtext = 0x55601c7566c8
agg_row_slot = <optimized out>
peraggstate = <optimized out>
numaggs = <optimized out>
wfuncno = <optimized out>
numaggs_restart = <optimized out>
i = <optimized out>
oldContext = <optimized out>
agg_winobj = 0x55601c76b028
temp_slot = 0x55601c766b28
peraggstate = <optimized out>
wfuncno = <optimized out>
numaggs = <optimized out>
numaggs_restart = <optimized out>
i = <optimized out>
aggregatedupto_nonrestarted = <optimized out>
oldContext = <optimized out>
econtext = <optimized out>
agg_winobj = <optimized out>
agg_row_slot = <optimized out>
temp_slot = <optimized out>
__func__ = "eval_windowaggregates"
next_tuple = <optimized out>
__errno_location = <optimized out>
__errno_location = <optimized out>
ok = <optimized out>
ret = <optimized out>
result = <optimized out>
isnull = <optimized out>
#9 ExecWindowAgg (pstate=0x55601c7561b8) at
./build/../src/backend/executor/nodeWindowAgg.c:2300
winstate = 0x55601c7561b8
slot = <optimized out>
econtext = <optimized out>
i = <optimized out>
numfuncs = <optimized out>
__func__ = "ExecWindowAgg"
#10 0x0000555fe856476c in ExecProcNode (node=0x55601c7561b8) at
./build/../src/include/executor/executor.h:315
No locals.
#11 fetch_input_tuple (aggstate=aggstate@entry=0x55601c755a88) at
./build/../src/backend/executor/nodeAgg.c:563
slot = <optimized out>
#12 0x0000555fe8567ca9 in agg_retrieve_direct (aggstate=0x55601c755a88) at
./build/../src/backend/executor/nodeAgg.c:2450
econtext = 0x55601c7560b0
firstSlot = 0x55601c76b070
numGroupingSets = 1
node = 0x55601eb98630
tmpcontext = <optimized out>
peragg = 0x55601c76c218
outerslot = <optimized out>
nextSetSize = <optimized out>
pergroups = 0x55601c76d628
result = <optimized out>
hasGroupingSets = false
currentSet = <optimized out>
numReset = 1
i = <optimized out>
node = <optimized out>
econtext = <optimized out>
tmpcontext = <optimized out>
peragg = <optimized out>
pergroups = <optimized out>
outerslot = <optimized out>
firstSlot = <optimized out>
result = <optimized out>
hasGroupingSets = <optimized out>
numGroupingSets = <optimized out>
currentSet = <optimized out>
nextSetSize = <optimized out>
numReset = <optimized out>
i = <optimized out>
#13 ExecAgg (pstate=0x55601c755a88) at
./build/../src/backend/executor/nodeAgg.c:2265
node = 0x55601c755a88
result = 0x0
#14 0x0000555fe85877aa in ExecProcNode (node=<optimized out>) at
./build/../src/include/executor/executor.h:315
No locals.
#15 ExecScanSubPlan (node=0x55601f0e9610, econtext=0x55601ef91550,
isNull=0x55601ef8f395) at ./build/../src/backend/executor/nodeSubplan.c:275
subplan = <optimized out>
oldcontext = 0x55601f0e9610
slot = <optimized out>
astate = 0x0
planstate = <optimized out>
subLinkType = EXPR_SUBLINK
result = 0
found = false
l = <optimized out>
subplan = <optimized out>
planstate = <optimized out>
subLinkType = <optimized out>
oldcontext = <optimized out>
slot = <optimized out>
result = <optimized out>
found = <optimized out>
l = <optimized out>
astate = <optimized out>
__func__ = "ExecScanSubPlan"
l__state = <optimized out>
paramid = <optimized out>
tdesc = <error reading variable tdesc (Cannot access memory at
address 0x0)>
rowresult = <optimized out>
rownull = <optimized out>
col = <optimized out>
plst = <optimized out>
__errno_location = <optimized out>
__errno_location = <optimized out>
plst__state = <optimized out>
paramid = <optimized out>
prmdata = <optimized out>
dvalue = <optimized out>
disnull = <optimized out>
__errno_location = <optimized out>
plst__state = <optimized out>
paramid = <optimized out>
prmdata = <optimized out>
l__state = <optimized out>
paramid = <optimized out>
prmdata = <optimized out>
#16 ExecSubPlan (node=node@entry=0x55601ef91550,
econtext=econtext@entry=0x55601ef58798, isNull=0x55601ef8f395) at
./build/../src/backend/executor/nodeSubplan.c:89
subplan = <optimized out>
estate = 0x55601b1a60a8
dir = ForwardScanDirection
retval = <optimized out>
__func__ = "ExecSubPlan"
#17 0x0000555fe854d169 in ExecEvalSubPlan (state=<optimized out>,
op=<optimized out>, econtext=0x55601ef58798) at
./build/../src/backend/executor/execExprInterp.c:5316
sstate = 0x55601ef91550
sstate = <optimized out>
#18 ExecInterpExpr (state=0x55601ef8f390, econtext=0x55601ef58798,
isnull=<optimized out>) at
./build/../src/backend/executor/execExprInterp.c:2001
op = <optimized out>
resultslot = 0x55601ef8f180
innerslot = <optimized out>
outerslot = <optimized out>
scanslot = <optimized out>
oldslot = <optimized out>
newslot = <optimized out>
dispatch_table = {0x555fe854d9ce <ExecInterpExpr+4366>,
0x555fe854d9a3 <ExecInterpExpr+4323>, 0x555fe854d986 <ExecInterpExpr+4294>,
0x555fe854d969 <ExecInterpExpr+4265>, 0x555fe854d94c
<ExecInterpExpr+4236>, 0x555fe854d92f <ExecInterpExpr+4207>, 0x555fe854d90f
<ExecInterpExpr+4175>,
0x555fe854d8e0 <ExecInterpExpr+4128>, 0x555fe854d8b1
<ExecInterpExpr+4081>, 0x555fe854d882 <ExecInterpExpr+4034>, 0x555fe854d853
<ExecInterpExpr+3987>,
0x555fe854d821 <ExecInterpExpr+3937>, 0x555fe854d805
<ExecInterpExpr+3909>, 0x555fe854d7e9 <ExecInterpExpr+3881>, 0x555fe854d7cd
<ExecInterpExpr+3853>,
0x555fe854db2b <ExecInterpExpr+4715>, 0x555fe854db0c
<ExecInterpExpr+4684>, 0x555fe854daf4 <ExecInterpExpr+4660>, 0x555fe854dabf
<ExecInterpExpr+4607>,
0x555fe854da8a <ExecInterpExpr+4554>, 0x555fe854da55
<ExecInterpExpr+4501>, 0x555fe854da20 <ExecInterpExpr+4448>, 0x555fe854d9e8
<ExecInterpExpr+4392>,
0x555fe854dbca <ExecInterpExpr+4874>, 0x555fe854db91
<ExecInterpExpr+4817>, 0x555fe854db72 <ExecInterpExpr+4786>, 0x555fe854db47
<ExecInterpExpr+4743>,
0x555fe854d749 <ExecInterpExpr+3721>, 0x555fe854d729
<ExecInterpExpr+3689>, 0x555fe854d702 <ExecInterpExpr+3650>, 0x555fe854d6b0
<ExecInterpExpr+3568>,
0x555fe854de15 <ExecInterpExpr+5461>, 0x555fe854c985
<ExecInterpExpr+197>, 0x555fe854c990 <ExecInterpExpr+208>, 0x555fe854dddb
<ExecInterpExpr+5403>,
0x555fe854c94c <ExecInterpExpr+140>, 0x555fe854c957
<ExecInterpExpr+151>, 0x555fe854ddac <ExecInterpExpr+5356>, 0x555fe854dd92
<ExecInterpExpr+5330>,
0x555fe854dd5a <ExecInterpExpr+5274>, 0x555fe854dd47
<ExecInterpExpr+5255>, 0x555fe854de96 <ExecInterpExpr+5590>, 0x555fe854de76
<ExecInterpExpr+5558>,
0x555fe854de4c <ExecInterpExpr+5516>, 0x555fe854de2d
<ExecInterpExpr+5485>, 0x555fe854decd <ExecInterpExpr+5645>, 0x555fe854deb6
<ExecInterpExpr+5622>,
0x555fe854d7b9 <ExecInterpExpr+3833>, 0x555fe854d790
<ExecInterpExpr+3792>, 0x555fe854dcff <ExecInterpExpr+5183>, 0x555fe854dcd6
<ExecInterpExpr+5142>,
0x555fe854dcad <ExecInterpExpr+5101>, 0x555fe854dc71
<ExecInterpExpr+5041>, 0x555fe854dc59 <ExecInterpExpr+5017>, 0x555fe854dc43
<ExecInterpExpr+4995>,
0x555fe854dc17 <ExecInterpExpr+4951>, 0x555fe854dbf2
<ExecInterpExpr+4914>, 0x555fe854df3b <ExecInterpExpr+5755>, 0x555fe854dd28
<ExecInterpExpr+5224>,
0x555fe854def2 <ExecInterpExpr+5682>, 0x555fe854d69b
<ExecInterpExpr+3547>, 0x555fe854d663 <ExecInterpExpr+3491>, 0x555fe854d62b
<ExecInterpExpr+3435>,
0x555fe854d5a0 <ExecInterpExpr+3296>, 0x555fe854d58b
<ExecInterpExpr+3275>, 0x555fe833c321 <ExecInterpExpr.cold>, 0x555fe854d47f
<ExecInterpExpr+3007>,
0x555fe854d44b <ExecInterpExpr+2955>, 0x555fe854d436
<ExecInterpExpr+2934>, 0x555fe854d494 <ExecInterpExpr+3028>, 0x555fe854d404
<ExecInterpExpr+2884>,
0x555fe854d3cd <ExecInterpExpr+2829>, 0x555fe854d382
<ExecInterpExpr+2754>, 0x555fe854d355 <ExecInterpExpr+2709>, 0x555fe854d33d
<ExecInterpExpr+2685>,
0x555fe854d36a <ExecInterpExpr+2730>, 0x555fe854d325
<ExecInterpExpr+2661>, 0x555fe854d307 <ExecInterpExpr+2631>, 0x555fe854d2fe
<ExecInterpExpr+2622>,
0x555fe854c932 <ExecInterpExpr+114>, 0x555fe854c936
<ExecInterpExpr+118>, 0x555fe854d4f1 <ExecInterpExpr+3121>, 0x555fe854d4d1
<ExecInterpExpr+3089>,
0x555fe854d558 <ExecInterpExpr+3224>, 0x555fe854d543
<ExecInterpExpr+3203>, 0x555fe854d56f <ExecInterpExpr+3247>, 0x555fe854d2bf
<ExecInterpExpr+2559>,
0x555fe854d28c <ExecInterpExpr+2508>, 0x555fe854d257
<ExecInterpExpr+2455>, 0x555fe854d224 <ExecInterpExpr+2404>, 0x555fe854d2e6
<ExecInterpExpr+2598>,
0x555fe854d52e <ExecInterpExpr+3182>, 0x555fe854d516
<ExecInterpExpr+3158>, 0x555fe854d20f <ExecInterpExpr+2383>, 0x555fe854d1f7
<ExecInterpExpr+2359>,
0x555fe854d1e2 <ExecInterpExpr+2338>, 0x555fe854d1bf
<ExecInterpExpr+2303>, 0x555fe854d1a7 <ExecInterpExpr+2279>, 0x555fe854d125
<ExecInterpExpr+2149>,
0x555fe854d0fa <ExecInterpExpr+2106>, 0x555fe854d0e5
<ExecInterpExpr+2085>, 0x555fe854d0b2 <ExecInterpExpr+2034>, 0x555fe854d16e
<ExecInterpExpr+2222>,
0x555fe854d13a <ExecInterpExpr+2170>, 0x555fe854d186
<ExecInterpExpr+2246>, 0x555fe854c9c0 <ExecInterpExpr+256>, 0x555fe854d072
<ExecInterpExpr+1970>,
0x555fe854d051 <ExecInterpExpr+1937>, 0x555fe854d013
<ExecInterpExpr+1875>, 0x555fe854cfee <ExecInterpExpr+1838>, 0x555fe854cf2a
<ExecInterpExpr+1642>,
0x555fe854ce6f <ExecInterpExpr+1455>, 0x555fe854cdbe
<ExecInterpExpr+1278>, 0x555fe854ccb9 <ExecInterpExpr+1017>, 0x555fe854cbbf
<ExecInterpExpr+767>,
0x555fe854cab8 <ExecInterpExpr+504>, 0x555fe854ca98
<ExecInterpExpr+472>, 0x555fe854ca78 <ExecInterpExpr+440>, 0x555fe854ca48
<ExecInterpExpr+392>,
0x555fe854cba7 <ExecInterpExpr+743>, 0x555fe833c330
<ExecInterpExpr-2164112>}
#19 0x0000555fe85664cf in ExecEvalExprNoReturn (state=0x55601ef8f390,
econtext=0x55601ef58798) at ./build/../src/include/executor/executor.h:419
retDatum = <optimized out>
retDatum = <optimized out>
#20 ExecEvalExprNoReturnSwitchContext (state=0x55601ef8f390,
econtext=0x55601ef58798) at ./build/../src/include/executor/executor.h:460
oldContext = 0x55601b1a5fb0
oldContext = <optimized out>
#21 ExecProject (projInfo=0x55601ef8f388) at
./build/../src/include/executor/executor.h:492
econtext = 0x55601ef58798
state = 0x55601ef8f390
slot = 0x55601ef8f180
#22 project_aggregates (aggstate=<optimized out>) at
./build/../src/backend/executor/nodeAgg.c:1383
econtext = <optimized out>
#23 project_aggregates (aggstate=<optimized out>) at
./build/../src/backend/executor/nodeAgg.c:1370
econtext = <optimized out>
#24 0x0000555fe8567a79 in agg_retrieve_direct (aggstate=0x55601ef556c8) at
./build/../src/backend/executor/nodeAgg.c:2613
econtext = 0x55601ef58798
firstSlot = 0x55601ef8ef78
numGroupingSets = 1
node = <optimized out>
tmpcontext = <optimized out>
peragg = 0x55601ef8f8e0
outerslot = <optimized out>
nextSetSize = <optimized out>
pergroups = 0x55601ef8b9a0
result = <optimized out>
hasGroupingSets = false
currentSet = <optimized out>
numReset = <optimized out>
i = <optimized out>
node = <optimized out>
econtext = <optimized out>
tmpcontext = <optimized out>
peragg = <optimized out>
pergroups = <optimized out>
outerslot = <optimized out>
firstSlot = <optimized out>
result = <optimized out>
hasGroupingSets = <optimized out>
numGroupingSets = <optimized out>
currentSet = <optimized out>
nextSetSize = <optimized out>
numReset = <optimized out>
i = <optimized out>
#25 ExecAgg (pstate=0x55601ef556c8) at
./build/../src/backend/executor/nodeAgg.c:2265
node = 0x55601ef556c8
result = 0x0
#26 0x0000555fe855c23d in ExecScanFetch (node=<optimized out>,
epqstate=<optimized out>, accessMtd=<optimized out>, recheckMtd=<optimized
out>)
at ./build/../src/include/executor/execScan.h:126
No locals.
#27 ExecScanExtended (node=<optimized out>, accessMtd=0x555fe8588d50
<SubqueryNext>, recheckMtd=0x555fe8588d20 <SubqueryRecheck>, epqstate=0x0,
qual=0x0,
projInfo=0x55601ef9d680) at
./build/../src/include/executor/execScan.h:187
slot = <optimized out>
econtext = 0x55601ef58470
econtext = <optimized out>
slot = <optimized out>
#28 ExecScan (node=0x55601ef58368, accessMtd=0x555fe8588d50
<SubqueryNext>, recheckMtd=0x555fe8588d20 <SubqueryRecheck>)
at ./build/../src/backend/executor/execScan.c:59
epqstate = 0x0
qual = 0x0
projInfo = 0x55601ef9d680
#29 0x0000555fe8583f0e in ExecProcNode (node=0x55601ef58368) at
./build/../src/include/executor/executor.h:315
No locals.
#30 ExecNestLoop (pstate=<optimized out>) at
./build/../src/backend/executor/nodeNestloop.c:159
node = <optimized out>
nl = 0x55601b1224c8
innerPlan = 0x55601ef58368
outerPlan = <optimized out>
outerTupleSlot = <optimized out>
innerTupleSlot = <optimized out>
joinqual = <optimized out>
otherqual = <optimized out>
econtext = 0x55601eda5b60
lc = <optimized out>
#31 0x0000555fe8586ce6 in ExecProcNode (node=0x55601eda5a58) at
./build/../src/include/executor/executor.h:315
No locals.
#32 ExecSort (pstate=0x55601eda5850) at
./build/../src/backend/executor/nodeSort.c:149
plannode = <optimized out>
outerNode = 0x55601eda5a58
tupDesc = <optimized out>
tuplesortopts = <optimized out>
node = 0x55601eda5850
estate = 0x55601b1a60a8
dir = ForwardScanDirection
tuplesortstate = 0x55601b15dfa8
slot = <optimized out>
#33 0x0000555fe856476c in ExecProcNode (node=0x55601eda5850) at
./build/../src/include/executor/executor.h:315
No locals.
#34 fetch_input_tuple (aggstate=aggstate@entry=0x55601eda5130) at
./build/../src/backend/executor/nodeAgg.c:563
slot = <optimized out>
#35 0x0000555fe8567ca9 in agg_retrieve_direct (aggstate=0x55601eda5130) at
./build/../src/backend/executor/nodeAgg.c:2450
econtext = 0x55601eda5748
firstSlot = 0x55601efa0970
numGroupingSets = 1
node = 0x55601b458fb8
tmpcontext = <optimized out>
peragg = 0x55601efa1f40
outerslot = <optimized out>
nextSetSize = <optimized out>
pergroups = 0x55601efa2148
result = <optimized out>
hasGroupingSets = false
currentSet = <optimized out>
numReset = 1
i = <optimized out>
node = <optimized out>
econtext = <optimized out>
tmpcontext = <optimized out>
peragg = <optimized out>
pergroups = <optimized out>
outerslot = <optimized out>
firstSlot = <optimized out>
result = <optimized out>
hasGroupingSets = <optimized out>
numGroupingSets = <optimized out>
currentSet = <optimized out>
nextSetSize = <optimized out>
numReset = <optimized out>
i = <optimized out>
#36 ExecAgg (pstate=0x55601eda5130) at
./build/../src/backend/executor/nodeAgg.c:2265
node = 0x55601eda5130
result = 0x0
#37 0x0000555fe8579bc9 in ExecProcNode (node=0x55601eda5130) at
./build/../src/include/executor/executor.h:315
No locals.
#38 ExecLimit (pstate=0x55601eda4e20) at
./build/../src/backend/executor/nodeLimit.c:95
node = 0x55601eda4e20
econtext = 0x55601eda5028
direction = <optimized out>
slot = <optimized out>
outerPlan = 0x55601eda5130
__func__ = "ExecLimit"
#39 0x0000555fe855191b in ExecProcNode (node=0x55601eda4e20) at
./build/../src/include/executor/executor.h:315
No locals.
#40 ExecutePlan (queryDesc=0x55601b1a9f18, operation=CMD_SELECT,
sendTuples=true, numberTuples=0, direction=<optimized out>,
dest=0x55601aed55a0)
at ./build/../src/backend/executor/execMain.c:1697
estate = 0x55601b1a60a8
use_parallel_mode = <optimized out>
slot = <optimized out>
planstate = 0x55601eda4e20
current_tuple_count = 0
estate = <optimized out>
planstate = <optimized out>
use_parallel_mode = <optimized out>
slot = <optimized out>
current_tuple_count = <optimized out>
#41 standard_ExecutorRun (queryDesc=0x55601b1a9f18, direction=<optimized
out>, count=0) at ./build/../src/backend/executor/execMain.c:366
estate = 0x55601b1a60a8
operation = CMD_SELECT
dest = 0x55601aed55a0
sendTuples = <optimized out>
oldcontext = 0x55601b0f7980
#42 0x0000555fe872c2a7 in PortalRunSelect
(portal=portal@entry=0x55601afc2718, forward=forward@entry=true, count=0,
count@entry=9223372036854775807,
dest=dest@entry=0x55601aed55a0) at
./build/../src/backend/tcop/pquery.c:921
queryDesc = 0x55601b1a9f18
direction = <optimized out>
nprocessed = <optimized out>
__func__ = "PortalRunSelect"
#43 0x0000555fe872d8a0 in PortalRun (portal=portal@entry=0x55601afc2718,
count=9223372036854775807, isTopLevel=isTopLevel@entry=true,
dest=dest@entry=0x55601aed55a0,
altdest=altdest@entry=0x55601aed55a0, qc=qc@entry=0x7ffe304161c0) at
./build/../src/backend/tcop/pquery.c:765
_save_exception_stack = 0x7ffe304162a0
_save_context_stack = 0x7ffe30416280
_local_sigjmp_buf = {{__jmpbuf = {3, -100083351355759034,
93871257954072, 93871256982944, 0, 0, -100083352486123962,
-6061881521657190842},
__mask_was_saved = 0, __saved_mask = {__val = {93870411175012,
1759917036, 832786, 140729708011496, 5232754935419077376, 140729708011568,
93870410174331,
0, 93870411648949, 93871262790800, 52352, 93871256981664,
93870411676182, 140729708011568, 3, 140729708011568}}}}
_do_rethrow = <optimized out>
result = <optimized out>
nprocessed = <optimized out>
saveTopTransactionResourceOwner = 0x55601af25b28
saveTopTransactionContext = 0x55601afd83e0
saveActivePortal = 0x0
saveResourceOwner = 0x55601af25b28
savePortalContext = 0x0
saveMemoryContext = 0x55601aed50a0
__func__ = "PortalRun"
#44 0x0000555fe872a65b in exec_execute_message (portal_name=0x55601aed5198
"", max_rows=<optimized out>) at ./build/../src/backend/tcop/postgres.c:2272
portal = 0x55601afc2718
sourceText = 0x55601b6f9160 "-- NO KILL
\nselect\n\n\tt.*\n\t\n\nfrom\n\n\t(select\t\n\t\treport_id,\n\t\tshop_id,\t\t\t\t\n\t\t\n\t\tmax(uncalc_cnt)
as uncalc_cnt,\n\t\tmax(bad_cnt) as bad_cnt,\n\t\tjsonb_agg(row_to_json(t.*)
order by ordering_path) as kpis,\n\t\tm"...
prepStmtName = 0x555fe88f7d3f "<unnamed>"
was_logged = false
cmdtaglen = 6
dest = DestRemoteExecute
completed = <optimized out>
qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
portalParams = 0x55601b0f7a78
save_log_statement_stats = false
is_xact_command = false
msec_str =
"0\371\003\000\000\000\000\000\360cA0\376\177\000\000\000\000\000\000\000\000\000\000\265\217\212\350_U\000"
params_data = {portalName = 0x55601afc6100 "", params =
0x55601b0f7a78}
params_errcxt = {previous = 0x0, callback = 0x555fe85e7c30
<ParamsErrorCallback>, arg = 0x7ffe304161d0}
receiver = 0x55601aed55a0
execute_is_fetch = false
cmdtagname = <optimized out>
lc = <optimized out>
dest = <optimized out>
receiver = <optimized out>
portal = <optimized out>
completed = <optimized out>
qc = <optimized out>
sourceText = <optimized out>
prepStmtName = <optimized out>
portalParams = <error reading variable portalParams (Cannot access
memory at address 0x0)>
save_log_statement_stats = <optimized out>
is_xact_command = <optimized out>
execute_is_fetch = <optimized out>
was_logged = <optimized out>
msec_str = <optimized out>
params_data = <optimized out>
params_errcxt = <optimized out>
cmdtagname = <optimized out>
cmdtaglen = <optimized out>
lc = <optimized out>
__func__ = "exec_execute_message"
__errno_location = <optimized out>
lc__state = <optimized out>
stmt = <optimized out>
lc__state = <optimized out>
stmt = <optimized out>
__errno_location = <optimized out>
__errno_location = <optimized out>
__errno_location = <optimized out>
__errno_location = <optimized out>
Best wishes,
Yuri Zamyatin
On Thu, 2025-10-09 at 08:35 +0000, PG Bug reporting form wrote:
> - Disabled enable_hashagg for some queries (just now)
Hi,
How sure are you that the crashes are happening without HashAgg? I made
some changes in that area in v18, and a lot of the evidence points in
that direction. Please let me know if you are able to confirm that the
simpler (non-hashagg) plan you showed is actually crashing.
Regards,
Jeff Davis
Hi. I was able to reproduce the crash with the simpler (non-hashagg) plan from the previous message.
Basically, I launched it in multiple infinite loops that do BEGIN - UPDATE - ROLLBACK. Other clients could also modify the tables during this time.
We've seen this query crash on multiple physical hosts.
Original query:
update Tcv_scenes cs
set
state_id=2,
stitching_server_id=null,
stitching_server_pid=null
from
tcv_scene_datas cd -- Partition key: RANGE (cv_scene_id)
where
cd.cv_scene_id=cs.id and
(
(cs.state_id=7 and cs.date_cr<now()-interval '24 hours' and cs.date_state_mo>now()-interval '15 minutes' and cd.stitcher_result::text like '%download%') or
(cs.state_id=3 and cs.date_state_mo<now()-interval '5 minutes')
)
returning cs.id
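One reproduction loop, sketched (repro.sql is a hypothetical file holding the
BEGIN / UPDATE / ROLLBACK batch above; mydb is a placeholder):

# Run the BEGIN-UPDATE-ROLLBACK batch in an infinite loop
while true; do
    psql -d mydb -f repro.sql >/dev/null
done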
GDB Stack Trace:
#0 0x0000555fe8678300 in PartitionDirectoryLookup (pdir=0x0, rel=0x7f14d172b288) at ./build/../src/backend/partitioning/partdesc.c:462
pde = <optimized out>
relid = 21856
found = 27
#1 0x0000555fe8558b51 in InitExecPartitionPruneContexts (prunestate=<optimized out>, parent_plan=0x55601c213448, initially_valid_subplans=<optimized out>,
n_total_subplans=<optimized out>) at ./build/../src/backend/executor/execPartition.c:2413
partkey = 0x55601b0244c0
partdesc = <optimized out>
pprune = <optimized out>
nparts = 239
k = <optimized out>
prunedata = 0x55601b7ea748
j = <optimized out>
estate = <optimized out>
new_subplan_indexes = <optimized out>
new_other_subplans = <optimized out>
i = 0
newidx = <optimized out>
fix_subplan_map = <optimized out>
estate = <optimized out>
new_subplan_indexes = <optimized out>
new_other_subplans = <error reading variable new_other_subplans (Cannot access memory at address 0x0)>
i = <optimized out>
newidx = <optimized out>
fix_subplan_map = <optimized out>
prunedata = <error reading variable prunedata (Cannot access memory at address 0x0)>
j = <optimized out>
pprune = <optimized out>
nparts = <optimized out>
k = <optimized out>
partkey = <optimized out>
partdesc = <optimized out>
oldidx = <optimized out>
subidx = <optimized out>
subprune = <optimized out>
#2 ExecInitPartitionExecPruning (planstate=planstate@entry=0x55601c213448, n_total_subplans=<optimized out>, part_prune_index=<optimized out>, relids=<optimized out>,
initially_valid_subplans=initially_valid_subplans@entry=0x7ffe30415500) at ./build/../src/backend/executor/execPartition.c:1934
prunestate = <optimized out>
estate = <optimized out>
pruneinfo = <optimized out>
__func__ = "ExecInitPartitionExecPruning"
#3 0x0000555fe856b030 in ExecInitAppend (node=node@entry=0x55601b8531f8, estate=estate@entry=0x55601c1fb0d8, eflags=eflags@entry=0)
at ./build/../src/backend/executor/nodeAppend.c:147
prunestate = <optimized out>
appendstate = 0x55601c213448
appendplanstates = <optimized out>
appendops = <optimized out>
validsubplans = 0x55601c213650
asyncplans = <optimized out>
nplans = <optimized out>
nasyncplans = <optimized out>
firstvalid = <optimized out>
i = <optimized out>
j = <optimized out>
#4 0x0000555fe8559ad5 in ExecInitNode (node=0x55601b8531f8, estate=estate@entry=0x55601c1fb0d8, eflags=0) at ./build/../src/backend/executor/execProcnode.c:182
result = <optimized out>
subps = <optimized out>
l = <optimized out>
__func__ = "ExecInitNode"
#5 0x0000555fe8584383 in ExecInitNestLoop (node=node@entry=0x55601b725a68, estate=estate@entry=0x55601c1fb0d8, eflags=<optimized out>, eflags@entry=0)
at ./build/../src/backend/executor/nodeNestloop.c:301
nlstate = 0x55601c1fbd80
__func__ = "ExecInitNestLoop"
#6 0x0000555fe85598f1 in ExecInitNode (node=node@entry=0x55601b725a68, estate=estate@entry=0x55601c1fb0d8, eflags=eflags@entry=0)
at ./build/../src/backend/executor/execProcnode.c:298
result = <optimized out>
subps = <optimized out>
l = <optimized out>
__func__ = "ExecInitNode"
#7 0x0000555fe855480f in EvalPlanQualStart (epqstate=0x55601b745d68, planTree=0x55601b725a68) at ./build/../src/backend/executor/execMain.c:3152
parentestate = <optimized out>
oldcontext = 0x55601b7e99b0
rtsize = <optimized out>
rcestate = 0x55601c1fb0d8
l = <optimized out>
parentestate = <optimized out>
rtsize = <optimized out>
rcestate = <optimized out>
oldcontext = <optimized out>
l = <optimized out>
i = <optimized out>
l__state = <optimized out>
subplan = <optimized out>
subplanstate = <optimized out>
l__state = <optimized out>
earm = <optimized out>
l__state = <optimized out>
rtindex = <optimized out>
#8 EvalPlanQualBegin (epqstate=epqstate@entry=0x55601b745d68) at ./build/../src/backend/executor/execMain.c:2930
parentestate = <optimized out>
recheckestate = <optimized out>
#9 0x0000555fe85549ab in EvalPlanQual (epqstate=0x55601b745d68, relation=relation@entry=0x7f14d1722d68, rti=1, inputslot=inputslot@entry=0x55601be51480)
at ./build/../src/backend/executor/execMain.c:2650
slot = <optimized out>
testslot = <optimized out>
#10 0x0000555fe858001d in ExecUpdate (context=context@entry=0x7ffe304157d0, resultRelInfo=resultRelInfo@entry=0x55601b745e88, tupleid=tupleid@entry=0x7ffe304157aa,
oldtuple=oldtuple@entry=0x0, oldSlot=<optimized out>, oldSlot@entry=0x55601be50c70, slot=slot@entry=0x55601be51078, canSetTag=true)
at ./build/../src/backend/executor/nodeModifyTable.c:2606
inputslot = 0x55601be51480
epqslot = <optimized out>
lockedtid = {ip_blkid = {bi_hi = 30, bi_lo = 53843}, ip_posid = 40}
estate = 0x55601b7e9aa8
resultRelationDesc = <optimized out>
updateCxt = {crossPartUpdate = false, updateIndexes = TU_None, lockmode = LockTupleNoKeyExclusive}
result = <optimized out>
__func__ = "ExecUpdate"
#11 0x0000555fe8581fff in ExecModifyTable (pstate=0x55601b745c80) at ./build/../src/backend/executor/nodeModifyTable.c:4510
node = 0x55601b745c80
context = {mtstate = 0x55601b745c80, epqstate = 0x55601b745d68, estate = 0x55601b7e9aa8, planSlot = 0x55601be4bef0, tmfd = {ctid = {ip_blkid = {bi_hi = 30,
bi_lo = 53844}, ip_posid = 13}, xmax = 2949858589, cmax = 4294967295, traversed = true}, cpDeletedSlot = 0x0, cpUpdateReturningSlot = 0x0}
estate = 0x55601b7e9aa8
operation = CMD_UPDATE
resultRelInfo = 0x55601b745e88
subplanstate = <optimized out>
slot = 0x55601be51078
oldSlot = 0x55601be50c70
tuple_ctid = {ip_blkid = {bi_hi = 30, bi_lo = 53844}, ip_posid = 13}
oldtupdata = {t_len = 2675325712, t_self = {ip_blkid = {bi_hi = 32475, bi_lo = 0}, ip_posid = 32265}, t_tableOid = 0, t_data = 0xf0}
oldtuple = 0x0
tupleid = <optimized out>
tuplock = false
__func__ = "ExecModifyTable"
#12 0x0000555fe855954d in ExecProcNodeInstr (node=0x55601b745c80) at ./build/../src/backend/executor/execProcnode.c:485
result = <optimized out>
#13 0x0000555fe855191b in ExecProcNode (node=0x55601b745c80) at ./build/../src/include/executor/executor.h:315
No locals.
#14 ExecutePlan (queryDesc=0x55601b737af0, operation=CMD_UPDATE, sendTuples=true, numberTuples=0, direction=<optimized out>, dest=0x555fe8bd2ec0 <donothingDR>)
at ./build/../src/backend/executor/execMain.c:1697
estate = 0x55601b7e9aa8
use_parallel_mode = <optimized out>
slot = <optimized out>
planstate = 0x55601b745c80
current_tuple_count = 0
estate = <optimized out>
planstate = <optimized out>
use_parallel_mode = <optimized out>
slot = <optimized out>
current_tuple_count = <optimized out>
#15 standard_ExecutorRun (queryDesc=0x55601b737af0, direction=<optimized out>, count=0) at ./build/../src/backend/executor/execMain.c:366
estate = 0x55601b7e9aa8
operation = CMD_UPDATE
dest = 0x555fe8bd2ec0 <donothingDR>
sendTuples = <optimized out>
oldcontext = 0x55601b01f490
#16 0x0000555fe84e2e1c in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x55601b73b2d0, into=into@entry=0x0, es=es@entry=0x55601b0218e0,
queryString=queryString@entry=0x55601aed5198 "explain(buffers,verbose,analyze)update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 an"..., params=params@entry=0x0, queryEnv=queryEnv@entry=0x0,
planduration=0x7ffe30415aa8, bufusage=0x7ffe30415b50, mem_counters=0x0) at ./build/../src/backend/commands/explain.c:579
dir = <optimized out>
dest = 0x555fe8bd2ec0 <donothingDR>
queryDesc = 0x55601b737af0
starttime = <optimized out>
totaltime = 0
eflags = <optimized out>
instrument_option = <optimized out>
serializeMetrics = {bytesSent = 0, timeSpent = {ticks = 0}, bufferUsage = {shared_blks_hit = <optimized out>, shared_blks_read = <optimized out>,
shared_blks_dirtied = <optimized out>, shared_blks_written = <optimized out>, local_blks_hit = <optimized out>, local_blks_read = <optimized out>,
local_blks_dirtied = <optimized out>, local_blks_written = <optimized out>, temp_blks_read = <optimized out>, temp_blks_written = <optimized out>,
shared_blk_read_time = {ticks = <optimized out>}, shared_blk_write_time = {ticks = <optimized out>}, local_blk_read_time = {ticks = <optimized out>},
local_blk_write_time = {ticks = <optimized out>}, temp_blk_read_time = {ticks = <optimized out>}, temp_blk_write_time = {ticks = <optimized out>}}}
#17 0x0000555fe84e34c4 in standard_ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x55601b0218e0,
queryString=0x55601aed5198 "explain(buffers,verbose,analyze)update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 an"..., params=0x0, queryEnv=0x0) at ./build/../src/backend/commands/explain.c:372
plan = 0x55601b73b2d0
planstart = <optimized out>
planduration = {ticks = 4659506}
bufusage_start = {shared_blks_hit = 18745, shared_blks_read = 0, shared_blks_dirtied = 0, shared_blks_written = 0, local_blks_hit = 0, local_blks_read = 0,
local_blks_dirtied = 0, local_blks_written = 0, temp_blks_read = 0, temp_blks_written = 0, shared_blk_read_time = {ticks = 0}, shared_blk_write_time = {
ticks = 0}, local_blk_read_time = {ticks = 0}, local_blk_write_time = {ticks = 0}, temp_blk_read_time = {ticks = 0}, temp_blk_write_time = {ticks = 0}}
bufusage = {shared_blks_hit = 16, shared_blks_read = 0, shared_blks_dirtied = 0, shared_blks_written = 0, local_blks_hit = 0, local_blks_read = 0,
local_blks_dirtied = 0, local_blks_written = 0, temp_blks_read = 0, temp_blks_written = 0, shared_blk_read_time = {ticks = 0}, shared_blk_write_time = {
ticks = 0}, local_blk_read_time = {ticks = 0}, local_blk_write_time = {ticks = 0}, temp_blk_read_time = {ticks = 0}, temp_blk_write_time = {ticks = 0}}
mem_counters = {nblocks = 93871265236088, freechunks = 93871265236088, totalspace = 139727390028424, freespace = 230455663}
planner_ctx = 0x0
saved_ctx = 0x0
#18 0x0000555fe84e3641 in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=<optimized out>, es=<optimized out>, pstate=<optimized out>,
params=<optimized out>) at ./build/../src/backend/commands/explain.c:309
No locals.
#19 0x0000555fe84e3733 in ExplainQuery (pstate=0x55601b01f728, stmt=0x55601b6b42d8, params=0x0, dest=0x55601b01f6a0) at ./build/../src/backend/commands/explain.c:223
l__state = {l = <optimized out>, i = 0}
l = 0x55601b197008
es = 0x55601b0218e0
tstate = <optimized out>
jstate = <optimized out>
query = <optimized out>
rewritten = 0x55601b196ff0
#20 0x0000555fe872f083 in standard_ProcessUtility (pstmt=0x55601b6b4370,
queryString=0x55601aed5198 "explain(buffers,verbose,analyze)update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 an"..., readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x55601b01f6a0, qc=0x7ffe30415db0) at ./build/../src/backend/tcop/utility.c:866
parsetree = 0x55601b6b42d8
isTopLevel = <optimized out>
isAtomicContext = true
pstate = 0x55601b01f728
readonly_flags = <optimized out>
__func__ = "standard_ProcessUtility"
#21 0x0000555fe872d231 in PortalRunUtility (portal=portal@entry=0x55601afc2718, pstmt=0x55601b6b4370, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x55601b01f6a0, qc=qc@entry=0x7ffe30415db0) at ./build/../src/backend/tcop/pquery.c:1153
No locals.
#22 0x0000555fe872d5ef in FillPortalStore (portal=portal@entry=0x55601afc2718, isTopLevel=isTopLevel@entry=true) at ./build/../src/backend/tcop/pquery.c:1026
treceiver = 0x55601b01f6a0
qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
__func__ = "FillPortalStore"
#23 0x0000555fe872d96f in PortalRun (portal=portal@entry=0x55601afc2718, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
dest=dest@entry=0x55601b0ad0b0, altdest=altdest@entry=0x55601b0ad0b0, qc=qc@entry=0x7ffe30415fc0) at ./build/../src/backend/tcop/pquery.c:760
_save_exception_stack = 0x7ffe304162a0
_save_context_stack = 0x0
_local_sigjmp_buf = {{__jmpbuf = {93871257954072, -100083352534358458, 93871265235712, 140729708011456, 93871258914992, 93871265235752, -100083352490318266,
-6061881521657190842}, __mask_was_saved = 0, __saved_mask = {__val = {0, 140728898420737, 93869327402605, 93871257965608, 93870412007286,
140729708011280, 93871257954072, 93870412007286, 1, 93871258914920, 93871265235752, 140729708011344, 93870411676182, 140729708011344, 2,
140729708011344}}}}
_do_rethrow = <optimized out>
result = <optimized out>
nprocessed = <optimized out>
saveTopTransactionResourceOwner = 0x55601af27370
saveTopTransactionContext = 0x55601afd83e0
saveActivePortal = 0x0
saveResourceOwner = 0x55601af27370
savePortalContext = 0x0
saveMemoryContext = 0x55601afd83e0
__func__ = "PortalRun"
#24 0x0000555fe8729668 in exec_simple_query (
query_string=0x55601aed5198 "explain(buffers,verbose,analyze)update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 an"...) at ./build/../src/backend/tcop/postgres.c:1273
cmdtaglen = 7
snapshot_set = <optimized out>
per_parsetree_context = 0x0
plantree_list = <optimized out>
parsetree = 0x55601b6b4300
commandTag = <optimized out>
qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
querytree_list = <optimized out>
portal = 0x55601afc2718
receiver = 0x55601b0ad0b0
format = 0
cmdtagname = <optimized out>
parsetree_item__state = {l = 0x55601b6b4328, i = 0}
dest = DestRemote
oldcontext = 0x55601afd83e0
parsetree_list = 0x55601b6b4328
parsetree_item = 0x55601b6b4340
save_log_statement_stats = false
was_logged = false
use_implicit_block = false
msec_str = "\340\031\301\350_U\000\000Q\000\000\000\000\000\000\000\000bA0\376\177\000\000\004\000\000\000\000\000\000"
__func__ = "exec_simple_query"
#25 0x0000555fe872b56d in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4766
query_string = 0x55601aed5198 "explain(buffers,verbose,analyze)update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 an"...
firstchar = <optimized out>
input_message = {
data = 0x55601aed5198 "explain(buffers,verbose,analyze)update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 an"..., len = 431, maxlen = 1024, cursor = 431}
local_sigjmp_buf = {{__jmpbuf = {140729708012000, -100083351217347002, 2753760000, 4, 0, 1, -100083351357856186, -6061881523619469754}, __mask_was_saved = 1,
__saved_mask = {__val = {4194304, 135168, 5232754935419077376, 16, 260416, 18446744073709551312, 260400, 0, 16274, 139727397133088, 139727395799228,
93870411867664, 139727390638096, 0, 18446744073709551312, 93871256618032}}}}
send_ready_for_query = false
idle_in_transaction_timeout_enabled = false
idle_session_timeout_enabled = false
__func__ = "PostgresMain"
#26 0x0000555fe8725a33 in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ./build/../src/backend/tcop/backend_startup.c:124
bsdata = <optimized out>
#27 0x0000555fe8683cfd in postmaster_child_launch (child_type=B_BACKEND, child_slot=316, startup_data=startup_data@entry=0x7ffe304164c0,
startup_data_len=startup_data_len@entry=24, client_sock=client_sock@entry=0x7ffe304164e0) at ./build/../src/backend/postmaster/launch_backend.c:290
pid = <optimized out>
#28 0x0000555fe8687802 in BackendStartup (client_sock=0x7ffe304164e0) at ./build/../src/backend/postmaster/postmaster.c:3587
bn = 0x7f14d1b05b50
pid = <optimized out>
startup_data = {canAcceptConnections = CAC_OK, socket_created = 813354333582584, fork_started = 813354333582603}
cac = <optimized out>
bn = <optimized out>
pid = <optimized out>
startup_data = <optimized out>
cac = <optimized out>
__func__ = "BackendStartup"
__errno_location = <optimized out>
save_errno = <optimized out>
__errno_location = <optimized out>
__errno_location = <optimized out>
#29 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1702
s = {sock = 10, raddr = {addr = {ss_family = 2,
__ss_padding = "\305\370\274|$\247\000\000\000\000\000\000\000\000K\323\352\032`U\000\000\000\000\000\000\000\000\000\000PeA0\376\177\000\000 eA0\376\177\000\000\000\004\000\000\000\000\000\000@\323\352\032`U\000\000\213y\210\350_U", '\000' <repeats 18 times>, "peA0\376\177\000\000x\344\217\350_U\000\000\000\000\000\000\000\000\000\000\255\226\311\321\024\177\000", __ss_align = 1}, salen = 16}}
i = 0
now = <optimized out>
last_lockfile_recheck_time = 1760039078
last_touch_time = 1760036170
events = {{pos = 1, events = 2, fd = 6, user_data = 0x0}, {pos = 0, events = 0, fd = 6, user_data = 0x0}, {pos = 0, events = 0, fd = 8, user_data = 0x0}, {
pos = 658, events = 21855, fd = 451405112, user_data = 0x400000000aa}, {pos = 0, events = 21856, fd = 451597131, user_data = 0x0}, {pos = -1303149824,
events = 1218345699, fd = 451413120, user_data = 0x555fe8c28f60 <errordata>}, {pos = 809592352, events = 32766, fd = -393725350, user_data = 0xf}, {
pos = 0, events = 0, fd = 809592432, user_data = 0x0}, {pos = 809592432, events = 32766, fd = 451930800, user_data = 0x555fe88eca37}, {pos = -389995904,
events = 21855, fd = 0, user_data = 0x555fe88ce239 <pg_freeaddrinfo_all+73>}, {pos = 8, events = 0, fd = 809592672, user_data = 0x7ffe30416fa0}, {
pos = -396723022, events = 21855, fd = 451936489, user_data = 0x15381af000f2}, {pos = 451767480, events = 21856, fd = 809592672,
user_data = 0x7ffe304166bc}, {pos = 1, events = 1, fd = 451936565, user_data = 0x1e8afb0b4}, {pos = 451930800, events = 21856, fd = -393223028,
user_data = 0x100000001}, {pos = 1, events = 0, fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = 0, user_data = 0x7f0032333435}, {pos = -393347968,
events = 21855, fd = 451936712, user_data = 0x55601af001d2}, {pos = 451936739, events = 21856, fd = 15729133, user_data = 0x556000000000}, {pos = 0,
events = 0, fd = 0, user_data = 0x556000000000}, {pos = 0, events = 21760, fd = -771537064, user_data = 0x6e75722f7261762f}, {pos = 1936683055,
events = 1701996404, fd = 795636083, user_data = 0x3334352e4c515347}, {pos = -393281486, events = 21855, fd = -393262758, user_data = 0x7ffe30416dd0}, {
pos = -393347879, events = 21855, fd = 0, user_data = 0x0}, {pos = 809594368, events = 32766, fd = -393348049, user_data = 0x7ffe30416e10}, {pos = 9305135,
events = 0, fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = -771537064, user_data = 0x0}, {pos = -775242182, events = 32532, fd = 0,
user_data = 0xff0}, {pos = 0, events = 538976256, fd = -771537064, user_data = 0x5420706100000000}, {pos = -773907776, events = 32532, fd = 255,
user_data = 0xfffffffffffffed0}, {pos = 0, events = 0, fd = 399, user_data = 0x55601ae7b2b0}, {pos = -775242182, events = 32532, fd = 665957,
user_data = 0xdf20}, {pos = 0, events = 0, fd = 10, user_data = 0x0}, {pos = -773907776, events = 32532, fd = 255, user_data = 0xfffffffffffffed0}, {
pos = -773914672, events = 32532, fd = 8, user_data = 0x7f14d1deffd0 <_IO_file_jumps>}, {pos = -775238318, events = 32532, fd = 2996,
user_data = 0x55601ae7b2b0}, {pos = 4096, events = 0, fd = 809593520, user_data = 0x7f14d1deffd0 <_IO_file_jumps>}, {pos = -775389748, events = 32532,
fd = 26, user_data = 0x1397}, {pos = 1, events = 0, fd = 33152, user_data = 0x70}, {pos = 0, events = 0, fd = 1, user_data = 0x100000000}, {pos = 2,
events = 17, fd = 0, user_data = 0x3}, {pos = 0, events = 1, fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = 0, user_data = 0x0}, {pos = 0,
events = 0, fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = 0, user_data = 0x55601ae7b2b0}, {pos = 8,
events = 0, fd = -773907776, user_data = 0x802}, {pos = -304, events = 4294967295, fd = 5, user_data = 0x555fe88e5dd6}, {pos = -393256156, events = 21855,
fd = -775238318, user_data = 0x7ffe30416a50}, {pos = -393342935, events = 21855, fd = 32768, user_data = 0x9}, {pos = 809593680, events = 32766,
fd = -775001723, user_data = 0x7f0000000000}, {pos = 9, events = 0, fd = 809593648, user_data = 0x7f14d1ca90cd}, {pos = 2429, events = 0, fd = 32832,
user_data = 0x55601af00ad0}, {pos = 32832, events = 0, fd = 451971856, user_data = 0x7f14d1caa4f8}, {pos = 9, events = 0, fd = 451939024,
user_data = 0xfffffffffffffe98}, {pos = 0, events = 0, fd = 2050, user_data = 0x7f14d1cad3c0 <free+384>}, {pos = 544854009, events = 0, fd = 1759619142,
user_data = 0x2079cff9}, {pos = 0, events = 0, fd = 0, user_data = 0x9}, {pos = 809593648, events = 32766, fd = 809593680, user_data = 0x55601af00ae0}, {
pos = -393323050, events = 21855, fd = -393256156, user_data = 0x7f14d1ce678d <closedir+13>}, {pos = 451459056, events = 21856, fd = -395355527,
user_data = 0x55601af00ae0}, {pos = -393322945, events = 21855, fd = 809594800, user_data = 0x555fe86f90c8 <RemovePgTempFiles+312>}, {pos = 451541152,
events = 21856, fd = 0, user_data = 0x7367702f65736162}, {pos = 1952410737, events = 1207988333, fd = 771766842, user_data = 0x7f14d1cabe3a}}
nevents = <optimized out>
__func__ = "ServerLoop"
#30 0x0000555fe8689110 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x55601ae7c310) at ./build/../src/backend/postmaster/postmaster.c:1400
opt = <optimized out>
status = <optimized out>
userDoption = <optimized out>
listen_addr_saved = true
output_config_variable = <optimized out>
__func__ = "PostmasterMain"
#31 0x0000555fe837f880 in main (argc=5, argv=0x55601ae7c310) at ./build/../src/backend/main/main.c:227
do_check_root = <optimized out>
dispatch_option = <optimized out>
On Fri, 10 Oct 2025 at 14:34, Yuri Zamyatin <yuri@yrz.am> wrote:
> Hi. I was able to reproduce the crash with the simpler (non-hashagg) plan from the previous message.
> Basically, I launched it in multiple infinite loops that do BEGIN - UPDATE - ROLLBACK. Other clients could also modify the tables during this time.
> We've seen this query crash on multiple physical hosts.
> #0 0x0000555fe8678300 in PartitionDirectoryLookup (pdir=0x0, rel=0x7f14d172b288) at ./build/../src/backend/partitioning/partdesc.c:462
>         pde = <optimized out>
>         relid = 21856
>         found = 27
Looks like that's crashing in a different place from the last
backtrace you showed.
Are you able to test this without any extensions loaded to see if you
still get a crash?
At a wild guess, perhaps an extension has gone rogue and spawned
another thread resulting in something like concurrent palloc requests
getting confused and causing something strange to happen when
accessing certain palloc'd chunks. Running without extensions may help
narrow things down.
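For example, something like this would show what is loaded (mydb is a
placeholder):

psql -c "SHOW shared_preload_libraries;"
psql -c "SHOW session_preload_libraries;"
psql -d mydb -c "SELECT extname, extversion FROM pg_extension;"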
> postgresql_effective_cache_size = 560GB
> postgresql_shared_buffers = 225GB
Which extension are these GUCs from?
David
Hi.
Hash aggregation.
Setting enable_hashagg=off fixed all the crashes in the subset of queries
that used hash aggregation (tts_minimal_store_tuple).
I'm sure about that since I did some A/B testing over a timeframe of a few days.
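For reference, the toggle can be set per session or persisted per database,
e.g. (a sketch; mydb is a placeholder):

psql -c "SET enable_hashagg = off;"                      # affects that session only
psql -c "ALTER DATABASE mydb SET enable_hashagg = off;"  # persists for new sessions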
Configuration.
> postgresql_effective_cache_size = 560GB
> postgresql_shared_buffers = 225GB
> Which extension are these GUCs from?
It was a typo in my message, sorry about that.
In postgresql.conf they're stated correctly as effective_cache_size and shared_buffers.
Partition lookup.
> Are you able to test this without any extensions loaded to see if you still get a crash?
Yes, reproduced it on a machine with a clean Debian 13 & PostgreSQL 18 install
from the sources mentioned earlier. The only extension loaded is plpgsql. Changes to
postgresql.conf: max_connections=1000, work_mem=2000MB, shared_buffers=10GB, max_wal_size=10GB.
I pg_restore'd two tables (+partitions) from production into a clean cluster and created
indexes manually. The partitioned table is 2.2TB in size. I hope to narrow things down
and provide better reproduction steps tomorrow.
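Such a copy can be done along these lines (a sketch; connection strings and
the dump path are placeholders):

# Parallel directory-format dump of the two tables and their partitions
pg_dump -Fd -j 8 -d "$PROD_DSN" -t tcv_scenes -t 'tcv_scene_datas*' -f /tmp/repro_dump
pg_restore -j 8 -d "$TEST_DSN" --no-owner /tmp/repro_dump
# indexes were then created manually, as noted above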
To cause the segfault, these queries were launched simultaneously (a launcher
sketch follows the SQL):
-- in 2 parallel infinite loops
with ids as (select (118998526-random()*100000)::int id from generate_series(1,10000))
update tcv_scene_datas set id=id where cv_scene_id in(select id from ids);
with ids as (select (118998526-random()*100000)::int id from generate_series(1,10000))
update tcv_scenes set id=id where id in(select id from ids);
-- in 10 parallel infinite loops
set jit = off;
begin;
-- <EXPLAIN to save query plan>
update Tcv_scenes cs -- CRASHES
set
state_id=2,
stitching_server_id=null,
stitching_server_pid=null
from
tcv_scene_datas cd -- partitioned
where
cd.cv_scene_id=cs.id and
(
(state_id=7 and date_cr<now()-interval '24 hours' and date_state_mo>now()-interval '15 minutes' and cd.stitcher_result::text like '%download%') or
(state_id=3 and date_state_mo<now()-interval '5 minutes')
)
returning cs.id;
rollback;
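A launcher for those loops might look like this (a sketch; upd1.sql, upd2.sql
and crash.sql are hypothetical files holding the three batches above,
crash.sql being the set jit / begin / update / rollback one):

# mydb is a placeholder; loop counts follow the comments in the SQL above
for i in 1 2; do
    ( while true; do psql -d mydb -f upd1.sql >/dev/null; done ) &
    ( while true; do psql -d mydb -f upd2.sql >/dev/null; done ) &
done
for i in $(seq 10); do
    ( while true; do psql -d mydb -f crash.sql >/dev/null; done ) &
done
wait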
The plan stayed the same. The stack trace also looks the same (now without
the EXPLAIN ANALYZE part):
#0 0x000055987ffde300 in PartitionDirectoryLookup (pdir=0x0, rel=0x7f3069be8c98) at ./build/../src/backend/partitioning/partdesc.c:462
pde = <optimized out>
relid = 0
found = false
#1 0x000055987febeb51 in InitExecPartitionPruneContexts (prunestate=<optimized out>, parent_plan=0x55988a30b7d0, initially_valid_subplans=<optimized out>,
n_total_subplans=<optimized out>) at ./build/../src/backend/executor/execPartition.c:2413
partkey = 0x5598892dd5a0
partdesc = <optimized out>
pprune = <optimized out>
nparts = 239
k = <optimized out>
prunedata = 0x559889a8f7b8
j = <optimized out>
estate = <optimized out>
new_subplan_indexes = <optimized out>
new_other_subplans = <optimized out>
i = 0
newidx = <optimized out>
fix_subplan_map = <optimized out>
estate = <optimized out>
new_subplan_indexes = <optimized out>
new_other_subplans = <error reading variable new_other_subplans (Cannot access memory at address 0x0)>
i = <optimized out>
newidx = <optimized out>
fix_subplan_map = <optimized out>
prunedata = <error reading variable prunedata (Cannot access memory at address 0x0)>
j = <optimized out>
pprune = <optimized out>
nparts = <optimized out>
k = <optimized out>
partkey = <optimized out>
partdesc = <optimized out>
oldidx = <optimized out>
subidx = <optimized out>
subprune = <optimized out>
#2 ExecInitPartitionExecPruning (planstate=planstate@entry=0x55988a30b7d0, n_total_subplans=<optimized out>, part_prune_index=<optimized out>, relids=<optimized out>,
initially_valid_subplans=initially_valid_subplans@entry=0x7ffefb55a2e0) at ./build/../src/backend/executor/execPartition.c:1934
prunestate = <optimized out>
estate = <optimized out>
pruneinfo = <optimized out>
__func__ = "ExecInitPartitionExecPruning"
#3 0x000055987fed1030 in ExecInitAppend (node=node@entry=0x7f2dcff14420, estate=estate@entry=0x55988a309b08, eflags=eflags@entry=0)
at ./build/../src/backend/executor/nodeAppend.c:147
prunestate = <optimized out>
appendstate = 0x55988a30b7d0
appendplanstates = <optimized out>
appendops = <optimized out>
validsubplans = 0x55988a325470
asyncplans = <optimized out>
nplans = <optimized out>
nasyncplans = <optimized out>
firstvalid = <optimized out>
i = <optimized out>
j = <optimized out>
#4 0x000055987febfad5 in ExecInitNode (node=0x7f2dcff14420, estate=estate@entry=0x55988a309b08, eflags=0) at ./build/../src/backend/executor/execProcnode.c:182
result = <optimized out>
subps = <optimized out>
l = <optimized out>
__func__ = "ExecInitNode"
#5 0x000055987feea383 in ExecInitNestLoop (node=node@entry=0x7f2dd0006f28, estate=estate@entry=0x55988a309b08, eflags=<optimized out>, eflags@entry=0)
at ./build/../src/backend/executor/nodeNestloop.c:301
nlstate = 0x55988a30a7b0
__func__ = "ExecInitNestLoop"
#6 0x000055987febf8f1 in ExecInitNode (node=node@entry=0x7f2dd0006f28, estate=estate@entry=0x55988a309b08, eflags=eflags@entry=0)
at ./build/../src/backend/executor/execProcnode.c:298
result = <optimized out>
subps = <optimized out>
l = <optimized out>
__func__ = "ExecInitNode"
#7 0x000055987feba80f in EvalPlanQualStart (epqstate=0x559889383d08, planTree=0x7f2dd0006f28) at ./build/../src/backend/executor/execMain.c:3152
parentestate = <optimized out>
oldcontext = 0x559889a8ea20
rtsize = <optimized out>
rcestate = 0x55988a309b08
l = <optimized out>
parentestate = <optimized out>
rtsize = <optimized out>
rcestate = <optimized out>
oldcontext = <optimized out>
l = <optimized out>
i = <optimized out>
l__state = <optimized out>
subplan = <optimized out>
subplanstate = <optimized out>
l__state = <optimized out>
earm = <optimized out>
l__state = <optimized out>
rtindex = <optimized out>
#8 EvalPlanQualBegin (epqstate=epqstate@entry=0x559889383d08) at ./build/../src/backend/executor/execMain.c:2930
parentestate = <optimized out>
recheckestate = <optimized out>
#9 0x000055987feba9ab in EvalPlanQual (epqstate=0x559889383d08, relation=relation@entry=0x7f3069be4710, rti=1, inputslot=inputslot@entry=0x559889fc01f0)
at ./build/../src/backend/executor/execMain.c:2650
slot = <optimized out>
testslot = <optimized out>
#10 0x000055987fee601d in ExecUpdate (context=context@entry=0x7ffefb55a5b0, resultRelInfo=resultRelInfo@entry=0x559889383e28, tupleid=tupleid@entry=0x7ffefb55a58a,
oldtuple=oldtuple@entry=0x0, oldSlot=<optimized out>, oldSlot@entry=0x559889fb3178, slot=slot@entry=0x559889fb3580, canSetTag=true)
at ./build/../src/backend/executor/nodeModifyTable.c:2606
inputslot = 0x559889fc01f0
epqslot = <optimized out>
lockedtid = {ip_blkid = {bi_hi = 31, bi_lo = 16528}, ip_posid = 88}
estate = 0x559889a8eb18
resultRelationDesc = <optimized out>
updateCxt = {crossPartUpdate = false, updateIndexes = TU_None, lockmode = LockTupleNoKeyExclusive}
result = <optimized out>
__func__ = "ExecUpdate"
#11 0x000055987fee7fff in ExecModifyTable (pstate=0x559889383c20) at ./build/../src/backend/executor/nodeModifyTable.c:4510
node = 0x559889383c20
context = {mtstate = 0x559889383c20, epqstate = 0x559889383d08, estate = 0x559889a8eb18, planSlot = 0x559889fac818, tmfd = {ctid = {ip_blkid = {bi_hi = 31,
bi_lo = 16528}, ip_posid = 62}, xmax = 17203, cmax = 4294967295, traversed = true}, cpDeletedSlot = 0x0, cpUpdateReturningSlot = 0x7ffefb55a620}
estate = 0x559889a8eb18
operation = CMD_UPDATE
resultRelInfo = 0x559889383e28
subplanstate = <optimized out>
slot = 0x559889fb3580
oldSlot = 0x559889fb3178
tuple_ctid = {ip_blkid = {bi_hi = 31, bi_lo = 16528}, ip_posid = 62}
oldtupdata = {t_len = 240, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 50744}, t_tableOid = 402685741, t_data = 0x559889facf00}
oldtuple = 0x0
tupleid = <optimized out>
tuplock = false
__func__ = "ExecModifyTable"
#12 0x000055987feb791b in ExecProcNode (node=0x559889383c20) at ./build/../src/include/executor/executor.h:315
No locals.
#13 ExecutePlan (queryDesc=0x5598892beb88, operation=CMD_UPDATE, sendTuples=true, numberTuples=0, direction=<optimized out>, dest=0x5598892beb00)
at ./build/../src/backend/executor/execMain.c:1697
estate = 0x559889a8eb18
use_parallel_mode = <optimized out>
slot = <optimized out>
planstate = 0x559889383c20
current_tuple_count = 536
estate = <optimized out>
planstate = <optimized out>
use_parallel_mode = <optimized out>
slot = <optimized out>
current_tuple_count = <optimized out>
#14 standard_ExecutorRun (queryDesc=0x5598892beb88, direction=<optimized out>, count=0) at ./build/../src/backend/executor/execMain.c:366
estate = 0x559889a8eb18
operation = CMD_UPDATE
dest = 0x5598892beb00
sendTuples = <optimized out>
oldcontext = 0x5598892be8f0
#15 0x0000559880092774 in ProcessQuery (plan=0x7f2dd001cb00,
sourceText=0x559889170158 "update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 and date_cr<now()-interval '24 hou"..., params=0x0, queryEnv=0x0, dest=0x5598892beb00, qc=0x7ffefb55a780)
at ./build/../src/backend/tcop/pquery.c:161
queryDesc = 0x5598892beb88
#16 0x0000559880093421 in PortalRunMulti (portal=portal@entry=0x559889267908, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=true,
dest=dest@entry=0x5598892beb00, altdest=0x559880538ec0 <donothingDR>, qc=qc@entry=0x7ffefb55a780) at ./build/../src/backend/tcop/pquery.c:1272
pstmt = 0x7f2dd001cb00
stmtlist_item__state = {l = 0x7f2dd00225c0, i = 0}
active_snapshot_set = true
stmtlist_item = 0x7f2dd00225d8
#17 0x000055988009358f in FillPortalStore (portal=portal@entry=0x559889267908, isTopLevel=isTopLevel@entry=true) at ./build/../src/backend/tcop/pquery.c:1021
treceiver = 0x5598892beb00
qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
__func__ = "FillPortalStore"
#18 0x000055988009396f in PortalRun (portal=portal@entry=0x559889267908, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
dest=dest@entry=0x7f2dd000be20, altdest=altdest@entry=0x7f2dd000be20, qc=qc@entry=0x7ffefb55a990) at ./build/../src/backend/tcop/pquery.c:760
_save_exception_stack = 0x7ffefb55ac70
_save_context_stack = 0x0
_local_sigjmp_buf = {{__jmpbuf = {94113624389896, 9025151889858365403, 94113630543472, 140733115115920, 139834739965472, 94113630543512, 9025151889659135963,
3022814109072099291}, __mask_was_saved = 0, __saved_mask = {__val = {0, 140728898420737, 94114140538477, 94113624401432, 94113473390454, 140733115115744,
94113624389896, 94113473390454, 1, 139834740057536, 94113630543512, 140733115115808, 94113473059350, 140733115115808, 2, 140733115115808}}}}
_do_rethrow = <optimized out>
result = <optimized out>
nprocessed = <optimized out>
saveTopTransactionResourceOwner = 0x5598892157f8
saveTopTransactionContext = 0x55988927c5c0
saveActivePortal = 0x0
saveResourceOwner = 0x5598892157f8
savePortalContext = 0x0
saveMemoryContext = 0x55988927c5c0
__func__ = "PortalRun"
#19 0x000055988008f668 in exec_simple_query (
query_string=0x559889170158 "update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 and date_cr<now()-interval '24 hou"...) at ./build/../src/backend/tcop/postgres.c:1273
cmdtaglen = 6
snapshot_set = <optimized out>
per_parsetree_context = 0x0
plantree_list = <optimized out>
parsetree = 0x559889845e70
commandTag = <optimized out>
qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
querytree_list = <optimized out>
portal = 0x559889267908
receiver = 0x7f2dd000be20
format = 0
cmdtagname = <optimized out>
parsetree_item__state = {l = 0x559889845e98, i = 0}
dest = DestRemote
oldcontext = 0x55988927c5c0
parsetree_list = 0x559889845e98
parsetree_item = 0x559889845eb0
save_log_statement_stats = false
was_logged = false
use_implicit_block = false
msec_str = "\340yW\200\230U\000\000Q\000\000\000\000\000\000\000ЫU\373\376\177\000\000\004\000\000\000\000\000\000"
__func__ = "exec_simple_query"
#20 0x000055988009156d in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4766
query_string = 0x559889170158 "update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 and date_cr<now()-interval '24 hou"...
firstchar = <optimized out>
input_message = {
data = 0x559889170158 "update Tcv_scenes cs\nset\n\tstate_id=2,\n\tstitching_server_id=null,\n\tstitching_server_pid=null\nfrom\n\ttcv_scene_datas cd\nwhere\n\tcd.cv_scene_id=cs.id and\n\t(\n\t\t(state_id=7 and date_cr<now()-interval '24 hou"..., len = 381, maxlen = 1024, cursor = 381}
local_sigjmp_buf = {{__jmpbuf = {140733115116464, 9025151890000971739, 293935616, 4, 0, 1, 9025151889751410651, 3022814102951954395}, __mask_was_saved = 1,
__saved_mask = {__val = {4194304, 143360, 12259252146692762112, 16, 132672, 18446744073709551312, 132656, 0, 8290, 139845915581216, 139845914247356,
94113473250832, 139845913448464, 2047, 18446744073709551312, 94113623012352}}}}
send_ready_for_query = false
idle_in_transaction_timeout_enabled = false
idle_session_timeout_enabled = false
__func__ = "PostgresMain"
#21 0x000055988008ba33 in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ./build/../src/backend/tcop/backend_startup.c:124
bsdata = <optimized out>
#22 0x000055987ffe9cfd in postmaster_child_launch (child_type=B_BACKEND, child_slot=6, startup_data=startup_data@entry=0x7ffefb55ae90,
startup_data_len=startup_data_len@entry=24, client_sock=client_sock@entry=0x7ffefb55aeb0) at ./build/../src/backend/postmaster/launch_backend.c:290
pid = <optimized out>
#23 0x000055987ffed802 in BackendStartup (client_sock=0x7ffefb55aeb0) at ./build/../src/backend/postmaster/postmaster.c:3587
bn = 0x5598891e6700
pid = <optimized out>
startup_data = {canAcceptConnections = CAC_OK, socket_created = 813767320273791, fork_started = 813767320273794}
cac = <optimized out>
bn = <optimized out>
pid = <optimized out>
startup_data = <optimized out>
cac = <optimized out>
__func__ = "BackendStartup"
__errno_location = <optimized out>
save_errno = <optimized out>
__errno_location = <optimized out>
__errno_location = <optimized out>
#24 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1702
s = {sock = 10, raddr = {addr = {ss_family = 1,
__ss_padding = "\2230\"\026\237\276\000\000\000\000\000\000\000\000\v\203\024\211\230U\000\000\000\000\000\000\000\000\000\000 \257U\373\376\177\000\000\360\256U\373\376\177\000\000\000\004\000\000\000\000\000\000\000\203\024\211\230U\000\000\213\331\036\200\230U", '\000' <repeats 18 times>, "@\257U\373\376\177\000\000\236(&\200\230U\000\000\000\000\000\000\000\000\000\000\255\226\tj0\177\000", __ss_align = 1}, salen = 2}}
i = 0
now = <optimized out>
last_lockfile_recheck_time = 1760452076
last_touch_time = 1760450375
events = {{pos = 3, events = 2, fd = 8, user_data = 0x0}, {pos = 0, events = 0, fd = 8, user_data = 0x0}, {pos = -1995336688, events = 21912, fd = 0,
user_data = 0x559880252a37}, {pos = 0, events = 0, fd = -1995336440, user_data = 0x40000000069}, {pos = 0, events = 21912, fd = -1995144437,
user_data = 0x0}, {pos = 126953984, events = 2854329568, fd = -1995328432, user_data = 0x55988058ef60 <errordata>}, {pos = -78270480, events = 32766,
fd = -2145478054, user_data = 0xf}, {pos = 0, events = 0, fd = -78270400, user_data = 0x0}, {pos = -78270400, events = 32766, fd = -1994499072,
user_data = 0x559880252a37}, {pos = -2141748608, events = 21912, fd = 0, user_data = 0x559880234239 <pg_freeaddrinfo_all+73>}, {pos = 8, events = 0,
fd = -78270160, user_data = 0x7ffefb55b970}, {pos = 2146491570, events = 21912, fd = 0, user_data = 0x153800000000}, {pos = -1994974088, events = 21912,
fd = -78270160, user_data = 0x7ffefb55b08c}, {pos = 1, events = 1, fd = -1994516715, user_data = 0x1891e171c}, {pos = -1994499072, events = 21912,
fd = -1994516682, user_data = 0x100000001}, {pos = 1, events = 0, fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = 0, user_data = 0x7f0032333435}, {
pos = -2145100672, events = 21912, fd = -1994516568, user_data = 0x5598891e17b2}, {pos = -1994516541, events = 21912, fd = -1994516531,
user_data = 0x5598891e17de}, {pos = -1994516506, events = 21912, fd = -1994516443, user_data = 0x5598891e182b}, {pos = -1994516425, events = 21760,
fd = 1782978392, user_data = 0x6e75722f7261762f}, {pos = 1936683055, events = 1701996404, fd = 795636083, user_data = 0x3334352e4c515347}, {
pos = -2145058766, events = 21912, fd = -2145015462, user_data = 0x7ffefb55b7a0}, {pos = -2145100583, events = 21912, fd = 0, user_data = 0x0}, {
pos = -78268464, events = 32766, fd = -2145100753, user_data = 0x7ffefb55b7e0}, {pos = 2382895, events = 0, fd = 0, user_data = 0x0}, {pos = 0, events = 0,
fd = 1782978392, user_data = 0x0}, {pos = 1779089596, events = 32560, fd = 0, user_data = 0x5598891e38b0}, {pos = 0, events = 0, fd = 1782978392,
user_data = 0x8}, {pos = 1780423360, events = 32560, fd = 255, user_data = 0xfffffffffffffed0}, {pos = 0, events = 0, fd = 399,
user_data = 0x5598891e5420}, {pos = 1779088954, events = 32560, fd = -1994499536, user_data = 0x570}, {pos = 0, events = 0, fd = 10, user_data = 0x0}, {
pos = 1780423360, events = 32560, fd = 255, user_data = 0xfffffffffffffed0}, {pos = 1780416464, events = 32560, fd = 8,
user_data = 0x7f306a1effd0 <_IO_file_jumps>}, {pos = 1779092818, events = 32560, fd = 2996, user_data = 0x5598891e5420}, {pos = 4096, events = 0,
fd = -78269312, user_data = 0x7f306a1effd0 <_IO_file_jumps>}, {pos = 1778941388, events = 32560, fd = 25, user_data = 0x21fa}, {pos = 1, events = 0,
fd = 33152, user_data = 0x68}, {pos = 0, events = 0, fd = 1, user_data = 0x100000000}, {pos = 2, events = 17, fd = 0, user_data = 0x3}, {pos = 0,
events = 1, fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = 0, user_data = 0x0}, {pos = 0, events = 0,
fd = 0, user_data = 0x0}, {pos = 0, events = 0, fd = -1994501088, user_data = 0x5598891e5420}, {pos = 8, events = 0, fd = 1780423360, user_data = 0x802}, {
pos = -304, events = 4294967295, fd = 5, user_data = 0x55988024bdd6}, {pos = -2145008860, events = 21912, fd = 1779092818, user_data = 0x7ffefb55b420}, {
pos = -2145095639, events = 21912, fd = 32768, user_data = 0x9}, {pos = -78269152, events = 32766, fd = 1779329413, user_data = 0x7f0000000000}, {pos = 9,
events = 0, fd = -78269184, user_data = 0x7f306a0e69ff}, {pos = 2346, events = 0, fd = 16318532, user_data = 0x5598891fed30}, {pos = 126953984,
events = 2854329568, fd = -1994363536, user_data = 0x9}, {pos = -1994396368, events = 21912, fd = -360, user_data = 0x9}, {pos = -1994396368,
events = 21912, fd = -360, user_data = 0x7f306a0ad3c0 <free+384>}, {pos = 926193527, events = 0, fd = 1760102962, user_data = 0x37349777}, {pos = 0,
events = 0, fd = 0, user_data = 0x9}, {pos = -78269184, events = 32766, fd = -78269152, user_data = 0x5598891fed40}, {pos = -2145075754, events = 21912,
fd = -2145008860, user_data = 0x7f306a0e678d <closedir+13>}, {pos = -1995282496, events = 21912, fd = -2147108231, user_data = 0x5598891fed40}, {
pos = -2145075649, events = 21912, fd = -78268032, user_data = 0x55988005f0c8 <RemovePgTempFiles+312>}, {pos = -2141655518, events = 21912, fd = -78268820,
user_data = 0x7367702f65736162}, {pos = 1952410737, events = 28781, fd = 771766842, user_data = 0x7f306a0abe3a}}
nevents = <optimized out>
__func__ = "ServerLoop"
#25 0x000055987ffef110 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x5598891172e0) at ./build/../src/backend/postmaster/postmaster.c:1400
opt = <optimized out>
status = <optimized out>
userDoption = <optimized out>
listen_addr_saved = true
output_config_variable = <optimized out>
__func__ = "PostmasterMain"
#26 0x000055987fce5880 in main (argc=5, argv=0x5598891172e0) at ./build/../src/backend/main/main.c:227
do_check_root = <optimized out>
dispatch_option = <optimized out>
--
Yuri Zamyatin
On Wed, 15 Oct 2025 at 04:51, Yuri Zamyatin <yuri@yrz.am> wrote:
To cause the segfault, these queries were launched simultaneously.
-- in 2 parallel infinite loops
with ids as (select (118998526-random()*100000)::int id from generate_series(1,10000))
update tcv_scene_datas set id=id where cv_scene_id in(select id from ids);
with ids as (select (118998526-random()*100000)::int id from generate_series(1,10000))
update tcv_scenes set id=id where id in(select id from ids);
Are you able to mock this up using the schema and some test data, then
share the script to populate the database?
David
I found a much easier way.
create.sql:
drop database if exists segtest;
create database segtest;
\c segtest;
create table tcv_scene_datas(cv_scene_id bigint primary key) partition by range (cv_scene_id);
do $$
declare
i int;
range_start bigint;
range_end bigint;
partition_name text;
begin
for i in 0..100 loop
range_start := 1 + (i * 10000);
range_end := range_start + 10000;
partition_name := 'tcv_scene_datas_' || LPAD(i::TEXT, 3, '0');
execute format(
'create table %I partition of tcv_scene_datas for values from (%s) to (%s)',
partition_name,
range_start,
range_end
);
end loop;
end $$;
insert into tcv_scene_datas(cv_scene_id) select id from generate_series(1,1_000_000) id;
crash.sql:
\c segtest
with ids as (select (random()*1_000_000)::int id from generate_series(1,1000))
update tcv_scene_datas set cv_scene_id=cv_scene_id where cv_scene_id in(select id from ids);
Launch crash.sql in 16 threads of infinite loops:
seq 16 | xargs -P 16 -I {} sh -c 'while true; do psql -f crash.sql; done'
In 1-2 minutes, 5 processes died with segfault.
I also expected deadlocks from such a query; strangely, the database did not report any.
Let me know if you need more data.
--
Best wishes, Yuri
On Thu, 16 Oct 2025 at 03:21, Yuri Zamyatin <yuri@yrz.am> wrote:
In 1-2 minutes, 5 processes died with segfault.
Perfect. Thank you.
It seems to be some more forgotten EPQ stuff from d47cbf474. Amit got
some of these in 8741e48e5, but evidently the test case didn't do
pruning during execution (only init-plan pruning), so the partition
directory wasn't needed.
The attached seems to fix it for me.
David
Attachments:
set_EPQs_es_partition_directory.patch (application/octet-stream, +10 -0)
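The fix itself is in the attachment; as a rough sketch of the shape such a change could take, inferred only from the stack trace above (PartitionDirectoryLookup() is reached with pdir = 0x0, i.e. the EPQ recheck EState has no partition directory), something along these lines would run before the recheck plan tree is initialized:

/*
 * Sketch only, not the attached patch: give the EState used for
 * EvalPlanQual rechecks a partition directory before initializing plan
 * nodes that can prune partitions at executor startup, so that
 * PartitionDirectoryLookup() is never reached with a NULL directory.
 */
if (estate->es_partition_directory == NULL)
	estate->es_partition_directory =
		CreatePartitionDirectory(estate->es_query_cxt, false);

Whether the real patch creates a fresh directory or hands down the parent estate's is a detail of the attachment.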
On Thu, 16 Oct 2025 at 11:45, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 16 Oct 2025 at 03:21, Yuri Zamyatin <yuri@yrz.am> wrote:
In 1-2 minutes, 5 processes died with segfault.
Perfect. Thank you.
It seems to be some more forgotten EPQ stuff from d47cbf474. Amit got
some of these in 8741e48e5, but evidently the test case didn't do
pruning during execution (only init-plan pruning), so the partition
directory wasn't needed.
I forgot to mention, this isn't the same thing as the
tts_minimal_store_tuple() issue you first reported, so if there is a
problem there, this one has nothing to do with it.
Any chance of a self-contained test case for the enable_hashagg=on crash?
David
Thank you very much.
I just tested pruning on the original case with the patch
you sent and can confirm the segfaults went away.
Regarding hash aggregation, I'll try to find a test case and
follow up within a day or so (cloning a huge db right now).
On Thu, Oct 16, 2025 at 7:51 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 16 Oct 2025 at 11:45, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 16 Oct 2025 at 03:21, Yuri Zamyatin <yuri@yrz.am> wrote:
In 1-2 minutes, 5 processes died with segfault.
Thanks Yuri for the report and the test case.
Perfect. Thank you.
It seems to be some more forgotten EPQ stuff from d47cbf474. Amit got
some of these in 8741e48e5, but evidently the test case didn't do
pruning during execution (only init-plan pruning), so the partition
directory wasn't needed.
I forgot to mention, this isn't the same thing as the
tts_minimal_store_tuple() issue you first reported, so if there is a
problem there, this one has nothing to do with it.
Thanks again, David.
I've attached an updated patch with a test case.
--
Thanks, Amit Langote
Attachments:
v2-0001-Fix-EPQ-crash-from-missing-partition-directory-in.patch (application/octet-stream, +19 -1)
On Thu, 16 Oct 2025 at 16:29, Amit Langote <amitlangote09@gmail.com> wrote:
I've attached an updated patch with a test case.
Looks good to me. Nice simple test.
David
On Thu, Oct 16, 2025 at 1:04 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 16 Oct 2025 at 16:29, Amit Langote <amitlangote09@gmail.com> wrote:
I've attached an updated patch with a test case.
Looks good to me. Nice simple test.
Thanks for checking. Pushed.
--
Thanks, Amit Langote
On Thu, 2025-10-16 at 11:51 +1300, David Rowley wrote:
I forgot to mention, this isn't the same thing as the
tts_minimal_store_tuple() issue you first reported, so if there is a
problem there, this one has nothing to do with it.
I investigated but have come up empty so far. Any additional info on the
hashagg crash would be appreciated.
I appended my raw notes below in case someone notices a mistake.
Regards,
Jeff Davis
Raw notes:
* Somehow entry->firstTuple==0x1b, which is obviously wrong.
* The entry structure lives in the bucket array, allocated in the
metacxt using MCXT_ALLOC_ZERO, so there's no uninitialized memory
floating around in the bucket array.
* The metacxt (aggstate->hash_metacxt) is an AllocSet, and it's never
reset. It contains the bucket array as well as some ExprStates and an
ExprContext for evaluating hash functions.
* Hash entries are never deleted, but between batches the entire hash
table is reset (which memsets the entire bucket array to zero).
* The entry->firstTuple is assigned only in one place, from
ExecCopySlotMinimalTupleExtra(). The 'extra' argument is a multiple of
16.
* ExecCopySlotMinimalTupleExtra() does some interesting pointer math,
but I didn't find any path that could plausibly return something like
0x1b. The memory is allocated with palloc/palloc0, which cannot return
zero, and 0x1b is not a multiple of 16 so seems unrelated to the extra
argument.
* JIT does not seem to be involved, because it's going through
ExecInterpExpr().
* When the hash table grows, it invalidates previously-returned entry
pointers. But, given the site of the crash, I don't see that as a
problem in this case.
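For orientation, the entry under discussion is the executor's TupleHashEntryData; an editorial sketch of its layout, consistent with the gdb dump further down (the authoritative definition is in src/include/nodes/execnodes.h):

typedef struct TupleHashEntryData
{
	/* editorial comments; see execnodes.h for the real definition */
	MinimalTuple firstTuple;	/* copy of the first tuple in this group */
	uint32		status;			/* simplehash entry status */
	uint32		hash;			/* cached hash value */
} TupleHashEntryData;

Since firstTuple is a palloc'd pointer, a value like 0x1b cannot point at a valid minimal tuple.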
Haven't managed to reproduce it consistently so far.
Perhaps we could use the existing dump to investigate.
I'm unfamiliar with Postgres internals. If it would be useful,
could you walk me through the structures I need to capture?
(gdb) frame 2
#2 0x0000555fe8566ec2 in agg_retrieve_hash_table_in_memory (aggstate=aggstate@entry=0x55601c7567d0) at ./build/../src/include/executor/executor.h:176
176 in ./build/../src/include/executor/executor.h
(gdb) print *perhash
$83 = {hashtable = 0x55601c182ac8, hashiter = {cur = 10, end = 43, done = false}, hashslot = 0x55601c765bb0, hashfunctions = 0x55601c765b28,
eqfuncoids = 0x55601c765b18, numCols = 2, numhashGrpCols = 2, largestGrpColIdx = 4, hashGrpColIdxInput = 0x55601c7659f0, hashGrpColIdxHash = 0x55601c765a00,
aggnode = 0x55601eba19f8}
(gdb) print *hashtable->hashtab
$84 = {size = 16, members = 4, sizemask = 15, grow_threshold = 14, data = 0x55601c182b98, ctx = 0x55601c1819b0, private_data = 0x55601c182ac8}
(gdb) print *entry
$86 = {firstTuple = 0x1b, status = 1, hash = 21856}
Does it look suspicious?
perhash->hashiter.end = 43 and hashtable->hashtab->size = 16; 43 - 16 = 27 = 0x1b.
Some more details from that step:
(gdb) info locals
hashslot = 0x55601c765bb0
hashtable = 0x55601c182ac8
i = <optimized out>
econtext = 0x55601c756f00
peragg = 0x55601c765198
pergroup = <optimized out>
entry = 0x55601c182e48
firstSlot = 0x55601c763e48
result = <optimized out>
perhash = 0x55601c764e50
(gdb) print *aggstate
$87 = {ss = {ps = {type = T_AggState, plan = 0x55601eb9bb18, state = 0x55601b1a60a8, ExecProcNode = 0x555fe8567890 <ExecAgg>,
ExecProcNodeReal = 0x555fe8567890 <ExecAgg>, instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0, qual = 0x0, lefttree = 0x55601c757008,
righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleDesc = 0x55601c763f50, ps_ResultTupleSlot = 0x55601c764758,
ps_ExprContext = 0x55601c756f00, ps_ProjInfo = 0x55601c764860, async_capable = false, scandesc = 0x55601c762fa0, scanops = 0x555fe8bd0f20 <TTSOpsVirtual>,
outerops = 0x555fe8bd0f20 <TTSOpsVirtual>, innerops = 0x0, resultops = 0x555fe8bd0f20 <TTSOpsVirtual>, scanopsfixed = true, outeropsfixed = true,
inneropsfixed = false, resultopsfixed = true, scanopsset = true, outeropsset = true, inneropsset = false, resultopsset = true}, ss_currentRelation = 0x0,
ss_currentScanDesc = 0x0, ss_ScanTupleSlot = 0x55601c763e48}, aggs = 0x55601c7628a8, numaggs = 1, numtrans = 1, aggstrategy = AGG_HASHED,
aggsplit = AGGSPLIT_SIMPLE, phase = 0x55601c764d70, numphases = 1, current_phase = 0, peragg = 0x55601c765198, pertrans = 0x55601c765220,
hashcontext = 0x55601c756df8, aggcontexts = 0x55601c756bd8, tmpcontext = 0x55601c756be8, curaggcontext = 0x55601c756df8, curperagg = 0x0,
curpertrans = 0x55601c765220, input_done = false, agg_done = false, projected_set = -1, current_set = 1, grouped_cols = 0x55601c765040,
all_grouped_cols = 0x55601c7650b8, colnos_needed = 0x55601c7656b0, max_colno_needed = 9, all_cols_needed = true, maxsets = 1, phases = 0x55601c764d70, sort_in = 0x0,
sort_out = 0x0, sort_slot = 0x0, pergroups = 0x0, grp_firstTuple = 0x0, table_filled = true, num_hashes = 4, hash_metacxt = 0x55601c1819b0,
hash_tablecxt = 0x55601c1839c0, hash_tapeset = 0x0, hash_spills = 0x0, hash_spill_rslot = 0x55601c765470, hash_spill_wslot = 0x55601c765578, hash_batches = 0x0,
hash_ever_spilled = false, hash_spill_mode = false, hash_mem_limit = 2147483648, hash_ngroups_limit = 10324440, hash_planned_partitions = 0,
hashentrysize = 138.26865671641792, hash_mem_peak = 81920, hash_ngroups_current = 67, hash_disk_used = 0, hash_batches_used = 1, perhash = 0x55601c764df8,
hash_pergroup = 0x55601c765428, all_pergroups = 0x55601c765428, shared_info = 0x0}
(gdb) print *hashslot
$88 = {type = T_TupleTableSlot, tts_flags = 24, tts_nvalid = 0, tts_ops = 0x555fe8bd0e20 <TTSOpsMinimalTuple>, tts_tupleDescriptor = 0x55601c765a10,
tts_values = 0x55601c765c20, tts_isnull = 0x55601c765c30, tts_mcxt = 0x55601b1a5fb0, tts_tid = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0},
tts_tableOid = 0}
(gdb) print *hashtable
$89 = {hashtab = 0x55601c182b50, numCols = 2, keyColIdx = 0x55601c765a00, tab_hash_expr = 0x55601c182eb0, tab_eq_func = 0x55601c1833e8,
tab_collations = 0x55601eba1b20, tablecxt = 0x55601c1839c0, tempcxt = 0x55601c17b980, additionalsize = 16, tableslot = 0x55601c182da8, inputslot = 0x55601c765bb0,
in_hash_expr = 0x55601c182eb0, cur_eq_func = 0x55601c1833e8, exprcontext = 0x55601c14b7f8}
(gdb) print *econtext
$90 = {type = T_ExprContext, ecxt_scantuple = 0x0, ecxt_innertuple = 0x0, ecxt_outertuple = 0x55601c763e48, ecxt_per_query_memory = 0x55601b1a5fb0,
ecxt_per_tuple_memory = 0x55601c1859d0, ecxt_param_exec_vals = 0x55601b42bcf0, ecxt_param_list_info = 0x55601b0f7a78, ecxt_aggvalues = 0x55601c762e88,
ecxt_aggnulls = 0x55601c765188, caseValue_datum = 0, caseValue_isNull = true, domainValue_datum = 0, domainValue_isNull = true, ecxt_oldtuple = 0x0,
ecxt_newtuple = 0x0, ecxt_estate = 0x55601b1a60a8, ecxt_callbacks = 0x0}
(gdb) print *peragg
$91 = {aggref = 0x55601eb9c170, transno = 0, finalfn_oid = 0, finalfn = {fn_addr = 0x0, fn_oid = 0, fn_nargs = 0, fn_strict = false, fn_retset = false,
fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, numFinalArgs = 1, aggdirectargs = 0x0, resulttypeLen = 4, resulttypeByVal = true,
shareable = false}
(gdb) print *firstSlot
$92 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 9, tts_ops = 0x555fe8bd0f20 <TTSOpsVirtual>, tts_tupleDescriptor = 0x55601c762fa0,
tts_values = 0x55601c763e90, tts_isnull = 0x55601c763ed8, tts_mcxt = 0x55601b1a5fb0, tts_tid = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0},
tts_tableOid = 0}
Thanks for doing all the extra debugging.
On Sat, 18 Oct 2025 at 09:09, Yuri Zamyatin <yuri@yrz.am> wrote:
$84 = {size = 16, members = 4, sizemask = 15, grow_threshold = 14, data = 0x55601c182b98, ctx = 0x55601c1819b0, private_data = 0x55601c182ac8}
(gdb) print *entry
$86 = {firstTuple = 0x1b, status = 1, hash = 21856}
Does it look suspicious?
perhash->hashiter.end = 43 and hashtable->hashtab->size = 16; 43 - 16 = 27 = 0x1b.
If that's the iterator for that hash table, then that's a big problem.
hashiter->end should never be >= hashtab->size. If that happens we'll
index over the end of the bucket array, and that might explain why the
firstTuple field is set to an invalid pointer.
Are you able to build with Asserts and try and get an Assert failure
with the attached patch?
If this fails then maybe we're using the wrong iterator somewhere in
nodeAgg.c. I can't see any other way for the iterator's 'end' field to
be bigger than the table's size.
David
Attachments:
add_asserts_to_simplehash_iterator_code.patch (application/octet-stream, +4 -0)
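The assertion patch isn't inlined above; a sketch of the kind of check it presumably adds to the simplehash iteration code (names follow the gdb output upthread, not the actual patch text):

/*
 * Sketch only: an iterator's cur and end are element indexes into the
 * table's bucket array, so both must stay below tb->size.  The dump
 * upthread shows end = 43 against size = 16.
 */
Assert(iter->cur < tb->size);
Assert(iter->end < tb->size);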
On Sat, 18 Oct 2025 at 10:25, David Rowley <dgrowleyml@gmail.com> wrote:
If this fails then maybe we're using the wrong iterator somewhere in
nodeAgg.c. I can't see any other way for the iterator's 'end' field to
be bigger than the table's size.
I started looking for places where this could happen and quickly found
the following code:
/*
* Switch to next grouping set, reinitialize, and restart the
* loop.
*/
select_current_set(aggstate, nextset, true);
perhash = &aggstate->perhash[aggstate->current_set];
ResetTupleHashIterator(hashtable, &perhash->hashiter);
The hash table and the iterator for each set are meant to be in the
same AggStatePerHash, but the above code moves to the next set,
changes the "perhash" then resets the next iterator using the previous
hash table.
I think that line needs to be:
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
David
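Spelled out against the snippet quoted above, the suggested one-line change makes the grouping-set switch read as follows (only the final line differs):

/*
 * Switch to next grouping set, reinitialize, and restart the
 * loop.
 */
select_current_set(aggstate, nextset, true);
perhash = &aggstate->perhash[aggstate->current_set];
/* reset the iterator paired with this set's own hash table */
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);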
Thank you for the patch. Will do.
It will probably take me 1-2 days to test
due to the pattern of crashes.
--
Yuri
Nice, should I still try to reproduce the bug
with assertions and the patch you provided?
On Sat, 18 Oct 2025 at 11:18, Yuri Zamyatin <yuri@yrz.am> wrote:
Nice, should I still try to reproduce the bug
with assertions and the patch you provided?
I think I've got it. I can get the Asserts in the patch to fail with:
drop table if exists ab;
create table ab (a int, b int);
insert into ab select x%2,x%99 from generate_series(1,1001)x;
select a,b,count(*) from ab group by grouping sets(b,a,(a,b),a,b);
David