BUG #16219: EvalPlanQualFetchRowMark segfaults on Updates
The following bug has been logged on the website:
Bug reference: 16219
Logged by: Julian Schauder
Email address: julian.schauder@gmx.de
PostgreSQL version: 12.1
Operating system: Ubuntu 18.04
Description:
Howdy Hackers,
for a week now (upgrade pg11=>12) we're experiencing a bug that causes
segfaults on a daily basis.
I've yet to reproduce this exact behaviour as the querys do not fail all the
time but rather now-and-then
when run concurrently. ( ~4 Crashes so far )
Outputs(identifiers) are anonymised.
2020-01-20 01:39:24.246 CET [2112] LOG: server process (PID 31599) was
terminated by signal 11: Segmentation fault
2020-01-20 01:39:24.246 CET [2112] DETAIL: Failed process was running:
UPDATE partitionedtable ia SET A=A, B=now() FROM temptable tt WHERE
ia.id=tt.ia_id AND A::text IS DISTINCT FROM tt.A::text;
2020-01-20 01:39:24.246 CET [2112] LOG: terminating any other active
server processes
#0 EvalPlanQualFetchRowMark (epqstate=0x5576bfbccfc0, rti=2,
slot=slot@entry=0x5576c888b6c0) at
./build/../src/backend/executor/execMain.c:2590
#1 0x00005576bdf449ce in ExecScanFetch (recheckMtd=0x5576bdf65740
<SeqRecheck>, accessMtd=0x5576bdf656b0 <SeqNext>, node=0x5576c888b500) at
./build/../src/backend/executor/execScan.c:115
#2 ExecScan (node=0x5576c888b500, accessMtd=0x5576bdf656b0 <SeqNext>,
recheckMtd=0x5576bdf65740 <SeqRecheck>) at
./build/../src/backend/executor/execScan.c:200
#3 0x00005576bdf63f7d in ExecProcNode (node=0x5576c888b500) at
./build/../src/include/executor/executor.h:239
#4 ExecNestLoop (pstate=<optimized out>) at
./build/../src/backend/executor/nodeNestloop.c:109
#5 0x00005576bdf3e9a9 in ExecProcNode (node=0x5576c888b310) at
./build/../src/include/executor/executor.h:239
#6 EvalPlanQualNext (epqstate=epqstate@entry=0x5576bfbccfc0) at
./build/../src/backend/executor/execMain.c:2695
#7 0x00005576bdf3ee77 in EvalPlanQual
(epqstate=epqstate@entry=0x5576bfbccfc0,
relation=relation@entry=0x7fbfd57df168, rti=42,
inputslot=inputslot@entry=0x5576cfb66578) at ./build/../src/backend/>
executor/execMain.c:2467
#8 0x00005576bdf61f5e in ExecUpdate
(mtstate=mtstate@entry=0x5576bfbccec8, tupleid=0x7ffdd552f89a, oldtuple=0x0,
slot=<optimized out>, planSlot=planSlot@entry=0x5576c153f428,
epqstate=epqstate@entry=0x5576bfbccfc0, estate=0x5576bf5df080,
canSetTag=true) at
./build/../src/backend/executor/nodeModifyTable.c:1387
#9 0x00005576bdf62df9 in ExecModifyTable (pstate=0x5576bfbccec8) at
./build/../src/backend/executor/nodeModifyTable.c:2223
#10 0x00005576bdf3c23d in ExecProcNode (node=0x5576bfbccec8) at
./build/../src/include/executor/executor.h:239
#11 ExecutePlan (execute_once=<optimized out>, dest=0x5576be63ec00
<donothingDR>, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_UPDATE,
use_parallel_mode=<optimized out>, planstate=0x5576bfbccec8,
estate=0x5576bf5df080) at
./build/../src/backend/executor/execMain.c:1646
#12 standard_ExecutorRun (queryDesc=0x5576bed15ae0, direction=<optimized
out>, count=0, execute_once=<optimized out>) at
./build/../src/backend/executor/execMain.c:364
#13 0x00007fc89b1a4125 in pgss_ExecutorRun (queryDesc=0x5576bed15ae0,
direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at
./build/../contrib/pg_stat_statements/pg_stat_statements.c:893
#14 0x00005576be0993b5 in ProcessQuery (plan=<optimized out>,
sourceText=0x5576bed159d0 "UPDATE partitionedtable ia SET A=A, B=now()
FROM temptable tt WHERE ia.id=tt.ia_id AND A::text IS DISTINCT FROM
tt.A::text"...,
params=0x0, queryEnv=0x0, dest=0x5576be63ec00 <donothingDR>,
completionTag=0x7ffdd552ff40 "") at
./build/../src/backend/tcop/pquery.c:161
#15 0x00005576be0995f3 in PortalRunMulti
(portal=portal@entry=0x5576bea978f0, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x5576be63ec00
<donothingDR>, dest@entry=0x5576bea00890,
altdest=0x5576be63ec00 <donothingDR>, altdest@entry=0x5576bea00890,
completionTag=completionTag@entry=0x7ffdd552ff40 "") at
./build/../src/backend/tcop/pquery.c:1283
#16 0x00005576be09a1ff in PortalRun (portal=portal@entry=0x5576bea978f0,
count=count@entry=1, isTopLevel=isTopLevel@entry=true, run_once=<optimized
out>, dest=dest@entry=0x5576bea00890, altdest=altdest@entry=0x5576bea00890,
completionTag=0x7ffdd552ff40 "") at
./build/../src/backend/tcop/pquery.c:796
#17 0x00005576be097a31 in exec_execute_message (max_rows=1,
portal_name=0x5576bea00480 "") at
./build/../src/backend/tcop/postgres.c:2090
#18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x5576bea2fbd0,
dbname=<optimized out>, username=<optimized out>) at
./build/../src/backend/tcop/postgres.c:4297
#19 0x00005576be01e53b in BackendRun (port=0x5576bea27dc0,
port=0x5576bea27dc0) at
./build/../src/backend/postmaster/postmaster.c:4437
#20 BackendStartup (port=0x5576bea27dc0) at
./build/../src/backend/postmaster/postmaster.c:4128
#21 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#22 0x00005576be01f513 in PostmasterMain (argc=5, argv=0x5576be9f9090) at
./build/../src/backend/postmaster/postmaster.c:1377
#23 0x00005576bdd95cf6 in main (argc=5, argv=0x5576be9f9090) at
./build/../src/backend/main/main.c:228
#0 EvalPlanQualFetchRowMark (epqstate=0x5576bfbccfc0, rti=2,
slot=slot@entry=0x5576c888b6c0) at
./build/../src/backend/executor/execMain.c:2590
earm = 0x5576c7176258
erm = 0x24
datum = <optimized out>
isNull = false
__func__ = "EvalPlanQualFetchRowMark"
#1 0x00005576bdf449ce in ExecScanFetch (recheckMtd=0x5576bdf65740
<SeqRecheck>, accessMtd=0x5576bdf656b0 <SeqNext>, node=0x5576c888b500) at
./build/../src/backend/executor/execScan.c:115
slot = 0x5576c888b6c0
epqstate = <optimized out>
scanrelid = <optimized out>
slot = <optimized out>
estate = 0x5576c888b100
estate = <optimized out>
epqstate = <optimized out>
scanrelid = <optimized out>
slot = <optimized out>
slot = <optimized out>
slot = <optimized out>
slot = <optimized out>
#2 ExecScan (node=0x5576c888b500,
Excerpt shows the pointer 0x24 of erm to be uninitialized'ish. Once i can
capture a second
stacktrace i might be able to confirm if this is indeed a random value or
always 0x24. However,
we're workarounding this by simply not running this query concurrently
anymore hence i might
lack new stacktraces soon.
UPDATE partitionedtable ia SET A=A, B=now() FROM temptable tt WHERE
ia.id=tt.ia_id AND A::text IS DISTINCT FROM tt.A::text;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Update on partitionedtable ia (cost=0.43..1202173.35 rows=465660
width=143)
Update on partitionedtable ia
Update on partitionedtable_0062 ia_1
Update on partitionedtable_0061 ia_2
Update on partitionedtable_0060 ia_3
[repeats for another ~400 Partitions]
[..]
-> Nested Loop (cost=0.43..3386.78 rows=1194 width=145)
-> Seq Scan on temptable tt (cost=0.00..22.00 rows=1200
width=46)
-> Index Scan using partitionedtable_pk on partitionedtable ia
(cost=0.43..2.79 rows=1 width=560)
Index Cond: (id = tt.ia_id)
Filter: ((document)::text IS DISTINCT FROM
(tt.document_json)::text)
-> Nested Loop (cost=0.42..3108.99 rows=1194 width=159)
-> Seq Scan on temptable tt (cost=0.00..22.00 rows=1200
width=46)
-> Index Scan using partitionedtable_062_pk on
partitionedtable_0062 ia_1 (cost=0.42..2.56 rows=1 width=518)
Index Cond: (id = tt.ia_id)
Filter: ((document)::text IS DISTINCT FROM
(tt.document_json)::text)
[repeats for another ~400 Partitions]
[..]JIT:
Functions: 3510
Options: Inlining true, Optimization true, Expressions true, Deforming
true
(2344 rows)
The Query basically updates a massively partitioned* Table from the
contents
of a temptable.
*Inheritance-Partitioning on ID(not modified within the query)
*Inheritance has triggers. ISTM that they are not of any concern.
System is a current Ubuntu18.04 with a freshly pg_upgrade'd postgresql-12.1
pgdg-build,
including new statistics, indices and vacuum.
I'm currently trying to get this to a minimized and reproducible form as i
do not see an
obvious error in the code yet. For the time being i fail to actually
recreate as accurate as
production does :)
regards,
julian
PG Bug reporting form <noreply@postgresql.org> writes:
for a week now (upgrade pg11=>12) we're experiencing a bug that causes
segfaults on a daily basis.
I've yet to reproduce this exact behaviour as the querys do not fail all the
time but rather now-and-then
when run concurrently. ( ~4 Crashes so far )
Not sure if you realize that the EvalPlanQual stuff is only called when
there's an uncommitted conflicting update on a target row. So the way
I'd try to reproduce something like this is
session 1 session 2
begin;
issue update query;
issue same update query;
commit;
which should take the race-condition issue out of the equation,
since session 2 will certainly be trying to update the same
rows that session 1 did but hasn't yet committed.
There may be other hard-to-reproduce factors at work, of course.
If you can test on a debug build (with --enable-cassert), that'd
help remove a few more variables.
earm = 0x5576c7176258
erm = 0x24
Excerpt shows the pointer 0x24 of erm to be uninitialized'ish.
Yeah. If you could do "p *earm" here, that might offer some clues;
I'm wondering if that entire struct has gotten trashed (likely because
somebody freed it too soon).
regards, tom lane