BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

Started by PG Bug reporting form · 9 months ago · 12 messages · pgsql-bugs
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18986
Logged by: Yaroslav Syrytsia
Email address: y@lll.gd
PostgreSQL version: 17.5
Operating system: Ubuntu 22.04.5
Description:

Hello,

During testing a flow in an app, I encountered a segfault that only happens
during parallel execution:
https://gist.github.com/joy4eg/708458e204f52129a8e54a13534586b7

Looking into the coredump, it looks like a NULL pointer dereference is
happening at
https://github.com/postgres/postgres/blob/5e2f3df49d4298c6097789364a5a53be172f6e85/src/backend/executor/nodeModifyTable.c#L752

After further investigation, that part of the code looks suspicious as well:
static void
ExecInitInsertProjection(ModifyTableState *mtstate,
                         ResultRelInfo *resultRelInfo)
{
    ...
    /* Build ProjectionInfo if needed (it probably isn't). */
    if (need_projection)
    {
        TupleDesc   relDesc = RelationGetDescr(resultRelInfo->ri_RelationDesc);

        /* need an expression context to do the projection */
        if (mtstate->ps.ps_ExprContext == NULL)
            ExecAssignExprContext(estate, &mtstate->ps);

        resultRelInfo->ri_projectNew =
            ExecBuildProjectionInfo(insertTargetList,
                                    mtstate->ps.ps_ExprContext,
                                    resultRelInfo->ri_newTupleSlot,
                                    &mtstate->ps,
                                    relDesc);
    }

    resultRelInfo->ri_projectNewInfoValid = true;  /* <--- set to true even
                                                    * if need_projection is
                                                    * false */
}

(gdb) frame 0
#0 ExecGetUpdateNewTuple (oldSlot=0x55defb35ae20, planSlot=0x55defb41d9d8,
relinfo=0x55defb312108) at
executor/./build/../src/backend/executor/nodeModifyTable.c:752
752 econtext = newProj->pi_exprContext;
(gdb) p relinfo
$1 = (ResultRelInfo *) 0x55defb312108
(gdb) p *relinfo
$2 = {type = T_ResultRelInfo, ri_RangeTableIndex = 15, ri_RelationDesc =
0x7f0c0eb80500, ri_NumIndices = 21, ri_IndexRelationDescs = 0x55defb358248,
ri_IndexRelationInfo = 0x55defb358350, ri_RowIdAttNo = 1,
ri_extraUpdatedCols = 0x55defb357300, ri_projectNew = 0x0, ri_newTupleSlot
= 0x55defb323cf0, ri_oldTupleSlot = 0x55defb323ae8, ri_projectNewInfoValid =
true, ri_needLockTagTuple = false, ri_TrigDesc = 0x55defb312310,
ri_TrigFunctions = 0x55defb31a318, ri_TrigWhenExprs = 0x55defb2d8cf8,
ri_TrigInstrument = 0x0, ri_ReturningSlot = 0x0, ri_TrigOldSlot =
0x55defb35ae20, ri_TrigNewSlot = 0x0, ri_FdwRoutine = 0x0, ri_FdwState =
0x0,
ri_usesFdwDirectModify = false, ri_NumSlots = 0, ri_NumSlotsInitialized =
0, ri_BatchSize = 0, ri_Slots = 0x0, ri_PlanSlots = 0x0, ri_WithCheckOptions
= 0x0, ri_WithCheckOptionExprs = 0x0, ri_ConstraintExprs = 0x0,
ri_GeneratedExprsI = 0x0, ri_GeneratedExprsU = 0x55defb358140,
ri_NumGeneratedNeededI = 0, ri_NumGeneratedNeededU = 2, ri_returningList =
0x55defb39c630, ri_projectReturning = 0x55defb323590,
ri_onConflictArbiterIndexes = 0x0,
ri_onConflict = 0x0, ri_MergeActions = {0x55defb324518, 0x0,
0x55defb328e50}, ri_MergeJoinCondition = 0x0, ri_PartitionCheckExpr = 0x0,
ri_ChildToRootMap = 0x0, ri_ChildToRootMapValid = false, ri_RootToChildMap =
0x0,
ri_RootToChildMapValid = false, ri_RootResultRelInfo = 0x0,
ri_PartitionTupleSlot = 0x0, ri_CopyMultiInsertBuffer = 0x0,
ri_ancestorResultRels = 0x0}

Where:
- ri_projectNewInfoValid = true
- ri_projectNew = 0x0

Version tested:
- 17.5 (ubuntu, apt, with and without pgroonga)
- postgres:18beta1-bookworm (docker, without pgroonga)
- postgres:17.5-bookworm (docker, without pgroonga)

Yaroslav.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

PG Bug reporting form <noreply@postgresql.org> writes:

During testing a flow in an app, I encountered a segfault that only happens
during parallel execution:
https://gist.github.com/joy4eg/708458e204f52129a8e54a13534586b7

This report isn't terribly helpful, since you have not explained
how to trigger the crash. If we can't duplicate it, it's hard
to decide what is the appropriate fix.

regards, tom lane

#3Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#2)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

On Tue, Jul 15, 2025 at 02:13:53PM -0400, Tom Lane wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

During testing a flow in an app, I encountered a segfault that only happens
during parallel execution:
https://gist.github.com/joy4eg/708458e204f52129a8e54a13534586b7

This report isn't terribly helpful, since you have not explained
how to trigger the crash. If we can't duplicate it, it's hard
to decide what is the appropriate fix.

For the sake of visibility if the information on github gets lost, the
change can be summarized by this query pattern, but we don't have any
information about the end of the query:

WITH replaced AS (
    DELETE FROM replaceable_events_before_update
    WHERE replaced_by_id = ANY($1)
    RETURNING attr_list_1, [...] ),
source_data (attr_list_2) AS (
    SELECT * from replaced UNION ALL
    VALUES (attr_list_3 ...

That's most likely generated a MERGE by an ORM to have these many
parameters and columns listed. Without the full pattern or an idea of
the schema as this includes triggers, what we can do is limited, as
Tom says.

And the relevant portion of the backtrace:
#7  ExecGetUpdateNewTuple (oldSlot=0x55b0842d0288,
    planSlot=0x55b084249d68, relinfo=0x55b0836a9b88)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:752
#8  ExecBRUpdateTriggers (estate=0x55b08369a708,
    epqstate=0x55b0836a9a68, relinfo=0x55b0836a9b88,
    tupleid=0x7fffc373628a, fdw_trigtuple=0x0, newslot=0x55b083720eb0,
    tmresult=0x7fffc37361d0, tmfd=0x7fffc37362e0)
    at commands/./build/../src/backend/commands/trigger.c:2979
#9  0x000055b08158628e in ExecUpdatePrologue
    (context=context@entry=0x7fffc37362c0,
    resultRelInfo=resultRelInfo@entry=0x55b0836a9b88,
    tupleid=tupleid@entry=0x7fffc373628a, oldtuple=oldtuple@entry=0x0,
    slot=slot@entry=0x55b083720eb0, result=result@entry=0x7fffc37361d0)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:1949
#10 0x000055b081587e75 in ExecMergeMatched (context=<optimized out>,
    resultRelInfo=<optimized out>, tupleid=0x7fffc373628a, oldtuple=0x0,
    canSetTag=false, matched=0x7fffc3736290)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:3023

If you could extract the query string from the backtrace, that would
be one piece of information. With triggers and MERGE in mind, it may
be possible to reproduce something.
--
Michael

#4Yaroslav Syrytsia
y@lll.gd
In reply to: Tom Lane (#2)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

Hello,

I made a small reproducer here - https://github.com/joy4eg/pg-merge-crash

It creates a few tables and then executes MERGE statements in parallel.

Yaroslav.


On 2025-07-15 20:13 CEST Tom Lane <tgl@sss.pgh.pa.us> wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

During testing a flow in an app, I encountered a segfault that only happens
during parallel execution:
https://gist.github.com/joy4eg/708458e204f52129a8e54a13534586b7

This report isn't terribly helpful, since you have not explained
how to trigger the crash. If we can't duplicate it, it's hard
to decide what is the appropriate fix.

regards, tom lane

#5Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Michael Paquier (#3)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

On Wed, 16 Jul 2025 at 01:10, Michael Paquier <michael@paquier.xyz> wrote:

And the relevant portion of the backtrace:
#7  ExecGetUpdateNewTuple (oldSlot=0x55b0842d0288,
    planSlot=0x55b084249d68, relinfo=0x55b0836a9b88)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:752
#8  ExecBRUpdateTriggers (estate=0x55b08369a708,
    epqstate=0x55b0836a9a68, relinfo=0x55b0836a9b88,
    tupleid=0x7fffc373628a, fdw_trigtuple=0x0, newslot=0x55b083720eb0,
    tmresult=0x7fffc37361d0, tmfd=0x7fffc37362e0)
    at commands/./build/../src/backend/commands/trigger.c:2979
#9  0x000055b08158628e in ExecUpdatePrologue
    (context=context@entry=0x7fffc37362c0,
    resultRelInfo=resultRelInfo@entry=0x55b0836a9b88,
    tupleid=tupleid@entry=0x7fffc373628a, oldtuple=oldtuple@entry=0x0,
    slot=slot@entry=0x55b083720eb0, result=result@entry=0x7fffc37361d0)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:1949
#10 0x000055b081587e75 in ExecMergeMatched (context=<optimized out>,
    resultRelInfo=<optimized out>, tupleid=0x7fffc373628a, oldtuple=0x0,
    canSetTag=false, matched=0x7fffc3736290)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:3023

Hmm, I can see the problem. This particular stack trace should never happen:

#7 ExecGetUpdateNewTuple
#8 ExecBRUpdateTriggers
#9 ExecUpdatePrologue
#10 ExecMergeMatched

ExecBRUpdateTriggers() only calls ExecGetUpdateNewTuple() if
GetTupleForTrigger() sets epqslot_candidate to a non-null value, and
GetTupleForTrigger() should only set epqslot_candidate (*epqslot) to
non-null if we're not doing a MERGE (see commit 9321c79).

The problem, however, is that GetTupleForTrigger() tests for MERGE by doing

if (estate->es_plannedstmt->commandType != CMD_MERGE)

which doesn't work if the MERGE is inside a CTE. So we need a
different way for ExecBRUpdateTriggers() / GetTupleForTrigger() to
know that it's being called from within a MERGE.

Regards,
Dean

#6Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#5)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

On Wed, 16 Jul 2025 at 10:55, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

The problem, however, is that GetTupleForTrigger() tests for MERGE by doing

if (estate->es_plannedstmt->commandType != CMD_MERGE)

which doesn't work if the MERGE is inside a CTE. So we need a
different way for ExecBRUpdateTriggers() / GetTupleForTrigger() to
know that it's being called from within a MERGE.

Attached is a reproducer using the merge-match-recheck isolation test.

I believe that the bug only goes back to v17, because MERGE could not
appear inside a CTE prior to that.

I think the best thing to do is pass the commandType to
ExecBRUpdateTriggers(), except that in v17 we shouldn't change the
signature of ExecBRUpdateTriggers(), so a wrapper function will be
needed, similar to what 9321c79 did.

Question: Is it OK to change the signature of ExecBRUpdateTriggers() in v18?

Regards,
Dean

Attachments:

merge-bug.patch (text/x-patch; charset=US-ASCII) +13 -9
#7Michael Paquier
michael@paquier.xyz
In reply to: Dean Rasheed (#6)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

On Wed, Jul 16, 2025 at 11:42:33AM +0100, Dean Rasheed wrote:

Attached is a reproducer using the merge-match-recheck isolation test.

I believe that the bug only goes back to v17, because MERGE could not
appear inside a CTE prior to that.

That was fast, nice!

I think the best thing to do is pass the commandType to
ExecBRUpdateTriggers(), except that in v17 we shouldn't change the
signature of ExecBRUpdateTriggers(), so a wrapper function will be
needed, similar to what 9321c79 did.

Yes, changing ExecBRUpdateTriggers() would not be a good idea on a
stable branch... I can see that at least timescaledb calls it
directly. A minor-upgrade breakage would be bad for them.

Question: Is it OK to change the signature of ExecBRUpdateTriggers()
in v18?

We are still in beta, so that's not a problem for v18 and HEAD.
--
Michael

#8Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Michael Paquier (#7)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

On Wed, 16 Jul 2025 at 11:49, Michael Paquier <michael@paquier.xyz> wrote:

Question: Is it OK to change the signature of ExecBRUpdateTriggers()
in v18?

We are still in beta, so that's not a problem for v18 and HEAD.

Cool. I thought so, but I wanted to check.

Attached is a patch for HEAD/v18, and a slightly different one for
v17, preserving the trigger ABI in the standard way.

I decided to do this by adding an extra "is_merge_update" boolean
parameter, rather than passing the commandType because that looked
slightly neater. It was also necessary to update
ExecBRDeleteTriggers(), since otherwise a concurrent MERGE DELETE
could do the wrong thing (execute the wrong action, rather than
seg-faulting). That was picked up by an existing isolation test case
added by 9321c79, so no need for more tests.

I haven't tested this against the OP's reproducer.

Regards,
Dean

Attachments:

fix-merge-bug.patch (text/x-patch; charset=US-ASCII) +89 -50
fix-merge-bug-17.patch (text/x-patch; charset=US-ASCII) +153 -62
#9Michael Paquier
michael@paquier.xyz
In reply to: Dean Rasheed (#8)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

On Wed, Jul 16, 2025 at 02:05:06PM +0100, Dean Rasheed wrote:

I decided to do this by adding an extra "is_merge_update" boolean
parameter, rather than passing the commandType because that looked
slightly neater. It was also necessary to update
ExecBRDeleteTriggers(), since otherwise a concurrent MERGE DELETE
could do the wrong thing (execute the wrong action, rather than
seg-faulting). That was picked up by an existing isolation test case
added by 9321c79, so no need for more tests.

Looks sensible here, for the HEAD and v17 flavors.

I haven't tested this against the OP's reproducer.

Other than waiting for the OP to actually test the fix, I'm not sure
there is much we can do. Anyway, the stack that one gets with
the isolation test is very close to what the OP has reported, so I'm
ready to bet a coffee that you have been able to narrow this one down.

I have also raised an issue with timescale to make them aware of it.
I have no idea whether this affects them; I'm just acting as a
reporter:
https://github.com/timescale/timescaledb/issues/8392
--
Michael

#10Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Michael Paquier (#9)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

On Thu, 17 Jul 2025 at 06:45, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jul 16, 2025 at 02:05:06PM +0100, Dean Rasheed wrote:

I decided to do this by adding an extra "is_merge_update" boolean
parameter, rather than passing the commandType because that looked
slightly neater. It was also necessary to update
ExecBRDeleteTriggers(), since otherwise a concurrent MERGE DELETE
could do the wrong thing (execute the wrong action, rather than
seg-faulting). That was picked up by an existing isolation test case
added by 9321c79, so no need for more tests.

Looks sensible here, for the HEAD and v17 flavors.

Thanks for looking.

I haven't tested this against the OP's reproducer.

Except waiting for the OP to actually test the fix, I'm not sure if
there is much that we can do.

I'll push this in a day or so in any case, since it's clearly fixing
*an* issue, even if it doesn't entirely fix the OP's issue.

Anyway, the stack that one gets with
the isolation test is very close to what the OP has reported, so I'm
ready to bet a coffee that you have been able to narrow this one down.

Yeah, that was my thinking too.

I have also raised an issue with timescale to make them aware of it.
I have no idea whether this affects them; I'm just acting as a
reporter:
https://github.com/timescale/timescaledb/issues/8392

Cool, thanks for doing that.

Regards,
Dean

#11Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#10)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

On Thu, 17 Jul 2025 at 07:46, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

I'll push this in a day or so in any case, since it's clearly fixing
*an* issue, even if it doesn't entirely fix the OP's issue.

Pushed.

Regards,
Dean

#12Yaroslav Syrytsia
y@lll.gd
In reply to: Dean Rasheed (#11)
Re: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution

Hello,

I just made a fresh build:
commit 91ad1bdef8e46983d82d2723c6e1c05e004f74c5 (HEAD -> 17_stable, origin/REL_17_STABLE)
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Date: Fri Jul 18 10:01:31 2025 +0100

    Fix concurrent update trigger issues with MERGE in a CTE.

Can confirm that the issue is resolved now.
Thanks!
Yaroslav.


On Jul 18 2025, at 11:19 am, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Thu, 17 Jul 2025 at 07:46, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

I'll push this in a day or so in any case, since it's clearly fixing
*an* issue, even if it doesn't entirely fix the OP's issue.

Pushed.
Regards,
Dean