BUG #16213: segfault when running a query

Started by PG Bug reporting formabout 6 years ago3 messageshackersbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org
hackersbugs

The following bug has been logged on the website:

Bug reference: 16213
Logged by: Matt Jibson
Email address: matt.jibson@gmail.com
PostgreSQL version: 11.5
Operating system: linux
Description:

SELECT
*
FROM
(
SELECT
tab_31924.col_41292 AS col_41294,
tab_31924.col_41293 AS col_41295,
0::OID AS col_41296,
false AS col_41297
FROM
(
VALUES
(
'A'::STRING::STRING
NOT IN (
SELECT
'E'::STRING::STRING AS col_41289
FROM
(
VALUES
(NULL),
(NULL),
(NULL),
(NULL)
)
AS tab_31923 (col_41288)
WHERE
false
),
NULL,
'B'::STRING,
3::OID
),
(false, 4::OID, 'B'::STRING, 0::OID)
)
AS tab_31924
(col_41290, col_41291, col_41292, col_41293)
WHERE
tab_31924.col_41290
)
AS tab_31925
ORDER BY
col_41294 NULLS FIRST,
col_41295 NULLS FIRST,
col_41296 NULLS FIRST,
col_41297 NULLS FIRST;

The above query produces an error in the server log:

LOG: server process (PID 108) was terminated by signal 11: Segmentation
fault

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
hackersbugs
Re: BUG #16213: segfault when running a query

PG Bug reporting form <noreply@postgresql.org> writes:

The above query produces an error in the server log:
LOG: server process (PID 108) was terminated by signal 11: Segmentation
fault

Yeah, duplicated here. (For anyone following along at home: you
can "create domain string as text", or just s/STRING/TEXT/g in the
given query. That type's not relevant to the problem.) The problem
is probably much more easily reproduced in a debug build, because it
boils down to a dangling-pointer bug. I duplicated it back to 9.4,
and it's probably older than that.

The direct cause of the crash is that by the time we get to ExecutorEnd,
there are dangling pointers in the es_tupleTable list. Those pointers
turn out to originate from ExecInitSubPlan's creation of TupleTableSlots
for the ProjectionInfo objects it creates when doing hashing. And the
reason they're dangling is that the subplan is inside a VALUES list,
and nodeValuesscan.c does this remarkably bletcherous thing:

* Build the expression eval state in the econtext's per-tuple memory.
* This is a tad unusual, but we want to delete the eval state again
* when we move to the next row, to avoid growth of memory
* requirements over a long values list.

It turns out that just below that, there's already some messy hacking
to deal with subplans in the VALUES list. But I guess we'd not hit
the case of a subplan using hashing within VALUES.

The attached draft patch fixes this by not letting ExecInitSubPlan hook
the slots it's making into the outer es_tupleTable list. Ordinarily
that would be bad because it exposes us to leaking resources, if the
slots aren't cleared before ending execution. But nodeSubplan.c is
already being (mostly) careful to clear those slots promptly, so it
doesn't cost us anything much to lose this backstop.

What that change fails to do is guarantee that there are no such
bugs elsewhere. In the attached I made nodeValuesscan.c assert that
nothing has added slots to the es_tupleTable list, but of course
that only catches cases where there's a live bug. Given how long
this case escaped detection, I don't feel like that's quite enough.
(Unsurprisingly, the existing regression tests don't trigger this
assert, even without the nodeSubplan.c fix.)

Another angle I've not run to ground is that the comments for the
existing subplan-related hacking in nodeValuesscan.c claim that
a problematic subplan could only occur in conjunction with LATERAL.
But there's no LATERAL in this example --- are those comments wrong
or obsolete, or just talking about a different case?

I didn't work on making a minimal test case for the regression tests,
either.

Anyway, thanks for the report!

regards, tom lane

Attachments:

fix-for-hashed-subplan-in-VALUES-1.patchtext/x-diff; charset=us-ascii; name=fix-for-hashed-subplan-in-VALUES-1.patchDownload+46-16
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
hackersbugs
Re: BUG #16213: segfault when running a query

I wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

The above query produces an error in the server log:
LOG: server process (PID 108) was terminated by signal 11: Segmentation
fault

The direct cause of the crash is that by the time we get to ExecutorEnd,
there are dangling pointers in the es_tupleTable list.

After further reflection, I'm totally dissatisfied with the quick-hack
patch I posted last night. I think that what this example demonstrates
is that my fix (in commit 9b63c13f0) for bug #14924 in fact does not
work: the subPlan list is not the only way in which a SubPlan connects
up to the outer plan level. I have no faith that the es_tupleTable list
is the only other way, either, or that we won't create more in future.

I think what we have to do here is revert 9b63c13f0, going back to
the previous policy of passing down parent = NULL to the transient
subexpressions, so that there is a strong guarantee that there aren't
any unwanted connections between short-lived and longer-lived state.
And then we need some other solution for making SubPlans in VALUES
lists work. The best bet seems to be what I speculated about in that
commit message: initialize the expressions for VALUES rows that contain
SubPlans normally at executor startup, and use the trick with
short-lived expression state only for VALUES rows that don't contain any
SubPlans. I think that the case we're worried about with long VALUES
lists is not likely to involve any SubPlans, so that this shouldn't be
too awful for memory consumption.

Another benefit of doing it like this is that SubPlans in the VALUES
lists are reported normally by EXPLAIN, while the previous hack caused
them to be missing from the output.

Objections, better ideas?

regards, tom lane

Attachments:

fix-for-hashed-subplan-in-VALUES-2.patchtext/x-diff; charset=us-ascii; name=fix-for-hashed-subplan-in-VALUES-2.patchDownload+118-43