BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

Started by PG Bug reporting formover 1 year ago13 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18576
Logged by: Vasya B
Email address: vasiliy.boytsov@gmail.com
PostgreSQL version: 16.3
Operating system: Ubuntu 24.04
Description:

From a clean DB, one can execute:
EXPLAIN (VERBOSE) SELECT FROM information_schema.element_types WHERE
object_type = 'TABLE';
Which returns:
ERROR: failed to find plan for subquery ss
While the expected result was a working query.
W/O VERBOSE this query works.

#2Richard Guo
guofenglinux@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

On Thu, Aug 8, 2024 at 8:23 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

From a clean DB, one can execute:
EXPLAIN (VERBOSE) SELECT FROM information_schema.element_types WHERE
object_type = 'TABLE';
Which returns:
ERROR: failed to find plan for subquery ss

Thanks for the report!

I think the problem is that when we see a Var that references a
SUBQUERY RTE when deparsing a Plan tree to get the name of a field, we
assume that we are in a SubqueryScan plan node, in which case the code
is no problem because set_deparse_plan has set dpns->inner_plan to its
child plan. However, this bug shows that this assumption does not
always hold: we might instead be in a Result node with a Var
referencing a SUBQUERY RTE. This problem can be reproduced with the
query below.

EXPLAIN (VERBOSE, COSTS OFF)
SELECT (ss.a).x, (ss.a).n FROM
(SELECT information_schema._pg_expandarray(ARRAY[1,2]) AS a) ss
WHERE FALSE;
ERROR: failed to find plan for subquery ss

In this case, due to the constant false filter, the whole plan is
reduced to a dummy Result node, with a targetlist consisting of 'a.x'
and 'a.n', where 'a' is a Var referencing the SUBQUERY RTE. We do not
generate a SubqueryScan plan node for the subquery, as the relation is
recognized as dummy. That is to say, we neither have a valid
rte->subquery nor a valid SubqueryScan plan node. So it seems to me
that there is no easy way to get the names of the fields in this case.
I'm wondering whether we can just compose a fake name with something
like below?

@@ -7903,6 +7903,14 @@ get_name_for_var_field(Var *var, int fieldno,
deparse_namespace save_dpns;
const char *result;

+               if (IsA(dpns->plan, Result))
+               {
+                   char       *fakeCol = palloc(32);
+
+                   snprintf(fakeCol, sizeof(fakeCol), "col%d", fieldno);
+                   return fakeCol;
+               }
+
                if (!dpns->inner_plan)
                    elog(ERROR, "failed to find plan for subquery %s",
                         rte->eref->aliasname);

This same problem can also happen to CTEs.

EXPLAIN (VERBOSE, COSTS OFF)
WITH ss AS MATERIALIZED
(SELECT information_schema._pg_expandarray(ARRAY[1,2]) AS a)
SELECT (ss.a).x, (ss.a).n FROM ss WHERE FALSE;
ERROR: failed to find plan for CTE ss

Thanks
Richard

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#2)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

Richard Guo <guofenglinux@gmail.com> writes:

I think the problem is that when we see a Var that references a
SUBQUERY RTE when deparsing a Plan tree to get the name of a field, we
assume that we are in a SubqueryScan plan node, in which case the code
is no problem because set_deparse_plan has set dpns->inner_plan to its
child plan. However, this bug shows that this assumption does not
always hold: we might instead be in a Result node with a Var
referencing a SUBQUERY RTE.

Yeah. I think the comment about that in get_name_for_var_field was
accurate when written, but that was a few rounds of planner
improvements ago. I found that your simplified query works up to
9.5 but fails in >= 9.6, whereupon I bisected to find

3fc6e2d7f5b652b417fa6937c34de2438d60fa9f is the first bad commit
commit 3fc6e2d7f5b652b417fa6937c34de2438d60fa9f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Mar 7 15:58:22 2016 -0500

Make the upper part of the planner work by generating and comparing Paths.

Immediately before that, the non-VERBOSE description of the plan was

Result
One-Time Filter: false
-> Result

while that commit changed it to

Result
One-Time Filter: false

So even before that, we'd been emitting a Result not a SubqueryScan
plan node, but it worked anyway because the lower Result had the
same tlist as the SubqueryScan it replaced: the VERBOSE output was

Result
Output: ((information_schema._pg_expandarray('{1,2}'::integer[]))).x, ((information_schema._pg_expandarray('{1,2}'::integer[]))).n
One-Time Filter: false
-> Result
Output: information_schema._pg_expandarray('{1,2}'::integer[])

However, once we decided we didn't really need the child plan node at
all, get_name_for_var_field was broken. I'm surprised it took this
long to notice.

... That is to say, we neither have a valid
rte->subquery nor a valid SubqueryScan plan node. So it seems to me
that there is no easy way to get the names of the fields in this case.

Indeed.

I'm wondering whether we can just compose a fake name with something
like below?

This seems like a band-aid. Also, I experimented with it and found
that for your test query, the output is

Result
Output: (a).col1, (a).col2
One-Time Filter: false

This is confusing (where'd "a" come from?) and it makes me quite
nervous that there are other cases where we'd also fail. What
we basically have here is a dangling-reference Var with no valid
referent. That's wrong in itself and it seems like it risks
causing executor problems not only EXPLAIN problems. It's just
minimally safe because we know that the tlist will never actually
get evaluated ... but this could easily trip up logic that runs
during executor startup.

I wonder if we shouldn't change what the planner puts into the tlist
of a dummy Result. That is, I'm imagining reducing the tlist of
such a Result to null Consts that would serve to show the right column
datatypes and not much else:

Result
Output: NULL, NULL
One-Time Filter: false

I've not looked at how messy this might get, though.

This same problem can also happen to CTEs.

Yeah, basically the same thing in the RTE_CTE switch case.

regards, tom lane

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

I wrote:

I wonder if we shouldn't change what the planner puts into the tlist
of a dummy Result. That is, I'm imagining reducing the tlist of
such a Result to null Consts that would serve to show the right column
datatypes and not much else:

Result
Output: NULL, NULL
One-Time Filter: false

I experimented with this, and was interested to find that there's
basically only one place we could do it in a low-level, localized
fashion, namely in set_plan_refs' handling of Result nodes; if you try
to do it earlier then set_upper_references fails when there's an upper
plan node that should copy such a column's value. And there's
*already* an ugly hack there, which we could replace with a more
general hack. See 0001 attached.

However, after testing it I got less excited, because it caused
quite a lot of regression test changes (which I didn't bother to
include in 0001). Many of them seem like significant readability
lossage, for example

  GroupAggregate
-   Group Key: pagg_tab1.y
+   Group Key: (NULL::integer)
    ->  Sort
-         Sort Key: pagg_tab1.y
+         Sort Key: (NULL::integer)
          ->  Result
                One-Time Filter: false

and in a few cases I feel like the change actually means that the
test no longer proves what it intends to, because significant
details vanish into a sea of NULLs.

Another problem, which we could probably work around, is that this
breaks removal of trivial SubqueryScan nodes in some cases, because
that happens even later and the tlists no longer match.

So I feel like my idea is a dead end. Even without the above
problems, a significant change in EXPLAIN's behavior in back
branches would be a hard sell --- the more so if it's to fix
a case that nobody even noticed for eight years.

So I'm coming around to doing something like the quick hack you
proposed. I don't like it too much because it seems like it could
make other bugs in this area much harder to notice, but I don't have a
better idea. We do need some work on the outdated comments though.
Also, I think we should use "f%d" not "col%d" by analogy to the
default field names for RowExprs and anonymous record types.
See 0002 attached.

regards, tom lane

Attachments:

0001-bug-18576-fix-fail.patchtext/x-diff; charset=us-ascii; name=0001-bug-18576-fix-fail.patchDownload+32-17
0002-bug-18576-fix-v2.patchtext/x-diff; charset=us-ascii; name=0002-bug-18576-fix-v2.patchDownload+107-13
#5Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#4)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

On Fri, Aug 9, 2024 at 4:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, after testing it I got less excited, because it caused
quite a lot of regression test changes (which I didn't bother to
include in 0001). Many of them seem like significant readability
lossage, for example

GroupAggregate
-   Group Key: pagg_tab1.y
+   Group Key: (NULL::integer)
->  Sort
-         Sort Key: pagg_tab1.y
+         Sort Key: (NULL::integer)
->  Result
One-Time Filter: false

Yeah, it seems that the original tlist of a dummy Result is needed to
deparse upper plan nodes. With 0001, expressions that should be
matched to lower tlist might end up with all NULLs, which seems not
great.

So I'm coming around to doing something like the quick hack you
proposed. I don't like it too much because it seems like it could
make other bugs in this area much harder to notice, but I don't have a
better idea. We do need some work on the outdated comments though.
Also, I think we should use "f%d" not "col%d" by analogy to the
default field names for RowExprs and anonymous record types.

Agreed. Do you think it would be helpful to add some assertions to
verify the plan type? For example, if dpns->inner_plan is NULL, the
plan should be a Result node; otherwise, it must be a SubqueryScan
node (or in the RTE_CTE case, it must be a CteScan/WorkTableScan
node).

Thanks
Richard

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#5)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

Richard Guo <guofenglinux@gmail.com> writes:

Agreed. Do you think it would be helpful to add some assertions to
verify the plan type?

I thought about that but didn't experiment with it. I wonder whether
it'd just make the code more fragile. Still, it might be useful to
verify that things are happening as we expect.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

I wrote:

Richard Guo <guofenglinux@gmail.com> writes:

Agreed. Do you think it would be helpful to add some assertions to
verify the plan type?

I thought about that but didn't experiment with it. I wonder whether
it'd just make the code more fragile. Still, it might be useful to
verify that things are happening as we expect.

After sleeping on it, I elected to add the Asserts as suggested,
except in v12 where they'd have had to look different. That's
not strictly a matter of laziness: v12's next release will be its
last, and I've been burned often enough to become very hesitant
about pushing even slightly-questionable code into an EOL release.

Pushed with that change.

regards, tom lane

#8Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#7)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

On 9/8/2024 17:25, Tom Lane wrote:

I wrote:

Richard Guo <guofenglinux@gmail.com> writes:

Agreed. Do you think it would be helpful to add some assertions to
verify the plan type?

I thought about that but didn't experiment with it. I wonder whether
it'd just make the code more fragile. Still, it might be useful to
verify that things are happening as we expect.

After sleeping on it, I elected to add the Asserts as suggested,
except in v12 where they'd have had to look different. That's
not strictly a matter of laziness: v12's next release will be its
last, and I've been burned often enough to become very hesitant
about pushing even slightly-questionable code into an EOL release.

Thanks for pushing this!

But could you change this code a little bit?
I reported this issue a year ago. At that time, it was triggered by the
CustomScan node [1]/messages/by-id/3933834e-b657-4ad1-bf4e-5f3fbba7ba14@app.fastmail.com. I haven't found the solution there [2]/messages/by-id/0cbbd87e-1b57-4b7e-9825-19a6fb2f8670@postgrespro.ru. Your code
looks like a good tradeoff, and if you slightly change the code (like in
the attachment), it allows CustomScan to survive such cases.

[1]: /messages/by-id/3933834e-b657-4ad1-bf4e-5f3fbba7ba14@app.fastmail.com
/messages/by-id/3933834e-b657-4ad1-bf4e-5f3fbba7ba14@app.fastmail.com
[2]: /messages/by-id/0cbbd87e-1b57-4b7e-9825-19a6fb2f8670@postgrespro.ru
/messages/by-id/0cbbd87e-1b57-4b7e-9825-19a6fb2f8670@postgrespro.ru

--
regards, Andrei Lepikhov

Attachments:

minor_fix.difftext/plain; charset=UTF-8; name=minor_fix.diffDownload+5-2
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrei Lepikhov (#8)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

Andrei Lepikhov <lepihov@gmail.com> writes:

Thanks for pushing this!

But could you change this code a little bit?
I reported this issue a year ago. At that time, it was triggered by the
CustomScan node [1]. I haven't found the solution there [2]. Your code
looks like a good tradeoff, and if you slightly change the code (like in
the attachment), it allows CustomScan to survive such cases.

This seems like it's making assumptions it shouldn't about what
CustomScan does. If there's an argument for doing this, it should
be added to the adjacent comments.

(Also, what's with the random change in contrib/Makefile?)

regards, tom lane

#10Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#9)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

On 19/8/2024 18:36, Tom Lane wrote:

Andrei Lepikhov <lepihov@gmail.com> writes:

Thanks for pushing this!

But could you change this code a little bit?
I reported this issue a year ago. At that time, it was triggered by the
CustomScan node [1]. I haven't found the solution there [2]. Your code
looks like a good tradeoff, and if you slightly change the code (like in
the attachment), it allows CustomScan to survive such cases.

This seems like it's making assumptions it shouldn't about what
CustomScan does. If there's an argument for doing this, it should
be added to the adjacent comments.

Hm, I got into this problem many times using CustomScan node. Do you
have some objections to not allow CustomScan node have a RECORD Var in
the target list? For example, once I got this bug designing CustomScan
which gathered lightweight statistics on-the-fly under a Sort and
GROUP-BY nodes. I didn't change any tuple and had the same target list
as the child node. Why we should analyse target list and don't use
CustomScan if it contains Var of specific type?

(Also, what's with the random change in contrib/Makefile?)

Oops, it's a waste code, pardon.

--
regards, Andrei Lepikhov

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrei Lepikhov (#10)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

Andrei Lepikhov <lepihov@gmail.com> writes:

On 19/8/2024 18:36, Tom Lane wrote:

This seems like it's making assumptions it shouldn't about what
CustomScan does. If there's an argument for doing this, it should
be added to the adjacent comments.

Hm, I got into this problem many times using CustomScan node. Do you
have some objections to not allow CustomScan node have a RECORD Var in
the target list?

In the case of a childless Result, we can suppose that the reason why
we're here is that a provably-empty subquery got optimized away.
If the Var were actually evaluated at runtime, it would fail, so that
had better be the case. (I thought about extending these new Asserts
to check that the Result has a constant-false resconstantqual, but
decided that was overkill.)

It's not clear to me what the equivalent argument is for allowing
CustomScan. I don't say that there isn't one. I do say that a patch
like this should make that argument, in the same comment block that
explains why we're doing this for Result.

The main reason I'm being sticky about this is that if we need to
allow CustomScan, then it seems likely that we also need to allow
ForeignScan, and maybe some other things, and then I start to
wonder if we should have any assertion at all about the child
plan type. So I want to actually understand what is the scenario
in which this will happen.

regards, tom lane

#12Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#11)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

On 19/8/2024 19:26, Tom Lane wrote:

Andrei Lepikhov <lepihov@gmail.com> writes:

On 19/8/2024 18:36, Tom Lane wrote:

This seems like it's making assumptions it shouldn't about what
CustomScan does. If there's an argument for doing this, it should
be added to the adjacent comments.

The main reason I'm being sticky about this is that if we need to
allow CustomScan, then it seems likely that we also need to allow
ForeignScan, and maybe some other things, and then I start to
wonder if we should have any assertion at all about the child
plan type. So I want to actually understand what is the scenario
in which this will happen.

I understand your point — the origins of ForeignScan and CustomScan are
the same.
However, we also have a difference: CustomScan can be executed locally
and sometimes may allow volatile functions, CTE, and many other things
in the underlying subtree. That's why we can't replay some cases with
ForeignScan stuff.
But ok, it make sense. I'll try to reproduce the case by employing
ForeignScan.

--
regards, Andrei Lepikhov

#13Andrei Lepikhov
lepihov@gmail.com
In reply to: Andrei Lepikhov (#12)
Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for

On 19/8/2024 21:16, Andrei Lepikhov wrote:

On 19/8/2024 19:26, Tom Lane wrote:

Andrei Lepikhov <lepihov@gmail.com> writes:

On 19/8/2024 18:36, Tom Lane wrote:

This seems like it's making assumptions it shouldn't about what
CustomScan does.  If there's an argument for doing this, it should
be added to the adjacent comments.

The main reason I'm being sticky about this is that if we need to
allow CustomScan, then it seems likely that we also need to allow
ForeignScan, and maybe some other things, and then I start to
wonder if we should have any assertion at all about the child
plan type.  So I want to actually understand what is the scenario
in which this will happen.

I understand your point — the origins of ForeignScan and CustomScan are
the same.
However, we also have a difference: CustomScan can be executed locally
and sometimes may allow volatile functions, CTE, and many other things
in the underlying subtree. That's why we can't replay some cases with
ForeignScan stuff.
But ok, it make sense. I'll try to reproduce the case by employing
ForeignScan.

Delving into the intricacies of ForeignScan, I came to the realization
that implementing a push-down of ROW() expression in FDW is no simple
feat. It requires a 'deparse' routine and other complex stuff. While
there have been the attempt [1]/messages/by-id/2253e9091b300d868d524c0943fa8796@postgrespro.ru, it appears to be in a raw state, making
it challenging to use.
Given that ForeignScan forms a flat target list for foreign scans and
assembles whole-row expressions locally, it's worth pondering: do we
really need to limit CustomScan in the same way? If yes, should we add a
'restrictions' section describing the limitations of the underlying
subtree's target list to the documentation?

[1]: /messages/by-id/2253e9091b300d868d524c0943fa8796@postgrespro.ru
/messages/by-id/2253e9091b300d868d524c0943fa8796@postgrespro.ru

--
regards, Andrei Lepikhov