BUG #18953: Planner fails to build plan for complex query with LATERAL references
The following bug has been logged on the website:
Bug reference: 18953
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 18beta1
Operating system: Ubuntu 24.04
Description:
The following query:
create table tbl1(a int);
create table tbl2(b int);
create table tbl3(c int);
select * from tbl1 left join
(select case when a = 0 then 0 else subq_3.cc end from tbl1,
lateral (select 1 from tbl2 t1, tbl2 t2, tbl2 t3, tbl2 t4) subq_1,
lateral (
select tbl3.c as cc from tbl3, tbl2 t1, tbl2 t2,
lateral (select c, a from tbl2 limit 1) subq_2
) as subq_3
) subq_4 on true;
ends up with:
ERROR: XX000: failed to build any 4-way joins
LOCATION: standard_join_search, allpaths.c:3527
where N in "N-way" depends on the number of tbl2 t2, tbl2 t3, ... in
subq_1.
This is a simplified version of a query generated by SQLsmith.
The first commit this error is raised on is acfcd45ca. On acfcd45ca~1, the
select fails with "ERROR: could not devise a query plan for the given
query". The last commit it executed with no error is 4200a9286.
On Wed, Jun 11, 2025 at 3:14 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
The following query:
create table tbl1(a int);
create table tbl2(b int);
create table tbl3(c int);
select * from tbl1 left join
(select case when a = 0 then 0 else subq_3.cc end from tbl1,
lateral (select 1 from tbl2 t1, tbl2 t2, tbl2 t3, tbl2 t4) subq_1,
lateral (
select tbl3.c as cc from tbl3, tbl2 t1, tbl2 t2,
lateral (select c, a from tbl2 limit 1) subq_2
) as subq_3
) subq_4 on true;
ends up with:
ERROR: XX000: failed to build any 4-way joins
Thanks for the report. Here's a simplified repro.
create table t (a int);
set from_collapse_limit to 2;
select * from t t1 left join
(select coalesce(a+x) from t t2,
lateral (select t3.a as x from t t3, lateral (select t2.a, t3.a
offset 0) s))
on true;
ERROR: failed to build any 2-way joins
In this query, the join between t3 and s is placed into a separate
join sub-problem due to the from_collapse_limit. This join is deemed
not legal by join_is_legal(), as have_dangerous_phv() thinks the PHV
could pose a hazard as described in that function's comment. As a
result, no join could be built for this sub-problem.
It seems to me that there are some loose ends in the logic of
have_dangerous_phv(). In its comment, it says:
* (Note that we can still make use of A's parameterized
* path with pre-joined B+C as the outer rel. have_join_order_restriction()
* ensures that we will consider making such a join even if there are not
* other reasons to do so.)
However, if B and C end up in different sub-joinlists, it becomes
impossible to pre-join B+C.
In the query above, the PHV's minimum eval_at set includes t2 and t3.
But since t2 and t3 belong to different sub-joinlists, we have no way
to pre-join t2+t3 first and then join the result with s.
No idea how to fix this though. Any thoughts?
Thanks
Richard
On Wed, Jun 11, 2025 at 5:33 PM Richard Guo <guofenglinux@gmail.com> wrote:
In this query, the join between t3 and s is placed into a separate
join sub-problem due to the from_collapse_limit. This join is deemed
not legal by join_is_legal(), as have_dangerous_phv() thinks the PHV
could pose a hazard as described in that function's comment. As a
result, no join could be built for this sub-problem.
No idea how to fix this though. Any thoughts?
This might be a silly idea, but if we can't find a valid plan for a
join sub-problem, perhaps we could consider flattening the
sub-joinlist into the higher level to explore a solution in a broader
search space. The top-most joinlist should always be able to produce
a valid plan, otherwise something must be wrong during planning.
This may violate the from_collapse_limit restriction, but such cases
are expected to be very rare.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
Thanks for the report. Here's a simplified repro.
...
In this query, the join between t3 and s is placed into a separate
join sub-problem due to the from_collapse_limit. This join is deemed
not legal by join_is_legal(), as have_dangerous_phv() thinks the PHV
could pose a hazard as described in that function's comment. As a
result, no join could be built for this sub-problem.
Bleah.
No idea how to fix this though. Any thoughts?
My thought is that have_dangerous_phv() was never more than a
quick-n-dirty kludge, and what we really ought to do is remove it.
That means cleaning up the technical debt mentioned in 85e5e222b:
In principle we could allow such a PlaceHolderVar to be evaluated at the
lower join node using values passed down from the upper relation along with
values from the join's own outer relation. However, nodeNestloop.c only
supports simple Vars not arbitrary expressions as nestloop parameters.
createplan.c is also a few bricks shy of being able to handle such cases;
it misplaces the PlaceHolderVar parameters in the plan tree, which is why
the visible symptoms of this bug are "plan should not reference subplan's
variable" and "failed to assign all NestLoopParams to plan nodes" planner
errors.
Adding the necessary complexity to make this work doesn't seem like it
would be repaid in significantly better plans, because in cases where such
a PHV exists, there is probably a corresponding join order constraint that
would allow a good plan to be found without using the star-schema exception.
Furthermore, adding complexity to nodeNestloop.c would create a run-time
penalty even for plans where this whole consideration is irrelevant.
So let's just reject such paths instead.
I think that the argument about executor complexity might be a red
herring: if we can get the PHV to be evaluated in the tlist of the
nestloop's outer relation, then the reference to it will still just be
an outer Var in the NestLoopParam structure. I'm still poking at what
we'd have to do to the planner to get that to happen, but my initial
impression is that it might not be very complicated after all.
regards, tom lane
I wrote:
My thought is that have_dangerous_phv() was never more than a
quick-n-dirty kludge, and what we really ought to do is remove it.
Here's a WIP patch along that line. It's unfinished in that, for
testing purposes, I just lobotomized have_dangerous_phv() to return
constant false rather than taking it out entirely. But of course
we'd want to clean up all the dead code if we go this way.
I have mixed feelings about whether to back-patch or just make
this change in HEAD. While we're clearly fixing a bug here,
the bug's been there for 10 years, so the lack of field reports
suggests strongly that this is not something ordinary users write.
Two arguments against back-patching are that
(1) the odds of introducing a new bug aren't zero;
(2) removing the have_dangerous_phv() restriction will change
some plan choices, which we generally dislike doing in stable
branches.
I'm still comfortable with shoving this into v18, but maybe we
should leave the back branches alone.
regards, tom lane
Attachments:
wip-fix-bug-18953.patchtext/x-diff; charset=us-ascii; name=wip-fix-bug-18953.patchDownload+115-31
On Sat, Jun 14, 2025 at 5:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Here's a WIP patch along that line. It's unfinished in that, for
testing purposes, I just lobotomized have_dangerous_phv() to return
constant false rather than taking it out entirely. But of course
we'd want to clean up all the dead code if we go this way.
In identify_current_nestloop_params(), I wonder if we should use the
join path's required-outer rels rather than the left path's as the
"outerrelids". It seems that there may be cases where the join path
is parameterized by some other rel while its left path itself is not
parameterized at all.
Regarding the test case, I wonder if we should add another test query
to specifically test the changes in create_nestloop_plan(). In
particular, a case where the inner rel A has a parameter that is a
PHV, and that PHV's ph_eval_at includes the outer rel B and some third
rel C. The expected plan should show the PHV being added to B's
target list (such a case might be tricky to construct though).
I have mixed feelings about whether to back-patch or just make
this change in HEAD. While we're clearly fixing a bug here,
the bug's been there for 10 years, so the lack of field reports
suggests strongly that this is not something ordinary users write.
Two arguments against back-patching are that
(1) the odds of introducing a new bug aren't zero;
(2) removing the have_dangerous_phv() restriction will change
some plan choices, which we generally dislike doing in stable
branches.
Agreed. The changes in this patch seem too risky to backport to the
stable branches.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
In identify_current_nestloop_params(), I wonder if we should use the
join path's required-outer rels rather than the left path's as the
"outerrelids". It seems that there may be cases where the join path
is parameterized by some other rel while its left path itself is not
parameterized at all.
I went back and forth on that. In principle, we shouldn't put such
a calculation into the left path if it's not parameterized by the
rel supplying the required value. In practice, if the required value
is getting passed down to the join path then it's available to the
left path too, so it should work fine whether the left path is marked
like that or not. In the only test case I have that exercises this
logic, both paths are marked with the required_outer anyway, so it
doesn't seem to matter. If we can find a case where that's not so,
then probably we should do it the other way.
Regarding the test case, I wonder if we should add another test query
to specifically test the changes in create_nestloop_plan().
We already have one; there's a query in join.sql (originally added to
test the have_dangerous_phv logic) that fails without the additions
to push a PHV into the left path. Admittedly, that's behind the
scenes so maybe we ought to add an EXPLAIN to make it more apparent.
regards, tom lane
On Tue, Jun 17, 2025 at 12:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
In identify_current_nestloop_params(), I wonder if we should use the
join path's required-outer rels rather than the left path's as the
"outerrelids". It seems that there may be cases where the join path
is parameterized by some other rel while its left path itself is not
parameterized at all.
I went back and forth on that. In principle, we shouldn't put such
a calculation into the left path if it's not parameterized by the
rel supplying the required value. In practice, if the required value
is getting passed down to the join path then it's available to the
left path too, so it should work fine whether the left path is marked
like that or not. In the only test case I have that exercises this
logic, both paths are marked with the required_outer anyway, so it
doesn't seem to matter. If we can find a case where that's not so,
then probably we should do it the other way.
I tried to find such a case but didn't succeed; though I suspect
that's simply because I haven't tried hard enough. Conceptually, I'm
thinking of a query where 1) A laterally references both B and C while
B does not reference any other relation, and 2) A has a PHV parameter
whose ph_eval_at includes both B and C. In such a query, the B/A join
path is parameterized by C, while its left path B is not parameterized
at all. In order to add the PHV to B's tlist, we'll need to consider
the join path's required-outer rels not just those of the left path.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
On Tue, Jun 17, 2025 at 12:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In the only test case I have that exercises this
logic, both paths are marked with the required_outer anyway, so it
doesn't seem to matter. If we can find a case where that's not so,
then probably we should do it the other way.
I tried to find such a case but didn't succeed; though I suspect
that's simply because I haven't tried hard enough. Conceptually, I'm
thinking of a query where 1) A laterally references both B and C while
B does not reference any other relation, and 2) A has a PHV parameter
whose ph_eval_at includes both B and C. In such a query, the B/A join
path is parameterized by C, while its left path B is not parameterized
at all. In order to add the PHV to B's tlist, we'll need to consider
the join path's required-outer rels not just those of the left path.
After thinking about this for awhile, I'm not seeing how that could
happen. A PHV with ph_eval_at exceeding its syntactic scope could
only be created via something like
B left join lateral (select coalesce(B.x, ...) as Q from D) C
that is we need a non-strict targetlist expression that references
something outside the sub-select proper. In this case all paths
created for C will have required_outer mentioning B, because there's
no way to compute C's reltarget without an outer reference to B.
This might be embedded in
... join A on A.y = C.Q
which gives rise to the situation you describe --- but there's
no possibility of the planner trying to do this in the order
C join (B join A)
because it will think that all paths for C require B on the
outside. It will only consider
B join (C join A)
and the path for C will show B as required_outer.
So I'm inclined to leave that code as I had it. It's notationally
a bit simpler and it doesn't require assuming that we can ignore
the path's required_outer marking at this stage. If I'm wrong,
someone will eventually find a counterexample and we can fix it
then; the changes won't be large.
regards, tom lane
Hello Tom,
17.06.2025 19:29, Tom Lane wrote:
So I'm inclined to leave that code as I had it. It's notationally
a bit simpler and it doesn't require assuming that we can ignore
the path's required_outer marking at this stage. If I'm wrong,
someone will eventually find a counterexample and we can fix it
then; the changes won't be large.
Please look at the following (simplified version of a query generated by
SQLsmith), which produces errors after a16ef313f2:
CREATE TABLE t(b bool); ANALYZE t;
SELECT 1 FROM t
INNER JOIN (SELECT (SELECT true) AS c FROM t, LATERAL (SELECT b LIMIT 1)) ON true
RIGHT JOIN (SELECT true) ON true,
LATERAL (SELECT b WHERE c LIMIT 1)
WHERE t.b;
ERROR: XX000: unrecognized node type: 22
LOCATION: ExecInitExprRec, execExpr.c:2665
Or a slight variation:
SELECT 1 FROM t
INNER JOIN (SELECT (SELECT true) AS c FROM t, LATERAL (SELECT 1 LIMIT 1)) ON true
RIGHT JOIN (SELECT true) ON true,
LATERAL (SELECT b WHERE c LIMIT 1)
WHERE t.b;
ERROR: XX000: failed to assign all NestLoopParams to plan nodes
LOCATION: create_plan, createplan.c:372
Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
17.06.2025 19:29, Tom Lane wrote:
So I'm inclined to leave that code as I had it. It's notationally
a bit simpler and it doesn't require assuming that we can ignore
the path's required_outer marking at this stage. If I'm wrong,
someone will eventually find a counterexample and we can fix it
then; the changes won't be large.
Please look at the following (simplified version of a query generated by
SQLsmith), which produces errors after a16ef313f2:
Hah, that didn't take long! Your second case is indeed a
counterexample to my argument. We end up with a PHV having
phrels (b 7 8) which needs to be evaluated here:
{NESTPATH
:jpath.path.pathtype 356
:parent_relids (b 1 6 7)
:required_outer (b 8)
...
:jpath.outerjoinpath
{PATH
:pathtype 339
:parent_relids (b 7)
:required_outer (b)
Doing things the way Richard wanted to (ie using the nestloop's
required_outer) gets past the "failed to assign all NestLoopParams"
error, but then we get the same "unrecognized node type: 22" error
as in the first example.
That error seems considerably nastier to fix. I think it is a
pre-existing problem, though I don't currently have an explanation
why we've not seen it reported before. What is happening is that
we pull up a PHV for "c", containing a SubLink for (SELECT true),
and only some copies of it get put through preprocess_expression,
which is responsible for converting SubLinks to SubPlans among
other essential tasks. It seems that up to now we've always
managed to use only preprocessed copies in the finished plan.
But now a not-preprocessed copy is managing to get through to
the executor, which promptly spits up because it doesn't know
what to do with a SubLink.
We could probably hack this in a localized way by ensuring that
we push a preprocessed copy of the PHV into the left plan's tlist.
(One way to get that would be to look in the placeholder_list for the
correct phid.) However, now that I've seen this I have very little
faith that there aren't other bugs of the same ilk, that somehow
we've not seen up to now. I'm thinking about what we must do to
ensure that all PHVs are preprocessed once and only once, perhaps
at the moment they get put into the placeholder_list.
I don't have a patch for that yet, but attached is something that
gets rid of the "failed to assign all NestLoopParams" problem.
regards, tom lane
Attachments:
wip-fix-incorrect-NLP-placement.patchtext/x-diff; charset=us-ascii; name=wip-fix-incorrect-NLP-placement.patchDownload+18-10
On Mon, Jun 23, 2025 at 3:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Lakhin <exclusion@gmail.com> writes:
17.06.2025 19:29, Tom Lane wrote:
So I'm inclined to leave that code as I had it. It's notationally
a bit simpler and it doesn't require assuming that we can ignore
the path's required_outer marking at this stage. If I'm wrong,
someone will eventually find a counterexample and we can fix it
then; the changes won't be large.
Please look at the following (simplified version of a query generated by
SQLsmith), which produces errors after a16ef313f2:
Hah, that didn't take long! Your second case is indeed a
counterexample to my argument.
FWIW, a16ef313f also causes GEQO to encounter the "failed to assign
all NestLoopParams" problem, which can be seen in one of our
regression test queries.
set geqo_threshold to 2;
regression=# explain (verbose, costs off)
select ss2.* from
int4_tbl i41
left join int8_tbl i8
join (select i42.f1 as c1, i43.f1 as c2, 42 as c3
from int4_tbl i42, int4_tbl i43) ss1
on i8.q1 = ss1.c2
on i41.f1 = ss1.c1,
lateral (select i41.*, i8.*, ss1.* from text_tbl limit 1) ss2
where ss1.c2 = 0;
ERROR: failed to assign all NestLoopParams to plan nodes
This problem vanishes with the proposed patch.
Thanks
Richard
I wrote:
Doing things the way Richard wanted to (ie using the nestloop's
required_outer) gets past the "failed to assign all NestLoopParams"
error, but then we get the same "unrecognized node type: 22" error
as in the first example.
That error seems considerably nastier to fix. I think it is a
pre-existing problem, though I don't currently have an explanation
why we've not seen it reported before. What is happening is that
we pull up a PHV for "c", containing a SubLink for (SELECT true),
and only some copies of it get put through preprocess_expression,
which is responsible for converting SubLinks to SubPlans among
other essential tasks. It seems that up to now we've always
managed to use only preprocessed copies in the finished plan.
But now a not-preprocessed copy is managing to get through to
the executor, which promptly spits up because it doesn't know
what to do with a SubLink.
We could probably hack this in a localized way by ensuring that
we push a preprocessed copy of the PHV into the left plan's tlist.
(One way to get that would be to look in the placeholder_list for the
correct phid.) However, now that I've seen this I have very little
faith that there aren't other bugs of the same ilk, that somehow
we've not seen up to now. I'm thinking about what we must do to
ensure that all PHVs are preprocessed once and only once, perhaps
at the moment they get put into the placeholder_list.
I've traced through this in more detail, and found that it's
inaccurate to claim that the PHV's expression escaped preprocessing
altogether. (If it had, there would be more failure modes besides
"unprocessed SubLink", and we'd likely have noticed sooner.)
Rather, the problem is that SS_process_sublinks() skips replacement of
SubLinks within PlaceHolderVars that have phlevelsup > 0, expecting
that to be handled later. So the sequence of events that Alexander's
test case causes is:
1. We pull up the first sub-select "(SELECT (SELECT true) AS c FROM t,
LATERAL (SELECT b LIMIT 1))", causing the "WHERE c" in the later
LATERAL sub-select to be replaced by PHV((SELECT true)); a PHV is
needed since the pulled-up sub-select was under an outer join.
The PHV has phlevelsup = 1 since it's pushed down into a sub-select.
2. Expression pre-processing happens in the separately-planned
LATERAL sub-select, and while it does most of what's needed, it
explicitly skips replacement of the PHV's SubLink.
3. SS_replace_correlation_vars pulls the PHV out of the LATERAL
sub-select, replacing it with a Param and adding it to the
plan_params of the outer query level. Now it has phlevelsup = 0,
but there's still a SubLink inside it.
4. The code added by a16ef313f2 copies the PHV verbatim from
the plan_params list (via a subquery rel's subplan_params and
root->curOuterParams) into the outer subplan's tlist. Kaboom!
Now, if the PHV matched something that was already in the
outer subplan's tlist (recall it only needs to match by phid)
then we're safe, because that copy would have gotten fixed up
during extract_lateral_references. But in this example that
doesn't happen, probably because the PHV contains no Vars
so it doesn't get put into any baserel's reltarget.
I'm still trying to wrap my head around whether there are any
related failure modes in released branches. Steps 1-3 certainly
have been happening just like that for a long time, so that there
can be a PHV with an unexpanded SubLink floating around the
outer query level's data structures. Is the a16ef313f2 code
really the only way that that version of the PHV can make its way
into the emitted plan tree? Maybe, but I'm far from convinced.
Anyway, with one eye on the possibility that we'll need to back-patch
this in some form, I went for the localized fix of ensuring that
we use a copy of the PHV that's been through the normal processing.
I do want to look into restructuring the handling of these PHVs
to make this less messy, but that's not a job to undertake for v18
(much less if we have to back-patch).
One thing worth noting is that there's a visible shortcoming in the
generated plans for the new test cases: there are two copies of the
InitPlan that arises from the PHV's SubLink. One of them is
unreferenced and hence doesn't get used at runtime. That happens
because we are performing SS_process_sublinks on the PHV twice.
It's not fatal, but it's not great either. Perhaps a better design
could avoid that. I definitely don't want to adopt a fix that
results in still another InitPlan, so adding YA invocation of
preprocessing doesn't sound good.
regards, tom lane
Attachments:
fix-bug-18953-some-more.patchtext/x-diff; charset=us-ascii; name=fix-bug-18953-some-more.patchDownload+171-14
Hello Tom and Richard,
23.06.2025 21:24, Tom Lane wrote:
Anyway, with one eye on the possibility that we'll need to back-patch
this in some form, I went for the localized fix of ensuring that
we use a copy of the PHV that's been through the normal processing.
I do want to look into restructuring the handling of these PHVs
to make this less messy, but that's not a job to undertake for v18
(much less if we have to back-patch).
I've managed to discover one more anomaly introduced by a16ef313f, that is
not fixed by fix-bug-18953-some-more:
CREATE TABLE t(i int PRIMARY KEY);
MERGE INTO t
USING
(SELECT 1 AS j FROM generate_series(1, 1))
RIGHT JOIN (SELECT 1) ON true
LEFT JOIN (SELECT 1 FROM (SELECT 1 FROM generate_series(1, 1))) ON false
ON i = j
WHEN NOT MATCHED THEN INSERT VALUES (1);
ERROR: XX000: wrong phnullingrels (b) (expected (b 4)) for PlaceHolderVar 1
LOCATION: search_indexed_tlist_for_phv, setrefs.c:2958
Thank you for spending your time on this!
Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
I've managed to discover one more anomaly introduced by a16ef313f, that is
not fixed by fix-bug-18953-some-more:
Just to note that I am studying this. It looks to me like the
issue is that identify_current_nestloop_params() is handing back
a PHV with too few nullingrel bits set for the place that we want
to put it, as is acknowledged to be possible in its comments.
We thought we could get away with that, but in this context setrefs.c
will complain. I'm inclined to try to make it set the bits more
accurately, rather than further weaken setrefs.c's cross-checks.
I'm wondering again whether this isn't a case that is reachable
before a16ef313f. Perhaps the have_dangerous_phv check prevented
forming plan trees that could have this issue, but it's not real
clear why.
regards, tom lane
On Tue, Jun 24, 2025 at 3:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now, if the PHV matched something that was already in the
outer subplan's tlist (recall it only needs to match by phid)
then we're safe, because that copy would have gotten fixed up
during extract_lateral_references. But in this example that
doesn't happen, probably because the PHV contains no Vars
so it doesn't get put into any baserel's reltarget.
I think the reason the PHV isn't already included in the outer
subplan's tlist is that it's supposed to be evaluated at (7 8), while
the outer rel includes only (7). If the planner chooses a join order
where (7 8) are joined first, the PHV would appear in that join rel's
tlist -- but that's not the case here.
I'm still trying to wrap my head around whether there are any
related failure modes in released branches. Steps 1-3 certainly
have been happening just like that for a long time, so that there
can be a PHV with an unexpanded SubLink floating around the
outer query level's data structures. Is the a16ef313f2 code
really the only way that that version of the PHV can make its way
into the emitted plan tree? Maybe, but I'm far from convinced.
Prior to a16ef313f, have_dangerous_phv() ensures that we don't choose
a join order where the outer rel includes only (7). Instead, the
planner will always join (7 8) first, allowing the well-preprocessed
PHVs to be included in the tlist of that join rel. The unpreprocessed
PHVs in NestLoop.nestParams don't cause problems because they can
always be replaced with OUTER_VAR vars by setrefs.c. So it seems to
me that we're good in released branches, but I could be wrong.
Thanks
Richard
On Fri, Jun 27, 2025 at 3:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Lakhin <exclusion@gmail.com> writes:
I've managed to discover one more anomaly introduced by a16ef313f, that is
not fixed by fix-bug-18953-some-more:
Just to note that I am studying this. It looks to me like the
issue is that identify_current_nestloop_params() is handing back
a PHV with too few nullingrel bits set for the place that we want
to put it, as is acknowledged to be possible in its comments.
We thought we could get away with that, but in this context setrefs.c
will complain. I'm inclined to try to make it set the bits more
accurately, rather than further weaken setrefs.c's cross-checks.I'm wondering again whether this isn't a case that is reachable
before a16ef313f. Perhaps the have_dangerous_phv check prevented
forming plan trees that could have this issue, but it's not real
clear why.
I can reproduce this error with the following query.
select * from t t1
left join (select 1 as x, * from t t2) t2 on t1.i = t2.i
left join t t3 on false
left join t t4 on t4.i > t2.x;
ERROR: wrong phnullingrels (b) (expected (b 3)) for PlaceHolderVar 1
There are two versions of the join clause "t4.i > t2.x" due to
outer-join identity 3: one involving PHV(t2.x) with an empty
phnullingrels, and another where PHV(t2.x) is nulled by the t1/t2
join.
When creating index paths for t4, we need to identify its join clauses
that match the index. The join_clause_is_movable_to() check there
rejects is_clone versions, assuming that "generating one path from the
minimally-parameterized has_clone version is sufficient". As a
result, only the version with empty phnullingrels appears in the t4's
scan_clauses (via its ParamPathInfo.ppi_clauses). That version of PHV
eventually gets pulled into nestParams by replace_nestloop_params().
Prior to a16ef313f, this does not cause problems because we are aware
that Vars/PHVs seen in NestLoopParams may have nullingrels that are
just a subset of those in the Vars/PHVs actually available from the
outer side. Accordingly, when fixing up NestLoopParams, we only
require a NRM_SUBSET match.
Starting from a16ef313f, create_nestloop_plan() might insert PHVs from
NestLoopParams into the outer subplan's tlist if they are not already
present there. To check for presence, it uses equal(), which compares
phnullingrels as well. So in this case, the PHV with empty
phnullingrels is added to the t1/t2/t3 join's tlist, even though a
semantically equivalent one already exists there with a different
phnullingrels. This causes setrefs.c to complain when fixing up the
tlist for the t1/t2/t3 join, because it requires a NRM_SUPERSET match.
I think we should avoid adding the PHV to the outer subplan's tlist if
there is already an equivalent one there. Ideally, we should fix this
by making the PHVs in the NestLoopParam expressions have accurate
nullingrels, but that doesn't seem like a trivial task. As a band-aid
fix, I think we could match on phid only in create_nestloop_plan()
when checking whether the PHV is already available, like below.
- if (tlist_member((Expr *) phv, outer_tlist))
+ foreach(lc2, outer_tlist)
+ {
+ TargetEntry *tlentry = (TargetEntry *) lfirst(lc2);
+ PlaceHolderVar *subphv = (PlaceHolderVar *) tlentry->expr;
+
+ if (IsA(subphv, PlaceHolderVar) &&
+ phv->phid == subphv->phid)
+ break;
+ }
+ if (lc2 != NULL)
continue; /* already available */
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
I think we should avoid adding the PHV to the outer subplan's tlist if
there is already an equivalent one there. Ideally, we should fix this
by making the PHVs in the NestLoopParam expressions have accurate
nullingrels, but that doesn't seem like a trivial task. As a band-aid
fix, I think we could match on phid only in create_nestloop_plan()
when checking whether the PHV is already available, like below.
I don't love this solution, because it's not clear that it'd avoid
a failure if there *isn't* an equivalent PHV already there.
Also it seems like it's piling another level of band-aid on what's
already a mess.
I experimented with making identify_current_nestloop_params compute
the correct nullingrels "from scratch", and that seems to resolve
the problem. Now this is arguably not an ideal answer, because the
point of the setrefs.c cross-checks is to verify that we placed
Vars and PHVs at safe places to calculate correctly-nulled values,
and we're basically defeating those checks. But this feels like
the least crufty answer, and it's also confining the cruft to the
right place: if we can think of a more principled answer for
computing the nestloop params' nullingrels, we just have to fix
identify_current_nestloop_params to do that instead.
BTW, having done this it seems like we could get rid of the use
of NRM_SUBSET and instead use NRM_EQUAL for processing nestloop
params in setrefs.c. I tried that, and it gets through check-world,
but I'm too chicken to risk such a change in v18 at this stage.
I propose applying the attached for now and then removing NRM_SUBSET
when v19 opens for business.
For ease of review, 0001 attached is the same patch as before and
then 0002 is the new stuff. I'd squash them to one patch for
commit, though.
regards, tom lane
... BTW, looking at these two patches again, I wonder if it'd be
better to put the hackery for SubLink conversion into
identify_current_nestloop_params? That'd remove the need to argue
that we don't have to fix both copies of a nestloop-parameter PHV.
The responsibility doesn't clearly belong to
identify_current_nestloop_params, but it doesn't clearly belong
to create_nestloop_plan either, so neither choice seems superior
from a modularity standpoint.
regards, tom lane
Hello Tom and Richard,
28.06.2025 00:37, Tom Lane wrote:
... BTW, looking at these two patches again, I wonder if it'd be
better to put the hackery for SubLink conversion into
identify_current_nestloop_params? That'd remove the need to argue
that we don't have to fix both copies of a nestloop-parameter PHV.
The responsibility doesn't clearly belong to
identify_current_nestloop_params, but it doesn't clearly belong
to create_nestloop_plan either, so neither choice seems superior
from a modularity standpoint.
I've discovered a query which fails with an error after a16ef313f with
both v2 patches applied:
create temp table t(i int primary key);
select * from
(select k from
(select i, coalesce(i, j) as k from
(select i from t union all select 1)
join (select 1 as j limit 1) on i = j)
right join (select 1) on true
join (select 1) on i is not null
),
lateral (select k offset 1);
ERROR: XX000: variable not found in subplan target list
LOCATION: fix_upper_expr_mutator, setrefs.c:3310
Thank you for working on this!
Best regards,
Alexander