Wrong query results caused by loss of join quals
I came across $subject on HEAD and here is the query I'm using.
create table t1 (a int, b int);
create table t2 (a int, b int);
create table t3 (a int, b int);
insert into t1 values (1, 1);
insert into t2 values (2, 200);
insert into t3 values (3, 3);
# select * from t1 left join t2 on true, lateral (select * from t3 where
t2.a = t2.b) ss;
a | b | a | b | a | b
---+---+---+-----+---+---
1 | 1 | 2 | 200 | 3 | 3
(1 row)
# explain (costs off) select * from t1 left join t2 on true, lateral
(select * from t3 where t2.a = t2.b) ss;
QUERY PLAN
----------------------------------
Nested Loop
-> Nested Loop Left Join
-> Seq Scan on t1
-> Materialize
-> Seq Scan on t2
-> Materialize
-> Seq Scan on t3
(7 rows)
As we can see, the join qual 't2.a = t2.b' disappears in the plan, and
that results in the wrong query results.
I did some dig and here is what happened. Firstly both sides of qual
't2.a = t2.b' could be nulled by the OJ t1/t2 and they are marked so in
their varnullingrels. Then we decide that this qual can form a EC, and
the EC's ec_relids is marked as {t2, t1/t2}. Note that t1 is not
included in this ec_relids. So when it comes to building joinrel for
t1/t2, generate_join_implied_equalities fails to generate the join qual
from that EC.
I'm not sure how to fix this problem yet. I'm considering that while
composing eclass_indexes for each base rel, when we come across an
ojrelid in ec->ec_relids, can we instead mark the base rels in the OJ's
min_lefthand/min_righthand that they are 'mentioned' in this EC?
Something like the TODO says.
i = -1;
while ((i = bms_next_member(ec->ec_relids, i)) > 0)
{
RelOptInfo *rel = root->simple_rel_array[i];
if (rel == NULL) /* must be an outer join */
{
Assert(bms_is_member(i, root->outer_join_rels));
+ /*
+ * TODO Mark the base rels in the OJ's min_xxxhand that they
+ * are 'mentioned' in this EC.
+ */
continue;
}
Assert(rel->reloptkind == RELOPT_BASEREL);
rel->eclass_indexes = bms_add_member(rel->eclass_indexes,
ec_index);
if (can_generate_joinclause)
rel->has_eclass_joins = true;
}
Or maybe we can just expand ec->ec_relids to include OJ's min_xxxhand
when we form a new EC?
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
... As we can see, the join qual 't2.a = t2.b' disappears in the plan, and
that results in the wrong query results.
Ugh.
I did some dig and here is what happened. Firstly both sides of qual
't2.a = t2.b' could be nulled by the OJ t1/t2 and they are marked so in
their varnullingrels. Then we decide that this qual can form a EC, and
the EC's ec_relids is marked as {t2, t1/t2}. Note that t1 is not
included in this ec_relids. So when it comes to building joinrel for
t1/t2, generate_join_implied_equalities fails to generate the join qual
from that EC.
Hmm. My intention for this sort of case was that the nulled Vars should
look like "new_members" to generate_join_implied_equalities_normal,
since they are computable at the join node (in filter not join quals)
but not computable within either input. Then it would generate the
necessary quals to equate them to each other. The reason that that
doesn't happen is that get_common_eclass_indexes believes it can ignore
ECs that don't mention t1. The attached quick hack is enough to fix
the presented case, but:
* I suspect the other use of get_common_eclass_indexes, in
have_relevant_eclass_joinclause, is broken as well.
* This fix throws away a fair bit of the optimization intended by
3373c7155, since it will result in examining some irrelevant ECs.
I'm not sure if it's worth complicating get_common_eclass_indexes
to try to recover that by adding knowledge about outer joins.
* I'm now kind of wondering whether there are pre-existing bugs of the
same ilk. Maybe not, because before 2489d76c4 an EC constraint that was
computable at the join but not earlier would have to have mentioned both
sides of the join ... but I'm not quite sure.
BTW, while looking at this I saw that generate_join_implied_equalities'
calculation of nominal_join_relids is wrong for child rels, because it
fails to fold the join relid into that if appropriate. In cases similar
to this one, that could result in generate_join_implied_equalities_broken
doing the same sort of wrong thing, that is rejecting quals it should
have enforced at the join. I don't think the use of nominal_join_relids
added by this patch is affected, though: a Var mentioning the outer join
in varnullingrels would have to have some member of the RHS' base rels
in varno, and I think a parallel statement can be made about PHVs.
regards, tom lane
Attachments:
quick-hack-for-missing-join-quals.patchtext/x-diff; charset=us-ascii; name=quick-hack-for-missing-join-quals.patchDownload+2-4
On Mon, Feb 20, 2023 at 4:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
* I suspect the other use of get_common_eclass_indexes, in
have_relevant_eclass_joinclause, is broken as well.
It seems have_relevant_joinclause is broken for the presented case. It
does not get a change to call have_relevant_eclass_joinclause, because
flag 'has_eclass_joins' is not set for t1 due to t1 being not in the
EC's ec_relids. As a result, have_relevant_joinclause thinks there is
no joinclause that involves t1 and t2, which is not right.
* This fix throws away a fair bit of the optimization intended by
3373c7155, since it will result in examining some irrelevant ECs.
I'm not sure if it's worth complicating get_common_eclass_indexes
to try to recover that by adding knowledge about outer joins.
Yeah, this is also my concern that we'd lose some optimization about
finding ECs.
* I'm now kind of wondering whether there are pre-existing bugs of the
same ilk. Maybe not, because before 2489d76c4 an EC constraint that was
computable at the join but not earlier would have to have mentioned both
sides of the join ... but I'm not quite sure.
I also think there is no problem before, because if a clause was
computable at the join but not earlier and only mentioned one side of
the join, then it was a non-degenerate outer join qual or an
outerjoin_delayed qual, and cannot enter into EC.
BTW, while looking at this I saw that generate_join_implied_equalities'
calculation of nominal_join_relids is wrong for child rels, because it
fails to fold the join relid into that if appropriate.
I dug a little into this and it seems this is all right as-is. Among
all the calls of generate_join_implied_equalities, it seems only
build_joinrel_restrictlist would have outer join's ojrelid in param
'join_relids'. And build_joinrel_restrictlist does not get called for
child rels. The restrictlist of a child rel is constructed from that of
its parent rel.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
On Mon, Feb 20, 2023 at 4:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
* I suspect the other use of get_common_eclass_indexes, in
have_relevant_eclass_joinclause, is broken as well.
It seems have_relevant_joinclause is broken for the presented case. It
does not get a change to call have_relevant_eclass_joinclause, because
flag 'has_eclass_joins' is not set for t1 due to t1 being not in the
EC's ec_relids. As a result, have_relevant_joinclause thinks there is
no joinclause that involves t1 and t2, which is not right.
I thought about this and decided that it's not really a problem.
have_relevant_joinclause is just a heuristic, and I don't think we
need to prioritize forming a join if the only relevant clauses look
like this. We won't be able to use such clauses for merge or hash,
so we're going to end up with an unconstrained nestloop, which isn't
something to be eager to form. The join ordering rules will take
care of forcing us to make the join when necessary.
* This fix throws away a fair bit of the optimization intended by
3373c7155, since it will result in examining some irrelevant ECs.
Yeah, this is also my concern that we'd lose some optimization about
finding ECs.
The only easy improvement I can see to make here is to apply the old
rules at inner joins. Maybe it's worth complicating the data structures
to be smarter at outer joins, but I rather doubt it: we could easily
expend more overhead than we'll save here by examining irrelevant ECs.
In any case, if there is a useful optimization here, it can be pursued
later.
BTW, while looking at this I saw that generate_join_implied_equalities'
calculation of nominal_join_relids is wrong for child rels, because it
fails to fold the join relid into that if appropriate.
I dug a little into this and it seems this is all right as-is.
I changed it anyway after noting that (a) passing in the ojrelid is
needful to be able to distinguish inner and outer joins, and
(b) the existing comment about the join_relids input is now wrong.
Even if it happens to not be borked for current callers, that seems
like a mighty fragile assumption.
Less-hasty v2 patch attached.
regards, tom lane
Attachments:
fix-for-missing-join-quals-2.patchtext/x-diff; charset=us-ascii; name=fix-for-missing-join-quals-2.patchDownload+105-18
On Thu, Feb 23, 2023 at 4:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I thought about this and decided that it's not really a problem.
have_relevant_joinclause is just a heuristic, and I don't think we
need to prioritize forming a join if the only relevant clauses look
like this. We won't be able to use such clauses for merge or hash,
so we're going to end up with an unconstrained nestloop, which isn't
something to be eager to form. The join ordering rules will take
care of forcing us to make the join when necessary.
Agreed. And as I tried, in lots of cases joins with such clauses would
be accepted by have_join_order_restriction(), which always appears with
have_relevant_joinclause().
The only easy improvement I can see to make here is to apply the old
rules at inner joins. Maybe it's worth complicating the data structures
to be smarter at outer joins, but I rather doubt it: we could easily
expend more overhead than we'll save here by examining irrelevant ECs.
In any case, if there is a useful optimization here, it can be pursued
later.
This makes sense.
I changed it anyway after noting that (a) passing in the ojrelid is
needful to be able to distinguish inner and outer joins, and
(b) the existing comment about the join_relids input is now wrong.
Even if it happens to not be borked for current callers, that seems
like a mighty fragile assumption.
Agreed. This is reasonable.
Less-hasty v2 patch attached.
I think the patch is in good shape now.
Thanks
Richard