Assert failure of the cross-check for nullingrels

Started by Richard Guoabout 3 years ago21 messageshackers
Jump to latest
#1Richard Guo
guofenglinux@gmail.com

While looking into issue [1]/messages/by-id/0b819232-4b50-f245-1c7d-c8c61bf41827@postgrespro.ru, I came across $subject on master. Below
is how to reproduce it.

DROP TABLE IF EXISTS t1,t2,t3,t4 CASCADE;
CREATE TABLE t1 AS SELECT true AS x FROM generate_series(0,1) x;
CREATE TABLE t2 AS SELECT true AS x FROM generate_series(0,1) x;
CREATE TABLE t3 AS SELECT true AS x FROM generate_series(0,1) x;
CREATE TABLE t4 AS SELECT true AS x FROM generate_series(0,1) x;
ANALYZE;

explain (costs off)
select * from t1 left join (t2 left join t3 on t2.x) on t2.x left join t4
on t3.x and t2.x where t1.x = coalesce(t2.x,true);

I've looked into this a little bit. For the join of t2/t3 to t4, since
it can commute with the join of t1 to t2/t3 according to identity 3, we
would generate multiple versions for its joinquals. In particular, the
qual 't3.x' would have two versions, one with varnullingrels as {t2/t3,
t1/t2}, the other one with varnullingrels as {t2/t3}. So far so good.

Assume we've determined to build the join of t2/t3 to t4 after we've
built t1/t2 and t2/t3, then we'd find that both versions of qual 't3.x'
would be accepted by clause_is_computable_at. This is not correct. We
are supposed to accept only the one marked as {t2/t3, t1/t2}. The other
one is not rejected mainly because we found that the qual 't3.x' does
not mention any nullable Vars of outer join t1/t2.

I wonder if we should consider syn_xxxhand rather than min_xxxhand in
clause_is_computable_at when we check if clause mentions any nullable
Vars. But I'm not sure about that.

[1]: /messages/by-id/0b819232-4b50-f245-1c7d-c8c61bf41827@postgrespro.ru
/messages/by-id/0b819232-4b50-f245-1c7d-c8c61bf41827@postgrespro.ru

Thanks
Richard

#2Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#1)
Re: Assert failure of the cross-check for nullingrels

On Fri, Mar 10, 2023 at 4:13 PM Richard Guo <guofenglinux@gmail.com> wrote:

I wonder if we should consider syn_xxxhand rather than min_xxxhand in
clause_is_computable_at when we check if clause mentions any nullable
Vars. But I'm not sure about that.

No, considering syn_xxxhand is not right. After some join order
commutation we may form the join with only its min_lefthand and
min_righthand. In this case if we check against syn_xxxhand rather than
min_xxxhand in clause_is_computable_at, we may end up with being unable
to find a proper place for some quals. I can see this problem in below
query.

select * from t1 left join ((select t2.x from t2 left join t3 on t2.x where
t3.x is null) s left join t4 on s.x) on s.x = t1.x;

Suppose we've formed join t1/t2 and go ahead to form the join of t1/t2
to t3. If we consider t1/t2 join's syn_xxxhand, then the pushed down
qual 't3.x is null' would not be computable at this level because it
mentions nullable Vars from t1/t2 join's syn_righthand and meanwhile is
not marked with t1/t2 join. This is not correct and would trigger an
Assert.

Back to the original issue, if a join has more than one quals, actually
we treat them as a whole when we check if identity 3 applies as well as
when we adjust them to be suitable for commutation according to identity
3. So when we check if a qual is computable at a given level, I think
we should also consider the join's quals as a whole. I'm thinking that
we use a 'group' notion for RestrictInfos and then use the clause_relids
of the 'group' in clause_is_computable_at. Does this make sense?

Thanks
Richard

#3Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#2)
Re: Assert failure of the cross-check for nullingrels

On Mon, Mar 13, 2023 at 5:03 PM Richard Guo <guofenglinux@gmail.com> wrote:

Back to the original issue, if a join has more than one quals, actually
we treat them as a whole when we check if identity 3 applies as well as
when we adjust them to be suitable for commutation according to identity
3. So when we check if a qual is computable at a given level, I think
we should also consider the join's quals as a whole. I'm thinking that
we use a 'group' notion for RestrictInfos and then use the clause_relids
of the 'group' in clause_is_computable_at. Does this make sense?

I'm imagining something like attached (no comments and test cases yet).

Thanks
Richard

Attachments:

v1-0001-Draft-group-RestrictInfos.patchapplication/octet-stream; name=v1-0001-Draft-group-RestrictInfos.patchDownload+29-2
#4Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#3)
Re: Assert failure of the cross-check for nullingrels

On Mon, Mar 13, 2023 at 5:44 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Mon, Mar 13, 2023 at 5:03 PM Richard Guo <guofenglinux@gmail.com>
wrote:

Back to the original issue, if a join has more than one quals, actually
we treat them as a whole when we check if identity 3 applies as well as
when we adjust them to be suitable for commutation according to identity
3. So when we check if a qual is computable at a given level, I think
we should also consider the join's quals as a whole. I'm thinking that
we use a 'group' notion for RestrictInfos and then use the clause_relids
of the 'group' in clause_is_computable_at. Does this make sense?

I'm imagining something like attached (no comments and test cases yet).

Here is an updated patch with comments and test case. I also change the
code to store 'group_clause_relids' directly in RestrictInfo.

Thanks
Richard

Attachments:

v2-0001-Check-group_clause_relids-to-see-if-a-clause-is-computable.patchapplication/octet-stream; name=v2-0001-Check-group_clause_relids-to-see-if-a-clause-is-computable.patchDownload+61-2
#5Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#4)
Re: Assert failure of the cross-check for nullingrels

On Fri, Mar 17, 2023 at 11:05 AM Richard Guo <guofenglinux@gmail.com> wrote:

Here is an updated patch with comments and test case. I also change the
code to store 'group_clause_relids' directly in RestrictInfo.

BTW, I've added an open item for this issue.

Thanks
Richard

#6Jonathan S. Katz
jkatz@postgresql.org
In reply to: Richard Guo (#5)
Re: Assert failure of the cross-check for nullingrels

On 5/12/23 3:02 AM, Richard Guo wrote:

On Fri, Mar 17, 2023 at 11:05 AM Richard Guo <guofenglinux@gmail.com
<mailto:guofenglinux@gmail.com>> wrote:

Here is an updated patch with comments and test case.  I also change the
code to store 'group_clause_relids' directly in RestrictInfo.

BTW, I've added an open item for this issue.

[RMT hat]

Is there a specific commit targeted for v16 that introduced this issue?
Does it only affect v16 or does it affect backbranches?

Thanks,

Jonathan

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#6)
Re: Assert failure of the cross-check for nullingrels

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

Is there a specific commit targeted for v16 that introduced this issue?
Does it only affect v16 or does it affect backbranches?

It's part of the outer-join-aware-Vars stuff, so it's my fault ...
and v16 only.

regards, tom lane

#8Jonathan S. Katz
jkatz@postgresql.org
In reply to: Tom Lane (#7)
Re: Assert failure of the cross-check for nullingrels

On 5/16/23 9:49 AM, Tom Lane wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

Is there a specific commit targeted for v16 that introduced this issue?
Does it only affect v16 or does it affect backbranches?

It's part of the outer-join-aware-Vars stuff, so it's my fault ...
and v16 only.

*nods* thanks. I updated the Open Items page accordingly (doing RMT
housecleaning today in advance of Beta 1).

Jonathan

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#4)
Re: Assert failure of the cross-check for nullingrels

Richard Guo <guofenglinux@gmail.com> writes:

Here is an updated patch with comments and test case. I also change the
code to store 'group_clause_relids' directly in RestrictInfo.

Hmm ... I don't like this patch terribly much. It's not obvious why
(or if) it works, and it's squirreling bits of semantic knowledge
into places they don't belong. ISTM the fundamental problem is that
clause_is_computable_at() is accepting clauses it shouldn't, and we
should try to solve it there.

After some poking at it I hit on what seems like a really simple
solution: we should be checking syn_righthand not min_righthand
to see whether a Var should be considered nullable by a given OJ.
Maybe that's still not quite right, but it seems like it might be
right given that the last fix reaffirmed our conviction that Vars
should be marked according to the syntactic structure.

If we don't want to do it like this, another way is to consider
the transitive closure of commutable outer joins, along similar
lines to your fixes to my earlier patch. But that seems like it
might just be adding complication.

regards, tom lane

Attachments:

v3-0001-fix-clause_is_computable_at.patchtext/x-diff; charset=us-ascii; name=v3-0001-fix-clause_is_computable_at.patchDownload+30-2
#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: Assert failure of the cross-check for nullingrels

... BTW, something I'd considered in an earlier attempt at fixing this
was to change clause_is_computable_at's API to pass the clause's
RestrictInfo not just the clause_relids, along the lines of

@@ -541,9 +547,10 @@ extract_actual_join_clauses(List *restrictinfo_list,
  */
 bool
 clause_is_computable_at(PlannerInfo *root,
-                        Relids clause_relids,
+                        RestrictInfo *rinfo,
                         Relids eval_relids)
 {
+    Relids        clause_relids = rinfo->clause_relids;
     ListCell   *lc;

/* Nothing to do if no outer joins have been performed yet. */

with corresponding simplifications at the call sites. That was with
a view to examining has_clone/is_clone inside this function. My
current proposal doesn't require that, but I'm somewhat tempted
to make this API change anyway for future-proofing purposes.
Thoughts?

regards, tom lane

#11Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#9)
Re: Assert failure of the cross-check for nullingrels

On Thu, May 18, 2023 at 3:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

After some poking at it I hit on what seems like a really simple
solution: we should be checking syn_righthand not min_righthand
to see whether a Var should be considered nullable by a given OJ.
Maybe that's still not quite right, but it seems like it might be
right given that the last fix reaffirmed our conviction that Vars
should be marked according to the syntactic structure.

I thought about this solution before but proved it was not right in
/messages/by-id/CAMbWs48fObJJ=YVb4ip8tnwxwixUNKUThfnA1eGfPzJxJRRgZQ@mail.gmail.com

I checked the query shown there and it still fails with v3 patch.

explain (costs off)
select * from t1
left join (select t2.x from t2
left join t3 on t2.x where t3.x is null) s
left join t4 on s.x
on s.x = t1.x;
server closed the connection unexpectedly

The failure happens when we are forming the join of (t1/t2) to t3.
Consider qual 't3.x is null'. It's a non-clone filter clause so
clause_is_computable_at is supposed to think it's applicable here. We
have an Assert for that. However, when checking outer join t1/t2, which
has been performed but is not listed in the qual's nullingrels,
clause_is_computable_at would think it'd null vars of the qual if we
check syn_righthand not min_righthand, and get a conclusion that the
qual is not applicable here. This is how the Assert is triggered.

Thanks
Richard

#12Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#10)
Re: Assert failure of the cross-check for nullingrels

On Thu, May 18, 2023 at 3:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

... BTW, something I'd considered in an earlier attempt at fixing this
was to change clause_is_computable_at's API to pass the clause's
RestrictInfo not just the clause_relids, along the lines of

@@ -541,9 +547,10 @@ extract_actual_join_clauses(List *restrictinfo_list,
*/
bool
clause_is_computable_at(PlannerInfo *root,
-                        Relids clause_relids,
+                        RestrictInfo *rinfo,
Relids eval_relids)
{
+    Relids        clause_relids = rinfo->clause_relids;
ListCell   *lc;

/* Nothing to do if no outer joins have been performed yet. */

with corresponding simplifications at the call sites. That was with
a view to examining has_clone/is_clone inside this function. My
current proposal doesn't require that, but I'm somewhat tempted
to make this API change anyway for future-proofing purposes.
Thoughts?

This change looks good to me.

Thanks
Richard

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#12)
Re: Assert failure of the cross-check for nullingrels

Richard Guo <guofenglinux@gmail.com> writes:

On Thu, May 18, 2023 at 3:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

... BTW, something I'd considered in an earlier attempt at fixing this
was to change clause_is_computable_at's API to pass the clause's
RestrictInfo not just the clause_relids, along the lines of

This change looks good to me.

Did that part.

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#11)
Re: Assert failure of the cross-check for nullingrels

Richard Guo <guofenglinux@gmail.com> writes:

On Thu, May 18, 2023 at 3:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

After some poking at it I hit on what seems like a really simple
solution: we should be checking syn_righthand not min_righthand
to see whether a Var should be considered nullable by a given OJ.

I thought about this solution before but proved it was not right in
/messages/by-id/CAMbWs48fObJJ=YVb4ip8tnwxwixUNKUThfnA1eGfPzJxJRRgZQ@mail.gmail.com
I checked the query shown there and it still fails with v3 patch.

Bleah. The other solution I'd been poking at involved adding an
extra check for clone clauses, as attached (note this requires
8a2523ff3). This survives your example, but I wonder if it might
reject all the clones in some cases. It seems a bit expensive
too, although as I said before, I don't think the clone cases get
traversed all that often.

Perhaps another answer could be to compare against syn_righthand
for clone clauses and min_righthand for non-clones? That seems
mighty unprincipled though.

regards, tom lane

Attachments:

v4-0001-fix-clause_is_computable_at.patchtext/x-diff; charset=us-ascii; name=v4-0001-fix-clause_is_computable_at.patchDownload+87-1
#15Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#14)
Re: Assert failure of the cross-check for nullingrels

On Fri, May 19, 2023 at 12:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bleah. The other solution I'd been poking at involved adding an
extra check for clone clauses, as attached (note this requires
8a2523ff3). This survives your example, but I wonder if it might
reject all the clones in some cases. It seems a bit expensive
too, although as I said before, I don't think the clone cases get
traversed all that often.

I tried with v4 patch and find that, as you predicted, it might reject
all the clones in some cases. Check the query below

explain (costs off)
select * from t t1
left join t t2 on t1.a = t2.a
left join t t3 on t2.a = t3.a
left join t t4 on t3.a = t4.a and t2.b = t4.b;
QUERY PLAN
------------------------------------------
Hash Left Join
Hash Cond: (t2.b = t4.b)
-> Hash Left Join
Hash Cond: (t2.a = t3.a)
-> Hash Left Join
Hash Cond: (t1.a = t2.a)
-> Seq Scan on t t1
-> Hash
-> Seq Scan on t t2
-> Hash
-> Seq Scan on t t3
-> Hash
-> Seq Scan on t t4
(13 rows)

So the qual 't3.a = t4.a' is missing in this plan shape.

Perhaps another answer could be to compare against syn_righthand
for clone clauses and min_righthand for non-clones? That seems
mighty unprincipled though.

I also checked this solution with the same query.

explain (costs off)
select * from t t1
left join t t2 on t1.a = t2.a
left join t t3 on t2.a = t3.a
left join t t4 on t3.a = t4.a and t2.b = t4.b;
QUERY PLAN
------------------------------------------------------------------
Hash Left Join
Hash Cond: ((t3.a = t4.a) AND (t3.a = t4.a) AND (t2.b = t4.b))
-> Hash Left Join
Hash Cond: (t2.a = t3.a)
-> Hash Left Join
Hash Cond: (t1.a = t2.a)
-> Seq Scan on t t1
-> Hash
-> Seq Scan on t t2
-> Hash
-> Seq Scan on t t3
-> Hash
-> Seq Scan on t t4
(13 rows)

This time the qual 't3.a = t4.a' is back, but twice.

I keep thinking about my proposal in v2 patch. It seems more natural to
me to fix this issue, because an outer join's quals are always treated
as a whole when we check if identity 3 applies in make_outerjoininfo, as
well as when we adjust the outer join's quals for commutation in
deconstruct_distribute_oj_quals. So when it comes to check if quals are
computable at a join level, they should be still treated as a whole.
This should have the same effect regarding qual placement if the quals
of an outer join are in form of 'qual1 OR qual2 OR ...' rather than
'qual1 AND qual2 AND ...'.

Thanks
Richard

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#15)
Re: Assert failure of the cross-check for nullingrels

Richard Guo <guofenglinux@gmail.com> writes:

I keep thinking about my proposal in v2 patch. It seems more natural to
me to fix this issue, because an outer join's quals are always treated
as a whole when we check if identity 3 applies in make_outerjoininfo, as
well as when we adjust the outer join's quals for commutation in
deconstruct_distribute_oj_quals.

No, I doubt that that patch works properly. If the join condition
contains independent quals on different relations, say

select ... from t1 left join t2 on (t1.a = 1 and t2.b = 2)

then it may be that those quals need to be pushed to different levels.
I don't believe that considering the union of the rels mentioned in
any qual is a reasonable thing to do here.

regards, tom lane

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#15)
Re: Assert failure of the cross-check for nullingrels

Richard Guo <guofenglinux@gmail.com> writes:

I tried with v4 patch and find that, as you predicted, it might reject
all the clones in some cases. Check the query below
...
So the qual 't3.a = t4.a' is missing in this plan shape.

I poked into that more closely and realized that the reason that
clause_is_computable_at() misbehaves is that both clones of the
"t3.a = t4.a" qual have the same clause_relids: (4 5 6) which is
t3, the left join to t3, and t4. This is unsurprising because
the difference in these clones is whether they are expected to be
evaluated above or below outer join 3 (the left join to t2), and
t2 doesn't appear in the qual. (It does appear in "t2.b = t4.b",
which is why there's no similar misbehavior for that qual.)

If they have the same clause_relids, then clearly in its current
form clause_is_computable_at() cannot distinguish them. So what
I now think we should do is have clause_is_computable_at() examine
their required_relids instead. Those will be different, by
construction in deconstruct_distribute_oj_quals(), ensuring that
we select only one of the group of clones.

BTW, while I've not tried it, I suspect your v2 patch also fails
on this example for the same reason: it cannot distinguish the
clones of this qual.

regards, tom lane

Attachments:

v5-0001-fix-clause_is_computable_at.patchtext/x-diff; charset=us-ascii; name=v5-0001-fix-clause_is_computable_at.patchDownload+100-1
#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: Assert failure of the cross-check for nullingrels

I wrote:

If they have the same clause_relids, then clearly in its current
form clause_is_computable_at() cannot distinguish them. So what
I now think we should do is have clause_is_computable_at() examine
their required_relids instead. Those will be different, by
construction in deconstruct_distribute_oj_quals(), ensuring that
we select only one of the group of clones.

Since we're hard up against the beta1 wrap deadline, I went ahead
and pushed the v5 patch. I doubt that it's perfect yet, but it's
a small change and demonstrably fixes the cases we know about.

As I said in the commit message, the main knock I'd lay on v5
is "why not use required_relids all the time?". I tried to do
that and soon found that the answer is that we're not maintaining
required_relids very accurately. I found two causes so far:

1. equivclass.c sometimes generates placeholder constant-true
join clauses, and it's being sloppy about that. It copies
the required_relids of the original clause, but fails to copy
is_pushed_down, making the clause look like it's been assigned
to the wrong side of the join-clause-vs-filter-clause divide.
I found that we need to copy has_clone/is_clone as well. The
attached quick-hack patch avoids the bugs, but now I feel like
it was a mistake to not add has_clone/is_clone as full-fledged
arguments of make_restrictinfo. I'm inclined to change that,
but not right before beta1 when we have no evidence of a reachable
bug. (Mind you, there might *be* a reachable bug, but ...)

2. When distribute_qual_to_rels forces a qual up to a particular
syntactic level, it applies a relid set that very possibly refers
to rels the clause doesn't actually depend on. This is problematic
because if the clause gets postponed to above some outer join that
nulls those rels, then it looks like it's being evaluated in an
unsafe location. I think that when we detect commutability of two
outer joins, we need to adjust the relevant min_xxxhand sets more
thoroughly than we do now. I've not managed to write a patch
for that yet. One problem is that if we insist on removing all
unreferenced rels from required_relids, we might end up with a
set that mentions *none* of the RHS and therefore fails to keep
the clause from dropping into the LHS where it must not go.
Not sure about a nice way to handle that.

regards, tom lane

Attachments:

clean-up-const-true-clauses-wip.patchtext/x-diff; charset=us-ascii; name=clean-up-const-true-clauses-wip.patchDownload+26-14
#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: Assert failure of the cross-check for nullingrels

On 2023-May-21, Tom Lane wrote:

Since we're hard up against the beta1 wrap deadline, I went ahead
and pushed the v5 patch. I doubt that it's perfect yet, but it's
a small change and demonstrably fixes the cases we know about.

As I said in the commit message, the main knock I'd lay on v5
is "why not use required_relids all the time?".

So, is this done? I see that you made other commits fixing related code
several days after this email, but none seems to match the changes you
posted in this patch; and also it's not clear to me that there's any
test case where this patch is expected to change behavior. (So there's
also a question of whether this is a bug fix or rather some icying on
cake.)

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#19)
Re: Assert failure of the cross-check for nullingrels

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

So, is this done? I see that you made other commits fixing related code
several days after this email, but none seems to match the changes you
posted in this patch; and also it's not clear to me that there's any
test case where this patch is expected to change behavior. (So there's
also a question of whether this is a bug fix or rather some icying on
cake.)

Well, the bugs I was aware of ahead of PGCon are all fixed, but there
are some new reports I still have to deal with. I left the existing
open issue open, but maybe it'd be better to close it and start a new
one?

regards, tom lane

#21Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#20)