Bugs in planner's equivalence-class processing
I looked into the problem reported in bug #7604 (Bill MacArthur was kind
enough to send me a reproducer off-list). The error can be demonstrated
by this example in the regression test database:
select f1, unique2, case when unique2 is null then f1 else 0 end
from int4_tbl a left join tenk1 b on f1 = unique2
where (case when unique2 is null then f1 else 0 end) = 0;
If you look at the output of the basic join:
regression=# select f1, unique2, case when unique2 is null then f1 else 0 end
regression-# from int4_tbl a left join tenk1 b on f1 = unique2;
f1 | unique2 | case
-------------+---------+-------------
0 | 0 | 0
123456 | | 123456
-123456 | | -123456
2147483647 | | 2147483647
-2147483647 | | -2147483647
(5 rows)
it's obvious that only the first of these rows passes the added WHERE
condition, so the query should produce just that row; but what you
actually get in 9.2 is all five rows! EXPLAIN gives a clue what's
going wrong:
Nested Loop Left Join (cost=0.00..22.51 rows=1 width=8)
-> Seq Scan on int4_tbl a (cost=0.00..1.05 rows=5 width=4)
-> Index Only Scan using tenk1_unique2 on tenk1 b (cost=0.00..4.28 rows=1 w
idth=4)
Index Cond: (unique2 = a.f1)
Filter: (CASE WHEN (unique2 IS NULL) THEN a.f1 ELSE 0 END = 0)
(5 rows)
The filter condition has been dropped down to the scan of tenk1 (relying
on a.f1 being passed in as a parameter), which is wrong since there it
is unable to filter null-extended rows produced by the left join
operator.
At first I didn't see how this could be happening: the code for placing
quals looks at RestrictInfo.nullable_relids, and this qual should have
tenk1 in its nullable_relids since it's above a left join to tenk1.
But gdb soon told me that that wasn't what join_clause_is_movable_to was
seeing. Eventually I figured out that the equivalence-class machinery
was eating the clause (forming an EquivalenceClass consisting of the
CASE expression and the constant 0) and reconstructing a new clause that
didn't have nullable_relids set.
So this is essentially an oversight in the patch that added tracking of
nullable_relids. I got confused about the difference between
outerjoin_delayed (this clause, as a whole, is not outerjoin_delayed
because its natural semantic level would be at the join anyway) and
having nonempty nullable_relids, and thought that equivalence-class
processing would never see such clauses because it doesn't accept
outerjoin_delayed clauses. So there's no code in equivclass.c to
compute nullable_relids sets for constructed clauses. At some point
it might be worth adding same, but for now I'm just going to tweak
distribute_qual_to_rels to not believe that clauses with nonempty
nullable_relids can be equivalence clauses.
It gets worse though. Tracking clauses' nullable_relids is new in 9.2,
but variants of this example can be demonstrated to fail as far back
as 7.4. Here's one:
select q1, unique2, thousand, hundred
from int8_tbl a left join tenk1 b on q1 = unique2
where coalesce(thousand,123) = q1 and q1 = coalesce(hundred,123);
If you look at just the bare join output, there are no rows that
should pass the WHERE clause, but actually what you get in 7.4-9.1
is
q1 | unique2 | thousand | hundred
-----+---------+----------+---------
123 | | |
123 | | |
(2 rows)
which is completely broken because it's not even a subset of the bare
join output :-(. 9.1 shows this EXPLAIN output:
Nested Loop Left Join (cost=0.00..42.48 rows=1 width=20)
Filter: (a.q1 = COALESCE(b.thousand, 123))
-> Seq Scan on int8_tbl a (cost=0.00..1.05 rows=5 width=8)
-> Index Scan using tenk1_unique2 on tenk1 b (cost=0.00..8.27 rows=1 width=
12)
Index Cond: (a.q1 = unique2)
Filter: (COALESCE(thousand, 123) = COALESCE(hundred, 123))
What's happening here is that because the two WHERE clauses are not
outerjoin_delayed, they're being deconstructed as equivalence clauses,
and then we synthesize an equivalence constraint on the two COALESCE
expressions that can be applied at the tenk1 scan level. That gets
rid of the row that should join to q1=123, allowing the bogus
null-extended join rows to be formed.
So this shows that, although 9.2 has a more extensive form of the
disease, the use of outerjoin_delayed to decide whether a clause can be
an equivalence clause is really quite wrong. I believe what I need to
do to fix this is back-port the 9.2 logic that computes a
nullable_relids set for each clause. We won't need to store it, just
check whether it's empty before letting the clause be treated as an
equivalence clause.
Is anybody concerned about the compatibility implications of fixing this
bug in the back branches? I'm worried about people complaining that we
broke their application in a minor release. Maybe they were depending
on incorrect behavior, but they might complain anyway. On the other
hand, the fact that this hasn't been reported from the field in nine
years suggests that not many people write queries like this.
regards, tom lane
On Tue, Oct 16, 2012 at 11:56:52AM -0400, Tom Lane wrote:
Is anybody concerned about the compatibility implications of fixing this
bug in the back branches? I'm worried about people complaining that we
broke their application in a minor release. Maybe they were depending
on incorrect behavior, but they might complain anyway. On the other
hand, the fact that this hasn't been reported from the field in nine
years suggests that not many people write queries like this.
Nice detective work. I'd personally say that it should be fixed. I
personally haven't written these kinds of queries so I'm not affected,
but I don't like the idea of known bugs being unfixed.
It's a pity we can't have a system that can somehow independantly
checks the results of the planner....
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
I wrote:
So this is essentially an oversight in the patch that added tracking of
nullable_relids. I got confused about the difference between
outerjoin_delayed (this clause, as a whole, is not outerjoin_delayed
because its natural semantic level would be at the join anyway) and
having nonempty nullable_relids, and thought that equivalence-class
processing would never see such clauses because it doesn't accept
outerjoin_delayed clauses. So there's no code in equivclass.c to
compute nullable_relids sets for constructed clauses. At some point
it might be worth adding same, but for now I'm just going to tweak
distribute_qual_to_rels to not believe that clauses with nonempty
nullable_relids can be equivalence clauses.
I tried the latter, with the first patch attached below, and it fixed
the incorrect-plan problem. However, further testing showed that this
method also breaks one of the optimizations implemented in equivclass.c,
which is the ability to push constants through full joins, as in
select * from tenk1 a full join tenk1 b using (unique1)
where unique1 = 42;
We don't seem to have a regression test case for that optimization,
which is probably because it predates the availability of EXPLAIN's
COSTS OFF option. (I've added one in the second patch below.)
I thought for a minute about accepting that as fallout, but I know
for certain that we'd get pushback on it, because we did last time
I broke it :-(.
So it appears that what we have to do is upgrade the equivalence-class
machinery to handle nullable_relids properly. The second patch attached
below does that. It's considerably larger than the quick-hack patch,
though not as large as I'd originally feared.
Anyway, the question now is whether to back-patch this. We do have
pretty much the same equivalence-class infrastructure in all versions
since 8.3, so it's possible to do it ... but it's a bit nervous-making.
On the other hand, letting a known bug go unfixed isn't attractive
either.
regards, tom lane
On 16 October 2012 16:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Is anybody concerned about the compatibility implications of fixing this
bug in the back branches? I'm worried about people complaining that we
broke their application in a minor release. Maybe they were depending
on incorrect behavior, but they might complain anyway. On the other
hand, the fact that this hasn't been reported from the field in nine
years suggests that not many people write queries like this.
Thanks for investigating this. My experience is that people seldom
check or understand the output of a query, they probably just figure
they didn't understand SQL and rewrite a different way, so its hard to
gauge the impact.
I think we need to see the cure before we can decide whether its worse
than the disease. And especially important is that we fix this just
once so I suggest fix and then backpatch deeply later.
This type of thing is handled in other products by having a
compatibility level, so you can decide whether you want it or not. Not
suggesting that here, yet, but its one way of mitigating the change.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services