BUG #18643: EXPLAIN estimated rows mismatch

Started by PG Bug reporting form · 9 messages · pgsql-bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18643
Logged by: ming wei tan
Email address: mingwei.tc@gmail.com
PostgreSQL version: 12.20
Operating system: Debian 12.20-1.pgdg120+1 (Docker)
Description:

Given predicates A and B, it is expected that size(SELECT X WHERE A) <=
size(SELECT X WHERE A OR B).

However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
rows=2537
QUERY PLAN
------------------------------------------------------
Seq Scan on t2 (cost=0.00..35.50 rows=2537 width=4)
Filter: (c0 IS NOT NULL)

While, `EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 >
4)` returns rows=858
QUERY PLAN
-----------------------------------------------------
Seq Scan on t2 (cost=0.00..48.25 rows=858 width=4)
Filter: ((c0 = c0) OR (c0 > 4))

Running on docker image postgres:12

Test cases:

DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4;
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);

DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4;
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0)=(t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE ((t2.c0)=(t2.c0)) OR (t2.c0 > 4);

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #18643: EXPLAIN estimated rows mismatch

PG Bug reporting form <noreply@postgresql.org> writes:

Given predicates A and B, it is expected that size(SELECT X WHERE A) <=
size(SELECT X WHERE A OR B).
However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
rows=2537

I don't see any particular bug here. If you look closely at the
EXPLAIN output, you'll see that "t2.c0 IN (t2.c0)" is transformed
to "c0 IS NOT NULL" --- but only if it's at top level. So we're
estimating selectivities for two quite different conditions in
this example.

The NOT NULL bit happens because a top-level equality clause
is transformed into an "EquivalenceClass", and then when we
notice the class has only one member, we prefer to spit out
"x IS NOT NULL" rather than "x = x". That has the same effect
(at top level of WHERE, anyway) and tends to be estimated
more accurately.
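
A minimal sketch of the asymmetry on a scratch table (hypothetical
table and column names; costs and row estimates will vary locally):

```sql
-- Top level of WHERE: the equality feeds the EquivalenceClass
-- machinery, and the single-member class is emitted as IS NOT NULL.
CREATE TABLE t(x int);
EXPLAIN SELECT * FROM t WHERE x = x;
--   Filter: (x IS NOT NULL)

-- Under OR: no EquivalenceClass is built, so the clause survives
-- as "x = x" and gets the generic equality selectivity estimate.
EXPLAIN SELECT * FROM t WHERE x = x OR x > 4;
--   Filter: ((x = x) OR (x > 4))
```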

In any case, in this toy example that lacks an ANALYZE step,
the selectivity estimates are mostly going to be garbage.

regards, tom lane

#3 Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#2)
Re: BUG #18643: EXPLAIN estimated rows mismatch

On 1/10/2024 18:43, Tom Lane wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

Given predicates A and B, it is expected that size(SELECT X WHERE A) <=
size(SELECT X WHERE A OR B).
However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
rows=2537

I don't see any particular bug here. If you look closely at the
EXPLAIN output, you'll see that "t2.c0 IN (t2.c0)" is transformed
to "c0 IS NOT NULL" --- but only if it's at top level. So we're
estimating selectivities for two quite different conditions in
this example.

The NOT NULL bit happens because a top-level equality clause
is transformed into an "EquivalenceClass", and then when we
notice the class has only one member, we prefer to spit out
"x IS NOT NULL" rather than "x = x". That has the same effect
(at top level of WHERE, anyway) and tends to be estimated
more accurately.

I think their question was about why 'x IN (x)' transforms differently
at the top level and inside an OR clause. It is a pretty typical question.

--
regards, Andrei Lepikhov

#4 ming wei tan
mingwei.tc@gmail.com
In reply to: Andrei Lepikhov (#3)
Re: BUG #18643: EXPLAIN estimated rows mismatch

On 1/10/2024 18:43, Tom Lane wrote:

In any case, in this toy example that lacks an ANALYZE step,
the selectivity estimates are mostly going to be garbage.

Thanks for the replies. I was just checking whether a bug is present
here. Even with ANALYZE, the first EXPLAIN estimates more rows than
the second, even though the second WHERE clause is less restrictive.

ANALYZE;

ANALYZE

EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
QUERY PLAN
--------------------------------------------------
Seq Scan on t2 (cost=0.00..1.02 rows=2 width=4)
Filter: (c0 IS NOT NULL)
(2 rows)

EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);
QUERY PLAN
--------------------------------------------------
Seq Scan on t2 (cost=0.00..1.03 rows=1 width=4)
Filter: ((c0 = c0) OR (c0 > 4))
(2 rows)

DROP DATABASE IF EXISTS database4;
CREATE DATABASE database4 WITH ENCODING 'UTF8' TEMPLATE template0;
\c database4;
CREATE TABLE t2(c0 int);
INSERT INTO t2(c0) VALUES(1);
INSERT INTO t2(c0) VALUES(2);
ANALYZE;
EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0);
EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 > 4);

Regards,
Ming Wei

#5 Andrei Lepikhov
lepihov@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #18643: EXPLAIN estimated rows mismatch

On 1/10/2024 09:47, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 18643
Logged by: ming wei tan
Email address: mingwei.tc@gmail.com
PostgreSQL version: 12.20
Operating system: Debian 12.20-1.pgdg120+1 (Docker)
Description:

Given predicates A and B, it is expected that size(SELECT X WHERE A) <=
size(SELECT X WHERE A OR B).

However, `EXPLAIN SELECT t2.c0 FROM t2 WHERE t2.c0 IN (t2.c0)` returns
rows=2537
QUERY PLAN
------------------------------------------------------
Seq Scan on t2 (cost=0.00..35.50 rows=2537 width=4)
Filter: (c0 IS NOT NULL)

While, `EXPLAIN SELECT t2.c0 FROM t2 WHERE (t2.c0 IN (t2.c0)) OR (t2.c0 >
4)` returns rows=858
QUERY PLAN
-----------------------------------------------------
Seq Scan on t2 (cost=0.00..48.25 rows=858 width=4)
Filter: ((c0 = c0) OR (c0 > 4))

I found one link that is connected to your case:
/messages/by-id/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw@mail.gmail.com
There you can find an answer to your question. In general, it is too
expensive to test each clause (especially a non-top-level one) for
possible evaluation into a constant expression.

--
regards, Andrei Lepikhov

#6 David Rowley
dgrowleyml@gmail.com
In reply to: ming wei tan (#4)
Re: BUG #18643: EXPLAIN estimated rows mismatch

On Thu, 3 Oct 2024 at 07:17, ming wei tan <mingwei.tc@gmail.com> wrote:

On 1/10/2024 18:43, Tom Lane wrote:

In any case, in this toy example that lacks an ANALYZE step,
the selectivity estimates are mostly going to be garbage.

Thanks for the replies. I was just checking whether a bug is present
here. Even with ANALYZE, the first EXPLAIN estimates more rows than
the second, even though the second WHERE clause is less restrictive.

I think you already checked that, and Tom's answer mentioned the reason
this happens.

We certainly could do better here but, as Tom mentioned, your example
does not seem convincing enough to warrant much effort.

You should read the commit message in [1] as that might help you
understand the project's point of view on these sorts of
optimisations. See in particular the first sentence of the second
paragraph.

David

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8ec5429e2f422f4d570d4909507db0d4ca83bbac

#7 Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#6)
Re: BUG #18643: EXPLAIN estimated rows mismatch

On 10/3/24 03:57, David Rowley wrote:

On Thu, 3 Oct 2024 at 07:17, ming wei tan <mingwei.tc@gmail.com> wrote:

On 1/10/2024 18:43, Tom Lane wrote:

In any case, in this toy example that lacks an ANALYZE step,
the selectivity estimates are mostly going to be garbage.

Thanks for the replies. I was just checking whether a bug is present
here. Even with ANALYZE, the first EXPLAIN estimates more rows than
the second, even though the second WHERE clause is less restrictive.

I think you already checked that, and Tom's answer mentioned the reason
this happens.

We certainly could do better here but, as Tom mentioned, your example
does not seem convincing enough to warrant much effort.

You should read the commit message in [1] as that might help you
understand the project's point of view for these sort of
optimisations. See in particular the first sentence of the second
paragraph.

I can agree with the underlying reason. But the presence of prepqual.c
makes it less clear: the optimiser already attempts to remove
duplicated ORs via the find_duplicate_ors function, and it does so in
a somewhat strange manner:

explain
SELECT oid,relname FROM pg_class WHERE oid=1 OR oid=1;

Index Scan using pg_class_oid_index on pg_class
Index Cond: (oid = '1'::oid)

explain
SELECT oid,relname FROM pg_class WHERE oid=1 OR oid=1 OR oid=2;

Seq Scan on pg_class
Filter: ((oid = '1'::oid) OR (oid = '1'::oid) OR (oid = 2))

So, we already pass through the OR clauses. Why not check for
semantically equal clauses and remove duplicates even if not all
clauses are such duplicates? At least, it continually raises users'
questions.

--
regards, Andrei Lepikhov

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrei Lepikhov (#7)
Re: BUG #18643: EXPLAIN estimated rows mismatch

Andrei Lepikhov <lepihov@gmail.com> writes:

So, we already pass through the OR clauses. Why not check for
semantically equal clauses and remove duplicates even if not all
clauses are such duplicates? At least, it continually raises users'
questions.

It's difficult to justify spending extra planner cycles to optimize
what are fundamentally stupidly-written queries. Who writes "X=X"
in the first place? (Other than ORM authors who need to spend some
time in a re-education camp.) And it would not be a trivial number
of extra cycles, either. As pointed out in the commit message
David mentioned, it's basically free to make this improvement
when we're looking at a potential EquivalenceClass clause.
We've already paid the cost of checking that the operator is a btree
equality operator, and we know that the clause is at top level of
WHERE (else we couldn't fuzz over the difference between false and
null results), and besides we have to check whether it's "X=X" because
not doing so causes some semantic problems for the EquivalenceClass
machinery. In the case of a random clause-underneath-OR, we would
have to make a brand new check whether it's btree equality, and we
would have to somehow track whether we had descended to an expression
level where "false and null are known equivalent" is no longer true.
So I really doubt that a case can be made that that is worth doing.
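
As a sketch of why the equivalence breaks below the top level: a
boolean test such as IS NOT FALSE distinguishes null from false,
whereas the top level of WHERE treats both as "reject the row":

```sql
SELECT (NULL::int = NULL::int) IS NOT FALSE;   -- true: NULL = NULL yields null
SELECT (NULL::int IS NOT NULL) IS NOT FALSE;   -- false: the test yields false
```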

regards, tom lane

#9 Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#8)
Re: BUG #18643: EXPLAIN estimated rows mismatch

On 10/3/24 11:40, Tom Lane wrote:

Andrei Lepikhov <lepihov@gmail.com> writes:

So, we already pass through the OR clauses. Why not check for
semantically equal clauses and remove duplicates even if not all
clauses are such duplicates? At least, it continually raises users'
questions.

It's difficult to justify spending extra planner cycles to optimize
what are fundamentally stupidly-written queries. Who writes "X=X"
in the first place? (Other than ORM authors who need to spend some
time in a re-education camp.)

I don't oppose your arguments as such. I want to say that ORMs and
other APIs (I sometimes see data analysts who use an AI to query a
database) are already essential to development (at least in startups).
It may not be suitable to deal with such cases in the core, but it is
too costly to do it in an extension right now. In this sense,
'silicon' users aren't equal to the smarter carbon-based ones. Here I
see two options:
1. A canonicalize_qual_hook - the most blunt approach. Of course, we
would need more hooks in the near future with this approach.
2. An 'ORM' GUC to enable multiple optimisations in the core,
'smoothing over' users' mistakes.

It would be great to discuss other options.

And it would not be a trivial number
of extra cycles, either. As pointed out in the commit message
David mentioned, it's basically free to make this improvement
when we're looking at a potential EquivalenceClass clause.
We've already paid the cost of checking that the operator is a btree
equality operator, and we know that the clause is at top level of
WHERE (else we couldn't fuzz over the difference between false and
null results), and besides we have to check whether it's "X=X" because
not doing so causes some semantic problems for the EquivalenceClass
machinery. In the case of a random clause-underneath-OR, we would
have to make a brand new check whether it's btree equality, and we
would have to somehow track whether we had descended to an expression
level where "false and null are known equivalent" is no longer true.
So I really doubt that a case can be made that that is worth doing.

Thanks, this explanation is quite valuable to me.

--
regards, Andrei Lepikhov