Use extended statistics to estimate (Var op Var) clauses

Started by Tomas Vondra · over 5 years ago · 49 messages · pgsql-hackers
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

Attached is a patch to allow estimation of (Var op Var) clauses using
extended statistics. Currently we only use extended stats to estimate
(Var op Const) clauses, which is sufficient for most cases, but it's not
very hard to support this second type of clause.

This is not an entirely new patch - I've originally included it in the
patch series in [1], but it's probably better to discuss it separately,
so that it does not get buried in that discussion.

[1]: /messages/by-id/20200113230008.g67iyk4cs3xbnjju@development

To illustrate the purpose of this patch, consider this:

db=# create table t (a int, b int);
CREATE TABLE

db=# insert into t select mod(i,10), mod(i,10)+1
from generate_series(1,100000) s(i);
INSERT 0 100000

db=# analyze t;
ANALYZE

db=# explain select * from t where a < b;
QUERY PLAN
--------------------------------------------------------
Seq Scan on t (cost=0.00..1693.00 rows=33333 width=8)
Filter: (a < b)
(2 rows)

db=# explain select * from t where a > b;
QUERY PLAN
--------------------------------------------------------
Seq Scan on t (cost=0.00..1693.00 rows=33333 width=8)
Filter: (a > b)
(2 rows)

db=# create statistics s (mcv) on a,b from t;
CREATE STATISTICS

db=# analyze t;
ANALYZE

db=# explain select * from t where a < b;
QUERY PLAN
---------------------------------------------------------
Seq Scan on t (cost=0.00..1693.00 rows=100000 width=8)
Filter: (a < b)
(2 rows)

db=# explain select * from t where a > b;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..1693.00 rows=1 width=8)
Filter: (a > b)
(2 rows)

I'm not entirely convinced this patch (on its own) is very useful, for
a couple of reasons:

(a) Clauses of this form are not particularly common, at least compared
to the Var op Const clauses. (I don't recall slow-query reports from any
of our mailing lists that might be attributed to such clauses.)

(b) For known cases of such queries (e.g. several TPC-H queries do use
clauses like "l_commitdate < l_receiptdate" etc.) this is somewhat
hindered by extended statistics only supporting MCV lists, which may not
work particularly well for high-cardinality columns like dates etc.

But despite that it seems like a useful feature / building block, and
those limitations may be addressed in some other way (e.g. we may add
multi-dimensional histograms to address the second one).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Support-estimation-of-clauses-of-the-form-V-20201113.patch (text/x-patch; +243 -20)
#2 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#1)
Re: Use extended statistics to estimate (Var op Var) clauses

On Nov 12, 2020, at 5:14 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

<0001-Support-estimation-of-clauses-of-the-form-V-20201113.patch>

Your patch no longer applies. Can we get a new version please?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#2)
Re: Use extended statistics to estimate (Var op Var) clauses

On 3/1/21 8:58 PM, Mark Dilger wrote:

On Nov 12, 2020, at 5:14 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

<0001-Support-estimation-of-clauses-of-the-form-V-20201113.patch>

Your patch no longer applies. Can we get a new version please?

I do not plan to work on this patch in the 2021-03 commitfest. I'll
focus on the other patch about extended statistics on expressions.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#3)
Re: Use extended statistics to estimate (Var op Var) clauses

Hi,

Here is a slightly updated version of the patch - rebased to current
master and fixing some minor issues to handle expressions (and not just
the Var nodes as before).

The changes needed to support (Expr op Expr) are mostly mechanical,
though I'm sure the code needs some cleanup. The main issue I ran into
is the special case in clauselist_selectivity, which does

if (list_length(clauses) == 1)
return clause_selectivity_ext(...);

which applies to cases like "WHERE a < b" which can now be handled by
extended statistics, thanks to this patch. But clause_selectivity_ext
only used to call restriction_selectivity for these clauses, which does
not use extended statistics, of course.

I considered either getting rid of the special case, passing everything
through extended stats, including cases with a single clause. But that
ends up affecting e.g. OR clauses, so I tweaked clause_selectivity_ext a
bit, which seems like a better approach.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Handling-Expr-op-Expr-clauses-in-extended-s-20210613.patch (text/x-patch; +350 -71)
#5 Zhihong Yu
zyu@yugabyte.com
In reply to: Tomas Vondra (#4)
Re: Use extended statistics to estimate (Var op Var) clauses

On Sun, Jun 13, 2021 at 1:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

Hi,

Here is a slightly updated version of the patch - rebased to current
master and fixing some minor issues to handle expressions (and not just
the Var nodes as before).

The changes needed to support (Expr op Expr) are mostly mechanical,
though I'm sure the code needs some cleanup. The main issue I ran into
is the special case in clauselist_selectivity, which does

if (list_length(clauses) == 1)
return clause_selectivity_ext(...);

which applies to cases like "WHERE a < b" which can now be handled by
extended statistics, thanks to this patch. But clause_selectivity_ext
only used to call restriction_selectivity for these clauses, which does
not use extended statistics, of course.

I considered either getting rid of the special case, passing everything
through extended stats, including cases with a single clause. But that
ends up affecting e.g. OR clauses, so I tweaked clause_selectivity_ext a
bit, which seems like a better approach.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Hi,

-           for (i = 0; i < mcvlist->nitems; i++)
+           if (cst)    /* Expr op Const */

It seems the Const op Expr is also covered by this if branch. Hence the
comment should include this case.

Cheers

#6 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#4)
Re: Use extended statistics to estimate (Var op Var) clauses

On Jun 13, 2021, at 1:28 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Here is a slightly updated version of the patch

Thanks for taking this up again!

Applying the new test cases from your patch, multiple estimates have gotten better. That looks good. I wrote a few extra test cases and saw no change, which is fine. I was looking for regressions where the estimates are now worse than before. Do you expect there to be any such cases?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#6)
Re: Use extended statistics to estimate (Var op Var) clauses

On 6/14/21 5:36 PM, Mark Dilger wrote:

On Jun 13, 2021, at 1:28 PM, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Here is a slightly updated version of the patch

Thanks for taking this up again!

Applying the new test cases from your patch, multiple estimates have
gotten better. That looks good. I wrote a few extra test cases and
saw no change, which is fine. I was looking for regressions where
the estimates are now worse than before. Do you expect there to be
any such cases?

Not really. These clauses could not be estimated before; we generally
used default estimates for them. So

WHERE a = b

would use 0.5%, while

WHERE a < b

would use 33%, and so on. OTOH it depends on the accuracy of the
extended statistics - particularly the MCV list (what fraction of the
data it covers, etc.).

So it's possible the default estimate is very accurate by chance, while
the MCV list represents only a tiny fraction of the data. Then the new
estimate could be worse. Consider for example this:

create table t (a int, b int);
insert into t select 100, 100 from generate_series(1,5000) s(i);
insert into t select i, i+1 from generate_series(1,995000) s(i);

This has exactly 0.5% of rows with (a=b). Without extended stats it's
perfect:

explain analyze select * from t where a = b;

Seq Scan on t (cost=0.00..16925.00 rows=5000 width=8)
(actual time=0.064..159.928 rows=5000 loops=1)

while with statistics it gets worse:

create statistics s (mcv) on a, b from t;
analyze t;

Seq Scan on t (cost=0.00..16925.00 rows=9810 width=8)
(actual time=0.059..160.467 rows=5000 loops=1)

It's not terrible, although we could construct worse examples. But the
same issue applies to other clauses, not just to these new ones. And it
relies on the regular estimation producing a better estimate by chance.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#4)
Re: Use extended statistics to estimate (Var op Var) clauses

On Sun, 13 Jun 2021 at 21:28, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Here is a slightly updated version of the patch

The main issue I ran into
is the special case in clauselist_selectivity, which does

if (list_length(clauses) == 1)
return clause_selectivity_ext(...);

which applies to cases like "WHERE a < b" which can now be handled by
extended statistics, thanks to this patch. But clause_selectivity_ext
only used to call restriction_selectivity for these clauses, which does
not use extended statistics, of course.

I considered either getting rid of the special case, passing everything
through extended stats, including cases with a single clause. But that
ends up affecting e.g. OR clauses, so I tweaked clause_selectivity_ext a
bit, which seems like a better approach.

Yeah, I looked at this a few months ago, while looking at the other
extended stats stuff, and I came to the same conclusion that solving
this issue by tweaking clause_selectivity_ext() is the best approach.

I haven't looked at the patch in much detail yet, but I think the
basic approach looks OK.

There are a few comments that need updating, e.g.:
- In statext_is_compatible_clause_internal(), before the "if
(is_opclause(clause))" test.
- The description of the arguments for examine_opclause_args().

I wonder if "expronleftp" for examine_opclause_args() should be
"constonrightp", or some such -- as it stands it's being set to false
for an Expr Op Expr clause, which doesn't seem right because there
*is* an expression on the left.

Regards,
Dean

#9 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#8)
Re: Use extended statistics to estimate (Var op Var) clauses

On Sun, 13 Jun 2021 at 21:28, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Here is a slightly updated version of the patch

Hi,

I have looked at this in some more detail, and it all looks pretty
good, other than some mostly cosmetic stuff.

The new code in statext_is_compatible_clause_internal() is a little
hard to follow because some of the comments aren't right (e.g. when
checking clause_expr2, it isn't an (Expr op Const) or (Const op Expr)
as the comment says). Rather than trying to comment on each
conditional branch, it might be simpler to just have a single
catch-all comment at the top, and also remove the "return true" in the
middle, to make it something like:

/*
* Check Vars appearing on either side by recursing, and make a note of
* any expressions.
*/
if (IsA(clause_expr, Var))
{
if (!statext_is_compatible_clause_internal(...))
return false;
}
else
*exprs = lappend(*exprs, clause_expr);

if (clause_expr2)
{
if (IsA(clause_expr2, Var))
{
if (!statext_is_compatible_clause_internal(...))
return false;
}
else
*exprs = lappend(*exprs, clause_expr2);
}

return true;

Is the FIXME comment in examine_opclause_args() necessary? The check
for a single relation has already been done in
clause[list]_selectivity_ext(), and I'm not sure what
examine_opclause_args() would do differently.

In mcv_get_match_bitmap(), perhaps do the RESULT_IS_FINAL() checks
first in each loop.

Also in mcv_get_match_bitmap(), the 2 "First check whether the
constant is below the lower boundary ..." comments don't make any
sense to me. Were those perhaps copied and pasted from somewhere else?
They should perhaps say "Otherwise, compare the MCVItem with the
constant" and "Otherwise compare the values from the MCVItem using the
clause operator", or something like that.

But other than such cosmetic things, I think the patch is good, and
gives some nice estimate improvements.

Regards,
Dean

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#9)
Re: Use extended statistics to estimate (Var op Var) clauses

Hi,

On 7/5/21 2:46 PM, Dean Rasheed wrote:

On Sun, 13 Jun 2021 at 21:28, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Here is a slightly updated version of the patch

Hi,

I have looked at this in some more detail, and it all looks pretty
good, other than some mostly cosmetic stuff.

Thanks for the review!

The new code in statext_is_compatible_clause_internal() is a little
hard to follow because some of the comments aren't right (e.g. when
checking clause_expr2, it isn't an (Expr op Const) or (Const op Expr)
as the comment says). Rather than trying to comment on each
conditional branch, it might be simpler to just have a single
catch-all comment at the top, and also remove the "return true" in the
middle, to make it something like:

/*
* Check Vars appearing on either side by recursing, and make a note of
* any expressions.
*/
if (IsA(clause_expr, Var))
{
if (!statext_is_compatible_clause_internal(...))
return false;
}
else
*exprs = lappend(*exprs, clause_expr);

if (clause_expr2)
{
if (IsA(clause_expr2, Var))
{
if (!statext_is_compatible_clause_internal(...))
return false;
}
else
*exprs = lappend(*exprs, clause_expr2);
}

return true;

I ended up doing something slightly different - examine_opclause_args
now "returns" a list of expressions, instead of explicitly setting two
parameters. That means we can do a simple foreach() here, which seems
cleaner. It means we have to extract the expressions from the list in a
couple places, but that seems acceptable. Do you agree?

I also went through the comments and updated those that seemed wrong.

Is the FIXME comment in examine_opclause_args() necessary? The check
for a single relation has already been done in
clause[list]_selectivity_ext(), and I'm not sure what
examine_opclause_args() would do differently.

Yeah, I came to the same conclusion.

In mcv_get_match_bitmap(), perhaps do the RESULT_IS_FINAL() checks
first in each loop.

This is how master already does it, and I wonder if that ordering is
intentional. It's not clear to me that doing it the other way would be
faster.

Also in mcv_get_match_bitmap(), the 2 "First check whether the
constant is below the lower boundary ..." comments don't make any
sense to me. Were those perhaps copied and pasted from somewhere else?
They should perhaps say "Otherwise, compare the MCVItem with the
constant" and "Otherwise compare the values from the MCVItem using the
clause operator", or something like that.

Yeah, that's another bit that comes from current master - the patch just
makes a new copy of the comment. I agree it's bogus. It seems like a
remnant of the original code, which did various "smart" things we
removed over time. Will fix.

But other than such cosmetic things, I think the patch is good, and
gives some nice estimate improvements.

Thanks, sounds good. I guess the last thing is maybe mentioning this in
the docs, adding an example etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch (text/x-patch; +341 -78)
#11 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#10)
Re: Use extended statistics to estimate (Var op Var) clauses

On Tue, 20 Jul 2021 at 19:28, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

The new code in statext_is_compatible_clause_internal() is a little
hard to follow because some of the comments aren't right

I ended up doing something slightly different - examine_opclause_args
now "returns" a list of expressions, instead of explicitly setting two
parameters. That means we can do a simple foreach() here, which seems
cleaner. It means we have to extract the expressions from the list in a
couple places, but that seems acceptable. Do you agree?

Yes, that looks much neater.

In mcv_get_match_bitmap(), perhaps do the RESULT_IS_FINAL() checks
first in each loop.

This is how master already does that now, and I wonder if it's done in
this order intentionally. It's not clear to me doing it in the other way
would be faster?

Ah OK, it just felt more natural to do it the other way round. I
suppose though, that for the first clause, the is-final check isn't
going to catch anything, whereas the is-null checks might. For the
remaining clauses, it will depend on the data as to which way is
faster, but it probably isn't going to make any noticeable difference
either way. So, although it initially seems a bit counter-intuitive,
it's probably better the way it is.

I guess the last thing is maybe mentioning this in
the docs, adding an example etc.

Yeah, good idea.

Regards,
Dean

#12 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#10)
Re: Use extended statistics to estimate (Var op Var) clauses

On Jul 20, 2021, at 11:28 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch>

Hi Tomas,

I tested this patch against master looking for types of clauses that uniformly get worse with the patch applied. I found some.

The tests are too large to attach, but the scripts that generate them are not. To perform the tests:

git checkout master
perl ./gentest.pl > src/test/regress/sql/gentest.sql
cat /dev/null > src/test/regress/expected/gentest.out
echo "test: gentest" >> src/test/regress/parallel_schedule
./configure && make && make check
cp src/test/regress/results/gentest.out src/test/regress/expected/gentest.out
patch -p 1 < 0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch
make check
cat src/test/regress/regression.diffs | perl ./check.pl

This shows patterns of conditions that get worse, such as:

better:0, worse:80: A < B and A <> A or not A < A
better:0, worse:80: A < B and not A <= A or A <= A
better:0, worse:80: A < B or A = A
better:0, worse:80: A < B or A = A or not A >= A
better:0, worse:80: A < B or A >= A
better:0, worse:80: A < B or A >= A and not A <> A
better:0, worse:80: A < B or not A < A
better:0, worse:80: A < B or not A <> A
better:0, worse:80: A < B or not A <> A or A <= A
better:0, worse:80: A < B or not A >= A or not A < A

It seems things get worse when the conditions contain a column compared against itself. I suspect that is being handled incorrectly.

Attachments:

check.pl (text/x-perl-script)
gentest.pl (text/x-perl-script)
#13 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#12)
Re: Use extended statistics to estimate (Var op Var) clauses

On 8/9/21 9:19 PM, Mark Dilger wrote:

On Jul 20, 2021, at 11:28 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch>

Hi Tomas,

I tested this patch against master looking for types of clauses that uniformly get worse with the patch applied. I found some.

The tests are too large to attach, but the scripts that generate them are not. To perform the tests:

git checkout master
perl ./gentest.pl > src/test/regress/sql/gentest.sql
cat /dev/null > src/test/regress/expected/gentest.out
echo "test: gentest" >> src/test/regress/parallel_schedule
./configure && make && make check
cp src/test/regress/results/gentest.out src/test/regress/expected/gentest.out
patch -p 1 < 0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch
make check
cat src/test/regress/regression.diffs | perl ./check.pl

This shows patterns of conditions that get worse, such as:

better:0, worse:80: A < B and A <> A or not A < A
better:0, worse:80: A < B and not A <= A or A <= A
better:0, worse:80: A < B or A = A
better:0, worse:80: A < B or A = A or not A >= A
better:0, worse:80: A < B or A >= A
better:0, worse:80: A < B or A >= A and not A <> A
better:0, worse:80: A < B or not A < A
better:0, worse:80: A < B or not A <> A
better:0, worse:80: A < B or not A <> A or A <= A
better:0, worse:80: A < B or not A >= A or not A < A

It seems things get worse when the conditions contain a column compared against itself. I suspect that is being handled incorrectly.

Thanks for this testing!

I took a quick look, and I think this is mostly due to luck in how the
(default) range estimates combine without and with extended statistics.
Consider for example this simple example:

create table t (a int, b int);

insert into t select mod(i,10), mod(i,20)
from generate_series(1,1000000) s(i);

Without stats, the first clauses example is estimated like this:

explain (timing off, analyze) select * from t
where (A < B and A <> A) or not A < A;

QUERY PLAN
----------------------------------------------------------
Seq Scan on t (cost=0.00..21925.00 rows=554444 width=8)
(actual rows=1000000 loops=1)
Filter: (((a < b) AND (a <> a)) OR (a >= a))
Planning Time: 0.054 ms
Execution Time: 80.485 ms
(4 rows)

and with MCV on (a,b) it gets estimates like this:

QUERY PLAN

----------------------------------------------------------
Seq Scan on t (cost=0.00..21925.00 rows=333333 width=8)
(actual rows=1000000 loops=1)
Filter: (((a < b) AND (a <> a)) OR (a >= a))
Planning Time: 0.152 ms
Execution Time: 79.917 ms
(4 rows)

So with the statistics, the estimate gets a bit worse. The reason is
fairly simple - if you look at the two parts of the OR clause, we get this:

clause actual no stats with stats
---------------------------------------------------------------
(A < B and A <> A) 0 331667 1
not (A < A) 1000000 333333 333333

This shows that the first clause is clearly improved, while (A < A) is
estimated the same way as before: the clause has a single Var, so it's
considered "simple" and we ignore the MCV selectivity, using the
simple_sel calculated by clause_selectivity_ext instead.

And the 333333 and 331667 just happen to be closer to the actual row
count. But that's mostly by luck, clearly.
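The no-stats numbers can be reproduced from the default selectivities, assuming the standard independence formulas the planner applies to AND (s1*s2) and OR (s1 + s2 - s1*s2); a quick Python sketch:

```python
# Reproducing the no-stats estimates for
#   WHERE (a < b AND a <> a) OR NOT (a < a)
# from the default selectivities, assuming the standard
# independent-clause formulas: AND -> s1*s2, OR -> s1 + s2 - s1*s2.
DEFAULT_EQ_SEL = 0.005
DEFAULT_INEQ_SEL = 1.0 / 3.0

s_lt = DEFAULT_INEQ_SEL       # a < b
s_ne = 1.0 - DEFAULT_EQ_SEL   # a <> a
s_and = s_lt * s_ne           # (a < b AND a <> a)
s_ge = DEFAULT_INEQ_SEL       # NOT (a < a) is rewritten to a >= a
s_or = s_and + s_ge - s_and * s_ge

print(round(1000000 * s_and))  # 331667
print(round(1000000 * s_ge))   # 333333
print(round(1000000 * s_or))   # 554444
```

That matches the 331667 and 333333 figures in the table, and the rows=554444 estimate in the plan.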

But now that I think about it, maybe the problem really is in how
statext_mcv_clauselist_selectivity treats this clause - the definition
of "simple" clauses as "has one attnum" was appropriate when only
clauses (Var op Const) were supported. But with (Var op Var) that's
probably not correct anymore.

And indeed, commenting out the if condition on line 1933 (so ignoring
simple_sel) does improve the estimates for this query. But perhaps I'm
missing something; this needs more thought.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#13)
Re: Use extended statistics to estimate (Var op Var) clauses

On Wed, 11 Aug 2021 at 00:05, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So with the statistics, the estimate gets a bit worse. The reason is
fairly simple - if you look at the two parts of the OR clause, we get this:

clause actual no stats with stats
---------------------------------------------------------------
(A < B and A <> A) 0 331667 1
not (A < A) 1000000 333333 333333

This shows that the first clause is clearly improved, while (A < A) is
estimated the same way as before: the clause has a single Var, so it's
considered "simple" and we ignore the MCV selectivity, using the
simple_sel calculated by clause_selectivity_ext instead.

And the 333333 and 331667 just happen to be closer to the actual row
count. But that's mostly by luck, clearly.

But now that I think about it, maybe the problem really is in how
statext_mcv_clauselist_selectivity treats this clause - the definition
of "simple" clauses as "has one attnum" was appropriate when only
clauses (Var op Const) were supported. But with (Var op Var) that's
probably not correct anymore.

Hmm, interesting. Clearly the fact that the combined estimate without
extended stats was better was just luck, based on its large
overestimate of the first clause. But it's also true that a (Var op
Var) clause should not be treated as simple, because "simple" in this
context is meant to be for clauses that are likely to be better
estimated with regular stats, whereas in this case, extended stats
would almost certainly do better on the second clause.

Perhaps the easiest way to identify simple clauses would be in
statext_is_compatible_clause(), rather than the way it's done now,
because it has the relevant information at hand, so it could be made
to return an extra flag.

This feels like rather an artificial example though. Is there any real
use for this sort of clause?

Regards,
Dean

#15 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Dean Rasheed (#14)
Re: Use extended statistics to estimate (Var op Var) clauses

On Aug 11, 2021, at 5:08 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

This feels like rather an artificial example though. Is there any real
use for this sort of clause?

The test generated random combinations of clauses and then checked if any had consistently worse performance. These came up. I don't know that they represent anything real.

What was not random in the tests was the data in the tables. I've gotten curious if these types of clauses (with columns compared against themselves) would still be bad for random rather than orderly data sets. I'll go check....

testing....

Wow. Randomizing the data makes the problems even more extreme. It seems my original test set was actually playing to this patch's strengths, not its weaknesses. I've changed the columns to double precision and filled the columns with random() data, where column1 gets random()^1, column2 gets random()^2, etc. So on average the larger numbered columns will be smaller, and the mcv list will be irrelevant, since values should not tend to repeat.
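The skew in that data can be illustrated with a quick Python sketch (illustrative only; the actual test columns were generated in SQL): the mean of random()^i is 1/(i+1), so higher-numbered columns concentrate near zero.

```python
# Sketch of the data distribution described above: column_i holds
# random()^i, so higher-numbered columns lean toward smaller values
# (the mean of U^i for U uniform on [0,1] is 1/(i+1)).
import random

random.seed(42)
N = 100000
means = {}
for i in (1, 2, 3):
    means[i] = sum(random.random() ** i for _ in range(N)) / N
    print(i, round(means[i], 2))
```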

Over all queries, 47791 have better estimates after the patch, but 34802 had worse estimates after the patch (with the remaining 17407 queries having roughly equal quality).

The worst estimates are still ones that have a column compared to itself:

better:0, worse:33: A <= B or A <= A or A <= A
better:0, worse:33: A <= B or A = A or not A <> A
better:0, worse:33: A <= B or A >= A or not A <> A
better:0, worse:33: A <> B or A <= A
better:0, worse:33: A <> B or A <= A or A <> A
better:0, worse:33: A <> B or A <= A or A >= A
better:0, worse:33: A <> B or A <= A or not A = A
better:0, worse:33: A <> B or A > A or not A < A
better:0, worse:33: A <> B or A >= A
better:0, worse:33: A <> B or A >= A and A <= A
better:0, worse:33: A = B or not A > A or not A > A
better:0, worse:33: A >= B or not A <> A or A = A
better:0, worse:39: B <= A or B <= B or B <= B
better:0, worse:39: B <= A or B = B or not B <> B
better:0, worse:39: B <= A or B >= B or not B <> B
better:0, worse:39: B <> A or B <= B
better:0, worse:39: B <> A or B <= B or B <> B
better:0, worse:39: B <> A or B <= B or B >= B
better:0, worse:39: B <> A or B <= B or not B = B
better:0, worse:39: B <> A or B > B or not B < B
better:0, worse:39: B <> A or B >= B
better:0, worse:39: B <> A or B >= B and B <= B
better:0, worse:39: B = A or not B > B or not B > B
better:0, worse:39: B >= A or not B <> B or B = B

But there are plenty that got worse without that, such as the following examples:

better:25, worse:39: A < B and A < B or B > A
better:10, worse:48: A < B and A < C
better:10, worse:54: A < B and A < C or C > A

I'll go test random data designed to have mcv lists of significance....


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#14)
Re: Use extended statistics to estimate (Var op Var) clauses

On 8/11/21 2:08 PM, Dean Rasheed wrote:

On Wed, 11 Aug 2021 at 00:05, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So with the statistics, the estimate gets a bit worse. The reason is
fairly simple - if you look at the two parts of the OR clause, we get this:

clause actual no stats with stats
---------------------------------------------------------------
(A < B and A <> A) 0 331667 1
not (A < A) 1000000 333333 333333

This shows that the first clause is clearly improved, while (A < A) is
estimated the same way as before: the clause has a single Var, so it's
considered "simple" and we ignore the MCV selectivity, using the
simple_sel calculated by clause_selectivity_ext instead.

And the 333333 and 331667 just happen to be closer to the actual row
count. But that's mostly by luck, clearly.

But now that I think about it, maybe the problem really is in how
statext_mcv_clauselist_selectivity treats this clause - the definition
of "simple" clauses as "has one attnum" was appropriate when only
clauses (Var op Const) were supported. But with (Var op Var) that's
probably not correct anymore.

Hmm, interesting. Clearly the fact that the combined estimate without
extended stats was better was just luck, based on it's large
overestimate of the first clause. But it's also true that a (Var op
Var) clause should not be treated as simple, because "simple" in this
context is meant to be for clauses that are likely to be better
estimated with regular stats, whereas in this case, extended stats
would almost certainly do better on the second clause.

I don't see why extended stats would do better on the second clause. I
mean, if you have (A < A) then extended stats pretty much "collapse"
into per-column stats. We could get almost the same estimate from a
single-column MCV list, etc. The reason that does not happen is that
we just treat it as a range clause, and assign it a default 33% estimate.

But we could make that a bit smarter, and assign better estimates to
those clauses

(A < A) => 0.0
(A = A) => 1.0
(A <= A) => 1.0

And that'd give us the same estimates, I think. Not sure that's worth
it, because (A op A) clauses are probably very rare, OTOH it's cheap.
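As a sketch of what that mapping could look like (hypothetical helper, written in Python for illustration; a real implementation would live in the C selectivity code and would match operators by OID, not by name):

```python
# Hypothetical sketch of the (A op A) special case suggested above:
# when both sides of the operator are the same column, the clause is
# a tautology or a contradiction, so a constant selectivity applies.
SELF_COMPARISON_SEL = {
    '<':  0.0,  # A < A is never true
    '>':  0.0,  # A > A is never true
    '<>': 0.0,  # A <> A is never true
    '=':  1.0,  # A = A is always true (ignoring NULLs)
    '<=': 1.0,
    '>=': 1.0,
}

def self_comparison_selectivity(op):
    """Return the constant selectivity for (Var op Var) where both
    Vars reference the same column, or None for an unknown operator."""
    return SELF_COMPARISON_SEL.get(op)

print(self_comparison_selectivity('<'))   # 0.0
print(self_comparison_selectivity('<='))  # 1.0
```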

Perhaps the easiest way to identify simple clauses would be in
statext_is_compatible_clause(), rather than the way it's done now,
because it has the relevant information at hand, so it could be made
to return an extra flag.

Agreed, that seems like a better place to fix this.

This feels like rather an artificial example though. Is there any real
use for this sort of clause?

True. It seems a bit artificial, which is understandable as it came from
a synthetic test generating all possible clauses. OTOH, fixing it seems
fairly cheap ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#15)
Re: Use extended statistics to estimate (Var op Var) clauses

On Aug 11, 2021, at 7:51 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I'll go test random data designed to have mcv lists of significance....

Done. The data for column_i is set to floor(random()^i*20). column_1 therefore is evenly distributed between 0..19, with successive columns weighted more towards smaller values.
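
For reference, a small sketch (Python, an assumption of what the
generator described above looks like; the helper name is hypothetical)
of the floor(random()^i*20) distribution:

```python
import random

def column_value(i):
    # mirrors floor(random()^i * 20): column 1 is roughly uniform over
    # 0..19, while higher-numbered columns skew toward smaller values
    return int(random.random() ** i * 20)

random.seed(0)  # arbitrary seed, for repeatability
means = [sum(column_value(i) for _ in range(10000)) / 10000
         for i in (1, 2, 3)]
# expected means shrink with i: roughly ~9.5, ~6.2, ~4.5
```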

This still gives (marginally) worse results than the original test I posted, but better than the completely random data from the last post. After the patch, 72294 estimates got better and 30654 got worse. The biggest losers from this data set are:

better:0, worse:31: A >= B or A = A or not A = A
better:0, worse:31: A >= B or A = A
better:0, worse:31: A >= B or not A <> A
better:0, worse:31: A >= A or A = B or not B = A
better:0, worse:31: A >= B and not A < A or A = A
better:0, worse:31: A = A or not A > B or B <> A
better:0, worse:31: A >= B or not A <> A or not A >= A
better:0, worse:32: B < A and B > C and not C < B <----
better:1, worse:65: A <> C and A <= B <----
better:0, worse:33: B <> A or B >= B
better:0, worse:33: B <> A or B <= B
better:0, worse:33: B <= A or B = B or not B > B
better:0, worse:33: B <> A or not B >= B or not B < B
better:0, worse:33: B = A or not B > B or B = B
better:0, worse:44: A = B or not A > A or A = A
better:0, worse:44: A <> B or A <= A
better:0, worse:44: A <> B or not A >= A or not A < A
better:0, worse:44: A <= B or A = A or not A > A
better:0, worse:44: A <> B or A >= A

Of which, a few do not contain columns compared against themselves, marked with <---- above.

I don't really know what to make of these results. It doesn't bother me that any particular estimate gets worse after the patch. That's just the nature of estimating. But it does bother me a bit that some types of estimates consistently get worse. We should either show that my analysis is wrong about that, or find a way to address it to avoid performance regressions. If I'm right that there are whole classes of estimates that are made consistently worse, then it stands to reason some users will have those data distributions and queries, and could easily notice.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#15)
Re: Use extended statistics to estimate (Var op Var) clauses

On 8/11/21 4:51 PM, Mark Dilger wrote:

On Aug 11, 2021, at 5:08 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

This feels like rather an artificial example though. Is there any real
use for this sort of clause?

The test generated random combinations of clauses and then checked if any had consistently worse performance. These came up. I don't know that they represent anything real.

What was not random in the tests was the data in the tables. I've gotten curious if these types of clauses (with columns compared against themselves) would still be bad for random rather than orderly data sets. I'll go check....

testing....

Wow. Randomizing the data makes the problems even more extreme. It seems my original test set was actually playing to this patch's strengths, not its weaknesses. I've changed the columns to double precision and filled the columns with random() data, where column1 gets random()^1, column2 gets random()^2, etc. So on average the larger numbered columns will be smaller, and the mcv list will be irrelevant, since values should not tend to repeat.

Over all queries, 47791 have better estimates after the patch, but 34802 had worse estimates after the patch (with the remaining 17407 queries having roughly equal quality).

The worst estimates are still ones that have a column compared to itself:

better:0, worse:33: A <= B or A <= A or A <= A
better:0, worse:33: A <= B or A = A or not A <> A
better:0, worse:33: A <= B or A >= A or not A <> A
better:0, worse:33: A <> B or A <= A
better:0, worse:33: A <> B or A <= A or A <> A
better:0, worse:33: A <> B or A <= A or A >= A
better:0, worse:33: A <> B or A <= A or not A = A
better:0, worse:33: A <> B or A > A or not A < A
better:0, worse:33: A <> B or A >= A
better:0, worse:33: A <> B or A >= A and A <= A
better:0, worse:33: A = B or not A > A or not A > A
better:0, worse:33: A >= B or not A <> A or A = A
better:0, worse:39: B <= A or B <= B or B <= B
better:0, worse:39: B <= A or B = B or not B <> B
better:0, worse:39: B <= A or B >= B or not B <> B
better:0, worse:39: B <> A or B <= B
better:0, worse:39: B <> A or B <= B or B <> B
better:0, worse:39: B <> A or B <= B or B >= B
better:0, worse:39: B <> A or B <= B or not B = B
better:0, worse:39: B <> A or B > B or not B < B
better:0, worse:39: B <> A or B >= B
better:0, worse:39: B <> A or B >= B and B <= B
better:0, worse:39: B = A or not B > B or not B > B
better:0, worse:39: B >= A or not B <> B or B = B

The other interesting thing all those clauses have in common is that
they're OR clauses, which we handle a bit differently. But I think
the "strange" clauses with the same Var on both sides are the main
issue, and not detecting them as "simple" clauses should fix that.

But there are plenty that got worse without that, such as the following examples:

better:25, worse:39: A < B and A < B or B > A
better:10, worse:48: A < B and A < C
better:10, worse:54: A < B and A < C or C > A

I'll go test random data designed to have mcv lists of significance....

Hard to say without having a look at the data set, but there'll always
be cases where the extended stats perform a bit worse, due to (a) luck
and (b) the stats covering only a small fraction of the table.

But of course, it's worth investigating the suspicious cases.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#15)
Re: Use extended statistics to estimate (Var op Var) clauses

On 8/11/21 4:51 PM, Mark Dilger wrote:

On Aug 11, 2021, at 5:08 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

This feels like rather an artificial example though. Is there any real
use for this sort of clause?

The test generated random combinations of clauses and then checked if
any had consistently worse performance. These came up. I don't
know that they represent anything real.

What was not random in the tests was the data in the tables. I've
gotten curious if these types of clauses (with columns compared
against themselves) would still be bad for random rather than orderly
data sets. I'll go check....
testing....

Wow. Randomizing the data makes the problems even more extreme. It
seems my original test set was actually playing to this patch's
strengths, not its weaknesses. I've changed the columns to double
precision and filled the columns with random() data, where column1 gets
random()^1, column2 gets random()^2, etc. So on average the larger
numbered columns will be smaller, and the mcv list will be irrelevant,
since values should not tend to repeat.

I tried using the same randomized data set, i.e. essentially

insert into t
select random(), pow(random(), 2), pow(random(), 3), pow(random(), 4)
from generate_series(1,1000000) s(i);

create statistics s (mcv) on a, b, c from t;

But I don't see any difference compared to the estimates without
extended statistics, which is not surprising because there should be no
MCV list built. So I'm a bit puzzled by the claim that random data
makes the problems more extreme. Can you explain?
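
As a sanity check of that expectation - continuous random() values
essentially never repeat, so ANALYZE finds nothing frequent enough to
put in an MCV list. A quick sketch (Python, illustrative only):

```python
import random

random.seed(1)  # arbitrary seed
vals = [random.random() for _ in range(100000)]
# with ~2^53 possible doubles in [0, 1), duplicates among 100k draws
# are astronomically unlikely, so no value is "most common"
distinct = len(set(vals))
```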

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#17)
Re: Use extended statistics to estimate (Var op Var) clauses

On 8/11/21 5:17 PM, Mark Dilger wrote:

On Aug 11, 2021, at 7:51 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I'll go test random data designed to have mcv lists of significance....

Done. The data for column_i is set to floor(random()^i*20).
column_1 therefore is evenly distributed between 0..19, with
successive columns weighted more towards smaller values.

This still gives (marginally) worse results than the original test I
posted, but better than the completely random data from the last post.
After the patch, 72294 estimates got better and 30654 got worse. The
biggest losers from this data set are:

better:0, worse:31: A >= B or A = A or not A = A
better:0, worse:31: A >= B or A = A
better:0, worse:31: A >= B or not A <> A
better:0, worse:31: A >= A or A = B or not B = A
better:0, worse:31: A >= B and not A < A or A = A
better:0, worse:31: A = A or not A > B or B <> A
better:0, worse:31: A >= B or not A <> A or not A >= A
better:0, worse:32: B < A and B > C and not C < B <----
better:1, worse:65: A <> C and A <= B <----
better:0, worse:33: B <> A or B >= B
better:0, worse:33: B <> A or B <= B
better:0, worse:33: B <= A or B = B or not B > B
better:0, worse:33: B <> A or not B >= B or not B < B
better:0, worse:33: B = A or not B > B or B = B
better:0, worse:44: A = B or not A > A or A = A
better:0, worse:44: A <> B or A <= A
better:0, worse:44: A <> B or not A >= A or not A < A
better:0, worse:44: A <= B or A = A or not A > A
better:0, worse:44: A <> B or A >= A

Of which, a few do not contain columns compared against themselves,
marked with <---- above.

I don't really know what to make of these results. It doesn't
bother me that any particular estimate gets worse after the patch.
That's just the nature of estimating. But it does bother me a bit
that some types of estimates consistently get worse. We should
either show that my analysis is wrong about that, or find a way to
address it to avoid performance regressions. If I'm right that there
are whole classes of estimates that are made consistently worse, then
it stands to reason some users will have those data distributions and
queries, and could easily notice.

I'm not quite sure that's really a problem. Extended statistics are
meant for correlated columns, and it's mostly expected that the
estimates may be a bit worse for random / independent data. The idea is
that statistics will be created only for correlated columns, in which
case they should improve the estimates. I'd be far more concerned if
you observed consistently worse estimates on such a data set.

Of course, there may be errors - the incorrect handling of (A op A) is
probably an example of such an issue.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#19)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#21)
#23Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#22)
#24Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#23)
#25Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#23)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#25)
#27Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#26)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#27)
#29Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#28)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#26)
#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#30)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#32)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#33)
#35Zhihong Yu
zyu@yugabyte.com
In reply to: Tomas Vondra (#34)
#36Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#34)
#37Zhihong Yu
zyu@yugabyte.com
In reply to: Mark Dilger (#36)
#38Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Zhihong Yu (#37)
#39Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#36)
#40Zhihong Yu
zyu@yugabyte.com
In reply to: Tomas Vondra (#39)
#41Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Zhihong Yu (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#39)
#43Zhihong Yu
zyu@yugabyte.com
In reply to: Tomas Vondra (#41)
#44Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tomas Vondra (#42)
#45Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#44)
#46Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#42)
#47Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#46)
#48Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#47)
#49Michael Paquier
michael@paquier.xyz
In reply to: Dean Rasheed (#48)