Additional improvements to extended statistics

Started by Tomas Vondraover 6 years ago40 messageshackers

tomas.vondra@2ndquadrant.com

over 6 years ago

Hi,

Now that I've committed [1]https://commitfest.postgresql.org/26/2320/ which allows us to use multiple extended
statistics per table, I'd like to start a thread discussing a couple of
additional improvements for extended statistics. I've considered
starting a separate patch for each, but that would be messy as those
changes will touch roughly the same places. So I've organized it into a
single patch series, with the simpler parts at the beginning.

There are three main improvements:

1) improve estimates of OR clauses

Until now, OR clauses pretty much ignored extended statistics, based on
the experience that they're less vulnerable to misestimates. But it's a
bit weird that AND clauses are handled while OR clauses are not, so this
extends the logic to OR clauses.

Status: I think this is fairly OK.

2) support estimating clauses (Var op Var)

Currently, we only support clauses with a single Var, i.e. clauses like

- Var op Const
- Var IS [NOT] NULL
- [NOT] Var
- ...

and AND/OR clauses built from those simple ones. This patch adds support
for clauses of the form (Var op Var), of course assuming both Vars come
from the same relation.

Status: This works, but it feels a bit hackish. Needs more work.

3) support extended statistics on expressions

Currently we only allow simple references to columns in extended stats,
so we can do

CREATE STATISTICS s ON a, b, c FROM t;

but not

CREATE STATISTICS s ON (a+b), (c + 1) FROM t;

This patch aims to allow this. At the moment it's a WIP - it does most
of the catalog changes and stats building, but with some hacks/bugs. And
it does not even try to use those statistics during estimation.

The first question is how to extend the current pg_statistic_ext catalog
to support expressions. I've been planning to do it the way we support
expressions for indexes, i.e. have two catalog fields - one for keys,
one for expressions.

One difference is that for statistics we don't care about order of the
keys, so that we don't need to bother with storing 0 keys in place for
expressions - we can simply assume keys are first, then expressions.

And this is what the patch does now.

I'm however wondering whether to keep this split - why not to just treat
everything as expressions, and be done with it? A key just represents a
Var expression, after all. And it would massively simplify a lot of code
that now has to care about both keys and expressions.

Of course, expressions are a bit more expensive, but I wonder how
noticeable that would be.

Opinions?

ragards

[1]: https://commitfest.postgresql.org/26/2320/

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Pavel Stehule

pavel.stehule@gmail.com

over 6 years ago

In reply to: Tomas Vondra (#1)

Re: Additional improvements to extended statistics

út 14. 1. 2020 v 0:00 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
napsal:

Hi,

Now that I've committed [1] which allows us to use multiple extended
statistics per table, I'd like to start a thread discussing a couple of
additional improvements for extended statistics. I've considered
starting a separate patch for each, but that would be messy as those
changes will touch roughly the same places. So I've organized it into a
single patch series, with the simpler parts at the beginning.

There are three main improvements:

1) improve estimates of OR clauses

Until now, OR clauses pretty much ignored extended statistics, based on
the experience that they're less vulnerable to misestimates. But it's a
bit weird that AND clauses are handled while OR clauses are not, so this
extends the logic to OR clauses.

Status: I think this is fairly OK.

2) support estimating clauses (Var op Var)

Currently, we only support clauses with a single Var, i.e. clauses like

- Var op Const
- Var IS [NOT] NULL
- [NOT] Var
- ...

and AND/OR clauses built from those simple ones. This patch adds support
for clauses of the form (Var op Var), of course assuming both Vars come
from the same relation.

Status: This works, but it feels a bit hackish. Needs more work.

3) support extended statistics on expressions

Currently we only allow simple references to columns in extended stats,
so we can do

CREATE STATISTICS s ON a, b, c FROM t;

but not

CREATE STATISTICS s ON (a+b), (c + 1) FROM t;

+1 for expression's statisctics - it can be great feature.

Pavel

Show quoted text

This patch aims to allow this. At the moment it's a WIP - it does most
of the catalog changes and stats building, but with some hacks/bugs. And
it does not even try to use those statistics during estimation.

The first question is how to extend the current pg_statistic_ext catalog
to support expressions. I've been planning to do it the way we support
expressions for indexes, i.e. have two catalog fields - one for keys,
one for expressions.

One difference is that for statistics we don't care about order of the
keys, so that we don't need to bother with storing 0 keys in place for
expressions - we can simply assume keys are first, then expressions.

And this is what the patch does now.

I'm however wondering whether to keep this split - why not to just treat
everything as expressions, and be done with it? A key just represents a
Var expression, after all. And it would massively simplify a lot of code
that now has to care about both keys and expressions.

Of course, expressions are a bit more expensive, but I wonder how
noticeable that would be.

Opinions?

ragards

[1] https://commitfest.postgresql.org/26/2320/

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Tomas Vondra (#1)

Re: Additional improvements to extended statistics

Hi,

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses, and
added a bunch of regression tests to exercise this code. It's not quite
there yet, but I think it's feasible to get this committed for PG13.

The last part (extended stats on expressions) is far from complete, and
it's not feasible to get it into PG13. There's too much missing stuff.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Tomas Vondra (#3)

Re: Additional improvements to extended statistics

On Fri, Mar 06, 2020 at 01:15:56AM +0100, Tomas Vondra wrote:

Hi,

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses, and
added a bunch of regression tests to exercise this code. It's not quite
there yet, but I think it's feasible to get this committed for PG13.

The last part (extended stats on expressions) is far from complete, and
it's not feasible to get it into PG13. There's too much missing stuff.

Meh, the last part with stats on expression is not quite right and it
breaks the cputube tester, so here are the first two parts only. I don't
plan to pursue the 0003 part for PG13 anyway, as mentioned.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Dean Rasheed

dean.a.rasheed@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#4)

Re: Additional improvements to extended statistics

On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses.

Hi,

I've been looking over the first patch (OR list support). It mostly
looks reasonable to me, except there's a problem with the way
statext_mcv_clauselist_selectivity() combines multiple stat_sel values
into the final result -- in the OR case, it needs to start with sel =
0, and then apply the OR formula to factor in each new estimate. I.e.,
this isn't right for an OR list:

/* Factor the estimate from this MCV to the oveall estimate. */
sel *= stat_sel;

(Oh and there's a typo in that comment: s/oveall/overall/).

For example, with the regression test data, this isn't estimated well:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;

Similarly, if no extended stats can be applied it needs to return 0
not 1, for example this query on the test data:

SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;

It might also be worth adding a couple more regression test cases like these.

Regards,
Dean

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Dean Rasheed (#5)

Re: Additional improvements to extended statistics

On Sun, Mar 08, 2020 at 07:17:10PM +0000, Dean Rasheed wrote:

On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses.

Hi,

I've been looking over the first patch (OR list support). It mostly
looks reasonable to me, except there's a problem with the way
statext_mcv_clauselist_selectivity() combines multiple stat_sel values
into the final result -- in the OR case, it needs to start with sel =
0, and then apply the OR formula to factor in each new estimate. I.e.,
this isn't right for an OR list:

/* Factor the estimate from this MCV to the oveall estimate. */
sel *= stat_sel;

(Oh and there's a typo in that comment: s/oveall/overall/).

For example, with the regression test data, this isn't estimated well:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;

Similarly, if no extended stats can be applied it needs to return 0
not 1, for example this query on the test data:

SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;

Ah, right. Thanks for noticing this. Attaches is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

It might also be worth adding a couple more regression test cases like these.

Agreed, 0002 adds a couple of relevant tests.

Incidentally, I've been working on improving test coverage for extended
stats over the past few days (it has ~80% lines covered, which is not
bad nor great). I haven't submitted that to hackers yet, because it's
mostly mechanical and it's would interfere with the two existing threads
about extended stats ...

Speaking of which, would you take a look at [1]/messages/by-id/13902317.Eha0YfKkKy@pierred-pdoc? I think supporting SAOP
is fine, but I wonder if you agree with my conclusion we can't really
support inclusion @> as explained in [2]/messages/by-id/20200202184134.swoqkqlqorqolrqv@development.

[1]: /messages/by-id/13902317.Eha0YfKkKy@pierred-pdoc
[2]: /messages/by-id/20200202184134.swoqkqlqorqolrqv@development

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Tomas Vondra (#6)

Re: Additional improvements to extended statistics

On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:

On Sun, Mar 08, 2020 at 07:17:10PM +0000, Dean Rasheed wrote:

On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses.

Hi,

I've been looking over the first patch (OR list support). It mostly
looks reasonable to me, except there's a problem with the way
statext_mcv_clauselist_selectivity() combines multiple stat_sel values
into the final result -- in the OR case, it needs to start with sel =
0, and then apply the OR formula to factor in each new estimate. I.e.,
this isn't right for an OR list:

/* Factor the estimate from this MCV to the oveall estimate. */
sel *= stat_sel;

(Oh and there's a typo in that comment: s/oveall/overall/).

For example, with the regression test data, this isn't estimated well:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;

Similarly, if no extended stats can be applied it needs to return 0
not 1, for example this query on the test data:

SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;

Ah, right. Thanks for noticing this. Attaches is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

One day I won't forget to actually attach the files ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Dean Rasheed

dean.a.rasheed@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#6)

Re: Additional improvements to extended statistics

On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Speaking of which, would you take a look at [1]? I think supporting SAOP
is fine, but I wonder if you agree with my conclusion we can't really
support inclusion @> as explained in [2].

Hmm, I'm not sure. However, thinking about your example in [2] reminds
me of a thought I had a while ago, but then forgot about --- there is
a flaw in the formula used for computing probabilities with functional
dependencies:

P(a,b) = P(a) * [f + (1-f)*P(b)]

because it might return a value that is larger that P(b), which
obviously should not be possible. We should amend that formula to
prevent a result larger than P(b). The obvious way to do that would be
to use:

P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))

but actually I think it would be better and more principled to use:

P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)

I.e., for those rows believed to be functionally dependent, we use the
minimum probability, and for the rows believed to be independent, we
use the product.

I think that would solve the problem with the example you gave at the
end of [2], but I'm not sure if it helps with the general case.

Regards,
Dean

Show quoted text

[1] /messages/by-id/13902317.Eha0YfKkKy@pierred-pdoc
[2] /messages/by-id/20200202184134.swoqkqlqorqolrqv@development

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Dean Rasheed (#8)

Re: Additional improvements to extended statistics

On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:

On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Speaking of which, would you take a look at [1]? I think supporting SAOP
is fine, but I wonder if you agree with my conclusion we can't really
support inclusion @> as explained in [2].

Hmm, I'm not sure. However, thinking about your example in [2] reminds
me of a thought I had a while ago, but then forgot about --- there is
a flaw in the formula used for computing probabilities with functional
dependencies:

P(a,b) = P(a) * [f + (1-f)*P(b)]

because it might return a value that is larger that P(b), which
obviously should not be possible.

Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:

create table t (a int, b int);
insert into t select mod(i,50), mod(i,25)
from generate_series(1,5000) s(i);
update t set a = 0 where a < 3;
create statistics s (dependencies) on a,b from t;

which then does this:

test=# explain select * from t where a = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=300 width=8)
Filter: (a = 0)
(2 rows)

test=# explain select * from t where b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=200 width=8)
Filter: (b = 0)
(2 rows)

test=# explain select * from t where a = 0 and b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..99.00 rows=283 width=8)
Filter: ((a = 0) AND (b = 0))
(2 rows)

Which I think is the issue you've described ...

We should amend that formula to prevent a result larger than P(b). The
obvious way to do that would be to use:

P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))

but actually I think it would be better and more principled to use:

P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)

I.e., for those rows believed to be functionally dependent, we use the
minimum probability, and for the rows believed to be independent, we
use the product.

Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(

We get both clauses, but we only compute selectivity of the "implied"
clause, and we leave the other one as not estimated (possibly up to
clauselist_selectivity).

It's also not clear to me how would this work for more than two clauses,
that are all functionally dependent. Like (a => b => c), for example.
But I haven't thought about this very much yet.

I think that would solve the problem with the example you gave at the
end of [2], but I'm not sure if it helps with the general case.

I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how would the proposed changes deal with this?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10

Dean Rasheed

dean.a.rasheed@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#9)

Re: Additional improvements to extended statistics

On Mon, 9 Mar 2020 at 18:19, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:>

On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:

P(a,b) = P(a) * [f + (1-f)*P(b)]

because it might return a value that is larger that P(b), which
obviously should not be possible.

Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:

create table t (a int, b int);
insert into t select mod(i,50), mod(i,25)
from generate_series(1,5000) s(i);
update t set a = 0 where a < 3;
create statistics s (dependencies) on a,b from t;

which then does this:

test=# explain select * from t where a = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=300 width=8)
Filter: (a = 0)
(2 rows)

test=# explain select * from t where b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=200 width=8)
Filter: (b = 0)
(2 rows)

test=# explain select * from t where a = 0 and b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..99.00 rows=283 width=8)
Filter: ((a = 0) AND (b = 0))
(2 rows)

Which I think is the issue you've described ...

I think this is also related to the problem that functional dependency
stats don't take into account the fact that the user clauses may not
be compatible with one another. For example:

CREATE TABLE t (a int, b int);
INSERT INTO t
SELECT x/10,x/10 FROM (SELECT generate_series(1,x)
FROM generate_series(1,100) g(x)) AS t(x);
CREATE STATISTICS s (dependencies) ON a,b FROM t;
ANALYSE t;

EXPLAIN SELECT * FROM t WHERE a = 10;

QUERY PLAN
--------------------------------------------------
Seq Scan on t (cost=0.00..86.12 rows=1 width=8)
Filter: (a = 10)
(2 rows)

EXPLAIN SELECT * FROM t WHERE b = 1;

QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.12 rows=865 width=8)
Filter: (b = 1)
(2 rows)

EXPLAIN SELECT * FROM t WHERE a = 10 AND b = 1;

QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..98.75 rows=865 width=8)
Filter: ((a = 10) AND (b = 1))
(2 rows)

whereas without stats it would estimate 1 row. That kind of
over-estimate could get very bad, so it would be good to find a way to
fix it.

We should amend that formula to prevent a result larger than P(b). The
obvious way to do that would be to use:

P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))

but actually I think it would be better and more principled to use:

P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)

I.e., for those rows believed to be functionally dependent, we use the
minimum probability, and for the rows believed to be independent, we
use the product.

Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(

I hacked on this a bit, and I think it's possible to apply dependency
stats in a more general way (not necessarily assuming equality
clauses), something like the attached very rough patch.

This approach guarantees that the result of combining a pair of
selectivities with a functional dependency between them gives a
combined selectivity that is never greater than either individual
selectivity.

One regression test fails, but looking at it, that's to be expected --
the test alters the type of a column, causing its univariate stats to
be dropped, so the single-column estimate is reduced, and the new code
refuses to give a higher estimate than the single clause's new
estimate.

It's also not clear to me how would this work for more than two clauses,
that are all functionally dependent. Like (a => b => c), for example.
But I haven't thought about this very much yet.

I attempted to solve that by computing a chain of conditional
probabilities. The maths needs checking over (as I said, this is a
very rough patch). In particular, I think it's wrong for cases like (
a->b, a->c ), but I think it's along the right lines.

I think that would solve the problem with the example you gave at the
end of [2], but I'm not sure if it helps with the general case.

I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how would the proposed changes deal with this?

With this patch, the original estimate of ~900 rows in that example is
restored with functional dependencies, because of the way it utilises
the minimum selectivity of the 2 clauses.

I've not fully thought this through, but I think it might allow
functional dependencies to be applied to a wider range of operators.

Regards,
Dean

#11

Dean Rasheed

dean.a.rasheed@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#7)

Re: Additional improvements to extended statistics

On Mon, 9 Mar 2020 at 00:06, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:

Attaches is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

One day I won't forget to actually attach the files ...

0001-0003 look reasonable to me.

One minor point -- there are now 2 code blocks that are basically the
same, looping over a list of clauses, calling clause_selectivity() and
then applying the "s1 = s1 + s2 - s1 * s2" formula. Perhaps they could
be combined into a new function (clauselist_selectivity_simple_or(),
say). I guess it would need to be passed the initial starting
selectivity s1, but it ought to help reduce code duplication.

Regards,
Dean

#12

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Dean Rasheed (#11)

Re: Additional improvements to extended statistics

On Fri, Mar 13, 2020 at 04:54:51PM +0000, Dean Rasheed wrote:

On Mon, 9 Mar 2020 at 00:06, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:

Attaches is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

One day I won't forget to actually attach the files ...

0001-0003 look reasonable to me.

One minor point -- there are now 2 code blocks that are basically the
same, looping over a list of clauses, calling clause_selectivity() and
then applying the "s1 = s1 + s2 - s1 * s2" formula. Perhaps they could
be combined into a new function (clauselist_selectivity_simple_or(),
say). I guess it would need to be passed the initial starting
selectivity s1, but it ought to help reduce code duplication.

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Tomas Vondra (#12)

Re: Additional improvements to extended statistics

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

...

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14

Thomas Munro

thomas.munro@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#13)

Re: Additional improvements to extended statistics

On Sun, Mar 15, 2020 at 1:08 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

Some comment fixes:

-               /* Check if the expression the right shape (one Var,
one Const) */
-               if (!examine_clause_args(expr->args, &var, NULL, NULL))
+               /*
+                * Check if the expression the right shape (one Var
and one Const,
+                * or two Vars).
+                */

Check if the expression "has" or "is of" the right shape.

- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.

... match the arguments to (Var op Const), (Const op Var) or (Var op Var), ...

+               /*
+                * Both variables have to be for the same relation
(otherwise it's
+                * a join clause, and we don't deal with those yet.
+                */

Missing close parenthesis.

Stimulated by some bad plans involving JSON, I found my way to your
WIP stats-on-expressions patch in this thread. Do I understand
correctly that it will eventually also support single expressions,
like CREATE STATISTICS t_distinct_abc (ndistinct) ON
(my_jsonb_column->>'abc') FROM t? It looks like that would solve
problems that otherwise require a generated column or an expression
index just to get ndistinct.

#15

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Thomas Munro (#14)

Re: Additional improvements to extended statistics

On Sun, Mar 15, 2020 at 02:48:02PM +1300, Thomas Munro wrote:

On Sun, Mar 15, 2020 at 1:08 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

Some comment fixes:
-               /* Check if the expression the right shape (one Var,
one Const) */
-               if (!examine_clause_args(expr->args, &var, NULL, NULL))
+               /*
+                * Check if the expression the right shape (one Var
and one Const,
+                * or two Vars).
+                */
Check if the expression "has" or "is of" the right shape.
- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.
... match the arguments to (Var op Const), (Const op Var) or (Var op Var), ...
+               /*
+                * Both variables have to be for the same relation
(otherwise it's
+                * a join clause, and we don't deal with those yet.
+                */
Missing close parenthesis.

Thanks, I'll get this fixed.

Stimulated by some bad plans involving JSON, I found my way to your
WIP stats-on-expressions patch in this thread. Do I understand
correctly that it will eventually also support single expressions,
like CREATE STATISTICS t_distinct_abc (ndistinct) ON
(my_jsonb_column->>'abc') FROM t? It looks like that would solve
problems that otherwise require a generated column or an expression
index just to get ndistinct.

Yes, I think that's generally the plan. I was also thinking about
inventing some sort of special JSON statistics (e.g. extracting paths
from the JSONB and computing frequencies, or something like that). But
stats on expressions are one of the things I'd like to do in PG14.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16

Dean Rasheed

dean.a.rasheed@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#13)

Re: Additional improvements to extended statistics

On Sun, 15 Mar 2020 at 00:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

Patch 0001 looks to be mostly ready. Just a couple of final comments:

+       if (is_or)
+           simple_sel = clauselist_selectivity_simple_or(root,
stat_clauses, varRelid,
+                                                         jointype,
sjinfo, NULL, 1.0);
+       else

Surely that should be passing 0.0 as the final argument, otherwise it
will always just return simple_sel = 1.0.

+        *
+        * XXX We can't multiply with current value, because for OR clauses
+        * we start with 0.0, so we simply assign to s1 directly.
+        */
+       s = statext_clauselist_selectivity(root, clauses, varRelid,
+                                          jointype, sjinfo, rel,
+                                          &estimatedclauses, true);

That final part of the comment is no longer relevant (variable s1 no
longer exists). Probably it could now just be deleted, since I think
there are sufficient comments elsewhere to explain what's going on.

Otherwise it looks good, and I think this will lead to some very
worthwhile improvements.

Regards,
Dean

#17

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Dean Rasheed (#16)

Re: Additional improvements to extended statistics

On Sun, Mar 15, 2020 at 12:37:37PM +0000, Dean Rasheed wrote:

On Sun, 15 Mar 2020 at 00:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

Patch 0001 looks to be mostly ready. Just a couple of final comments:
+       if (is_or)
+           simple_sel = clauselist_selectivity_simple_or(root,
stat_clauses, varRelid,
+                                                         jointype,
sjinfo, NULL, 1.0);
+       else
Surely that should be passing 0.0 as the final argument, otherwise it
will always just return simple_sel = 1.0.
+        *
+        * XXX We can't multiply with current value, because for OR clauses
+        * we start with 0.0, so we simply assign to s1 directly.
+        */
+       s = statext_clauselist_selectivity(root, clauses, varRelid,
+                                          jointype, sjinfo, rel,
+                                          &estimatedclauses, true);
That final part of the comment is no longer relevant (variable s1 no
longer exists). Probably it could now just be deleted, since I think
there are sufficient comments elsewhere to explain what's going on.

Otherwise it looks good, and I think this will lead to some very
worthwhile improvements.

Attached is a rebased patch series, addressing both those issues.

I've been wondering why none of the regression tests failed because of
the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
make the tests stable, all the MCV lists we use are "perfect" i.e. it
represents 100% of the data. But this selectivity is used to compute
selectivity only for the part not represented by the MCV list, i.e. it's
not really used. I suppose we could add a test that would use larger
MCV item, but I'm afraid that'd be inherently unstable :-(

Another thing I was thinking about is the changes to the API. We need to
pass information whether the clauses are connected by AND or OR to a
number of places, and 0001 does that in two ways. For some functions it
adds a new parameter (called is_or), and for other functiosn it creates
a new copy of a function. So for example

- statext_mcv_clauselist_selectivity
- statext_clauselist_selectivity

got the new flag, while e.g. clauselist_selectivity gets a new "copy"
sibling called clauselist_selectivity_or.

There were two reasons for not using flag. First, clauselist_selectivity
and similar functions have to do very different stuff for these two
cases, so it'd be just one huge if/else block. Second, minimizing
breakage of third-party code - pretty much all the extensions I've seen
only work with AND clauses, and call clauselist_selectivity. Adding a
flag would break that code. (Also, there's a bit of laziness, because
this was the simplest thing to do during development.)

But I wonder if that's sufficient reason - maybe we should just add the
flag in all cases. It might break some code, but the fix is trivial (add
a false there).

Opinions?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18

Dean Rasheed

dean.a.rasheed@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#17)

Re: Additional improvements to extended statistics

On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Attached is a rebased patch series, addressing both those issues.

I've been wondering why none of the regression tests failed because of
the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
make the tests stable, all the MCV lists we use are "perfect" i.e. it
represents 100% of the data. But this selectivity is used to compute
selectivity only for the part not represented by the MCV list, i.e. it's
not really used. I suppose we could add a test that would use larger
MCV item, but I'm afraid that'd be inherently unstable :-(

I think it ought to be possible to write stable tests for this code
branch -- I think all you need is for the number of rows to remain
small, so that the stats sample every row and are predictable, while
the MCVs don't cover all values, which can be achieved with a mix of
some common values, and some that only occur once.

I haven't tried it, but it seems like it would be possible in principle.

Another thing I was thinking about is the changes to the API. We need to
pass information whether the clauses are connected by AND or OR to a
number of places, and 0001 does that in two ways. For some functions it
adds a new parameter (called is_or), and for other functiosn it creates
a new copy of a function. So for example

- statext_mcv_clauselist_selectivity
- statext_clauselist_selectivity

got the new flag, while e.g. clauselist_selectivity gets a new "copy"
sibling called clauselist_selectivity_or.

There were two reasons for not using flag. First, clauselist_selectivity
and similar functions have to do very different stuff for these two
cases, so it'd be just one huge if/else block. Second, minimizing
breakage of third-party code - pretty much all the extensions I've seen
only work with AND clauses, and call clauselist_selectivity. Adding a
flag would break that code. (Also, there's a bit of laziness, because
this was the simplest thing to do during development.)

But I wonder if that's sufficient reason - maybe we should just add the
flag in all cases. It might break some code, but the fix is trivial (add
a false there).

Opinions?

-1

I think of clause_selectivity() and clauselist_selectivity() as the
public API that everyone is using, whilst the functions that support
lists of clauses to be combined using OR are internal (to the planner)
implementation details. I think callers of public API tend to either
have implicitly AND'ed list of clauses, or a single OR clause.

Regards,
Dean

#19

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Dean Rasheed (#18)

Re: Additional improvements to extended statistics

On Thu, Mar 19, 2020 at 07:08:07PM +0000, Dean Rasheed wrote:

On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Attached is a rebased patch series, addressing both those issues.

I've been wondering why none of the regression tests failed because of
the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
make the tests stable, all the MCV lists we use are "perfect" i.e. it
represents 100% of the data. But this selectivity is used to compute
selectivity only for the part not represented by the MCV list, i.e. it's
not really used. I suppose we could add a test that would use larger
MCV item, but I'm afraid that'd be inherently unstable :-(

I think it ought to be possible to write stable tests for this code
branch -- I think all you need is for the number of rows to remain
small, so that the stats sample every row and are predictable, while
the MCVs don't cover all values, which can be achieved with a mix of
some common values, and some that only occur once.

I haven't tried it, but it seems like it would be possible in principle.

Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.

Another thing I was thinking about is the changes to the API. We need to
pass information whether the clauses are connected by AND or OR to a
number of places, and 0001 does that in two ways. For some functions it
adds a new parameter (called is_or), and for other functiosn it creates
a new copy of a function. So for example

- statext_mcv_clauselist_selectivity
- statext_clauselist_selectivity

got the new flag, while e.g. clauselist_selectivity gets a new "copy"
sibling called clauselist_selectivity_or.

There were two reasons for not using flag. First, clauselist_selectivity
and similar functions have to do very different stuff for these two
cases, so it'd be just one huge if/else block. Second, minimizing
breakage of third-party code - pretty much all the extensions I've seen
only work with AND clauses, and call clauselist_selectivity. Adding a
flag would break that code. (Also, there's a bit of laziness, because
this was the simplest thing to do during development.)

But I wonder if that's sufficient reason - maybe we should just add the
flag in all cases. It might break some code, but the fix is trivial (add
a false there).

Opinions?

-1

I think of clause_selectivity() and clauselist_selectivity() as the
public API that everyone is using, whilst the functions that support
lists of clauses to be combined using OR are internal (to the planner)
implementation details. I think callers of public API tend to either
have implicitly AND'ed list of clauses, or a single OR clause.

OK, thanks. That was mostly my reasoning too - not wanting to cause
unnecessary breakage. And yes, I agree most people just call
clauselist_selectivity with a list of clauses combined using AND.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20

Dean Rasheed

dean.a.rasheed@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#19)

Re: Additional improvements to extended statistics

On Sat, 21 Mar 2020 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.

Looking through those new tests, another issue occurred to me -- under
some circumstances this patch can lead to extended stats being applied
twice to the same clause, which is not great, because it involves
quite a lot of extra work, and also because it can lead to
overestimates because of the way that MCV stats are applied as a delta
correction to simple_sel.

The way this comes about is as follows -- if we start with an OR
clause, that will be passed as a single-item implicitly AND'ed list to
clauselist_selectivity(), and from there to
statext_clauselist_selectivity() and then to
statext_mcv_clauselist_selectivity(). This will call
clauselist_selectivity_simple() to get simple_sel, before calling
mcv_clauselist_selectivity(), which will recursively compute all the
MCV corrections. However, the call to clauselist_selectivity_simple()
will call clause_selectivity() for each clause (just a single OR
clause in this example), which will now call
clauselist_selectivity_or(), which will go back into
statext_clauselist_selectivity() with "is_or = true", which will apply
the MCV corrections to the same set of clauses that the outer call was
about to process.

I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.

Regards,
Dean

#21

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Dean Rasheed (#20)

#22

Dean Rasheed

dean.a.rasheed@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#21)

#23