using extended statistics to improve join estimates

Started by Tomas Vondra · about 5 years ago · 33 messages · hackers
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.
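The matching step can be sketched as follows. This is an illustrative Python sketch, not the patch's C code; the function name and data shapes are invented. Each MCV item is a tuple of column values plus a frequency, and the join columns are picked out of each statistics object by position:

```python
# Sketch: match two multi-column MCV lists the way eqjoinsel_inner
# matches per-column ones.  Each MCV item is (values_tuple, frequency);
# cols1/cols2 give the positions of the join columns in each statistic.

def mcv_join_selectivity(mcv1, cols1, mcv2, cols2):
    """Sum freq1 * freq2 over item pairs that agree on the join columns."""
    total = 0.0
    for vals1, freq1 in mcv1:
        key1 = tuple(vals1[i] for i in cols1)
        for vals2, freq2 in mcv2:
            key2 = tuple(vals2[i] for i in cols2)
            if key1 == key2:
                total += freq1 * freq2
    return total

# MCV on (a, b, c) for t1 and (a, b) for t2, joined on (a, b): the
# "larger" statistic simply projects away the extra column for the key.
t1_mcv = [((1, 1, 7), 0.6), ((2, 2, 9), 0.4)]
t2_mcv = [((1, 1), 0.5), ((3, 3), 0.5)]
print(mcv_join_selectivity(t1_mcv, (0, 1), t2_mcv, (0, 1)))  # -> 0.3
```

Note that projecting a wider MCV onto a subset of columns may produce duplicate keys; summing over item pairs still accumulates the right total frequency for each key.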

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==================================================================

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
FROM generate_series(1,100000) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==================================================================

The results look like this:

- actual rows: 100000000
- estimated (no stats): 1003638
- estimated (stats, 100): 100247844
- estimated (stats, 10k): 100000000

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lists,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column) but there are dependencies
between that column and the other filters, or when the filters on the
two sides are correlated with each other.
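The filtering idea can be sketched like this (illustrative Python, not the patch's C code; the MCV data mimics the example table, where a = b = c = d):

```python
# Sketch: apply baserel restrictions as conditions on the MCV items
# before matching, making the join estimate conditional on the filters.

def filter_mcv(mcv, cond):
    """Keep only the MCV items whose values satisfy the restriction."""
    return [(vals, freq) for vals, freq in mcv if cond(vals)]

# MCV on (a, b, c, d): 100 equally likely combinations with a = b = c = d.
mcv = [((i, i, i, i), 0.01) for i in range(100)]

mcv_t1 = filter_mcv(mcv, lambda v: v[2] < 25)   # t1.c < 25
mcv_t2 = filter_mcv(mcv, lambda v: v[3] > 75)   # t2.d > 75
keys1 = {v[:2] for v, _ in mcv_t1}              # join keys (a, b)
keys2 = {v[:2] for v, _ in mcv_t2}
print(len(keys1 & keys2))  # -> 0: the two filters are contradictory
```

With the dependency a = b = c = d, the two filtered lists share no (a, b) values at all, which is exactly why the second query below can be recognized as producing (nearly) no rows.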

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

                             Q1          Q2
 - actual rows:              25000000    0
 - estimated (no stats):     62158       60241
 - estimated (stats, 100):   25047900    1
 - estimated (stats, 10k):   25000000    1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX comments, describing
things I ignored to keep it simple, possible future improvements, and so on.

regards

[1]: /messages/by-id/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

extended-stats-joins-poc.patch (text/x-patch; +707 -1)
#2 Zhihong Yu
zyu@yugabyte.com
In reply to: Tomas Vondra (#1)
Re: using extended statistics to improve join estimates

Hi,

+ * has_matching_mcv
+ *     Check whether the list contains statistic of a given kind

The function is actually named find_matching_mcv(), so the comment doesn't
match. It seems the function initially returned bool but the return type
was changed later.

+ StatisticExtInfo *found = NULL;

A variable named "found" normally suggests a bool return value. Maybe call
the variable matching_mcv or something similar.

+           /* skip items eliminated by restrictions on rel2 */
+           if (conditions2 && !conditions2[j])
+               continue;

Maybe you can add a counter recording the number of non-skipped items for
the inner loop. If this counter is 0 after the completion of one iteration,
we come out of the outer loop directly.
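The suggested micro-optimization might look like this (Python sketch with invented names, not the patch's C code):

```python
# Sketch: count the inner-loop items that survive the conditions2 filter;
# if none survive during the first outer iteration, every later iteration
# would scan the same empty set, so we can leave the outer loop directly.

def count_matches(items1, items2, conditions2, match):
    nmatches = 0
    for a in items1:
        surviving = 0
        for j, b in enumerate(items2):
            # skip items eliminated by restrictions on rel2
            if conditions2 and not conditions2[j]:
                continue
            surviving += 1
            if match(a, b):
                nmatches += 1
        if surviving == 0:
            break  # nothing admissible on the inner side at all
    return nmatches

eq = lambda a, b: a == b
print(count_matches([1, 2, 3], [3, 1, 2], [True, True, True], eq))     # -> 3
print(count_matches([1, 2, 3], [3, 1, 2], [False, False, False], eq))  # -> 0
```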

Cheers

On Wed, Mar 31, 2021 at 10:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:


#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: using extended statistics to improve join estimates

Hi,

Here's a slightly improved / cleaned up version of the PoC patch,
removing a bunch of XXX and FIXMEs, adding comments, etc.

The approach is sound in principle, I think, although there's still a
bunch of things to address:

1) statext_compare_mcvs only really deals with equijoins / inner joins
at the moment, as it's based on eqjoinsel_inner. It's probably desirable
to add support for additional join types (inequality and outer joins).

2) Some of the steps are performed multiple times - e.g. matching base
restrictions to statistics, etc. Those probably can be cached somehow,
to reduce the overhead.

3) The logic of picking the statistics to apply is somewhat simplistic,
and maybe could be improved in some way. OTOH the number of candidate
statistics is likely low, so this is not a big issue.

4) statext_compare_mcvs is based on eqjoinsel_inner and makes a bunch of
assumptions similar to the original, but some of those assumptions may
be wrong in multi-column case, particularly when working with a subset
of columns. For example (ndistinct - size(MCV)) may not be the number of
distinct combinations outside the MCV, when ignoring some columns. Same
for nullfract, and so on. I'm not sure we can do much more than pick
some reasonable approximation.
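For context, a much-simplified rendering of how the eqjoinsel_inner-style total combines the MCV match with the part of the data the MCVs don't cover (Python sketch, not the patch's C code). The (ndistinct - len(mcv)) term is exactly the assumption questioned above: with a subset of columns it may not equal the number of distinct combinations outside the MCV.

```python
# Sketch: MCV match frequency plus an even-distribution guess for the
# mass outside each side's MCV.  Nullfrac handling is omitted entirely.

def join_selectivity(mcv1, nd1, mcv2, nd2):
    freqs2 = dict(mcv2)
    matchfreq = sum(f1 * freqs2.get(key, 0.0) for key, f1 in mcv1)
    otherfreq1 = 1.0 - sum(f for _, f in mcv1)   # mass outside MCV1
    otherfreq2 = 1.0 - sum(f for _, f in mcv2)
    rest1 = max(nd1 - len(mcv1), 1)  # distinct values assumed outside MCV1
    rest2 = max(nd2 - len(mcv2), 1)  # ... the questionable assumption
    # assume the non-MCV mass is spread evenly over the remaining values
    return matchfreq + otherfreq1 * otherfreq2 / max(rest1, rest2)

# one shared MCV value covering half of each side, one distinct value left
print(join_selectivity([((1,), 0.5)], 2, [((1,), 0.5)], 2))  # -> 0.5
```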

5) There are no regression tests at the moment. Clearly a gap.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

extended-stats-joins-20210614.patch (text/x-patch; +970 -1)
#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#3)
Re: using extended statistics to improve join estimates

Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

-- CREATE STATISTICS s1 ON (a+1), b FROM t1;
-- CREATE STATISTICS s2 ON (a+1), b FROM t2;

SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may be impractical
(e.g. if one side does not have correlated columns, it's silly to have to
add them just to make this patch happy).

For example you may have extended stats on the dimension table but not
the fact table, and the patch still can combine those two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have extended statistics MCV, or
when one side has extended MCV and the other side regular MCV. It might
seem silly, but the extended MCV allows considering additional baserel
conditions (if there are any).
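One way to picture the mixed case (a hypothetical sketch, not the patch's actual data structures): a per-column MCV can be adapted into a one-attribute extended MCV, so a single matching routine serves both the extended/extended and extended/per-column combinations.

```python
# Sketch: adapt a per-column MCV (value -> frequency) into the
# (values_tuple, frequency) shape of an extended MCV.

def as_extended(per_column_mcv):
    return [((value,), freq) for value, freq in per_column_mcv]

def match_freq(mcv1, mcv2):
    """Sum of frequency products over items with equal key tuples."""
    freqs2 = dict(mcv2)
    return sum(f1 * freqs2.get(key, 0.0) for key, f1 in mcv1)

ext = [((1,), 0.4), ((2,), 0.3)]           # extended MCV on one expression
plain = as_extended([(1, 0.5), (3, 0.2)])  # per-column MCV, adapted
print(match_freq(ext, plain))  # -> 0.2
```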

examples
========

The table / data is very simple, but hopefully good enough for some
simple examples.

create table t1 (a int, b int, c int);
create table t2 (a int, b int, c int);

insert into t1 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

insert into t2 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=31.00..106.00 rows=400 width=24)
(actual time=5.426..22.678 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=28.50..160.75 rows=10000 width=24)
(actual time=5.325..19.760 rows=10000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=24.50..104.75 rows=4800 width=24)
(actual time=5.618..5.639 rows=0 loops=1)

Now, let's create statistics:

create statistics s1 on a, b, c from t1 ;
create statistics s2 on a, b, c from t2 ;
analyze t1, t2;

and now the same queries again:

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=31.00..106.00 rows=20000 width=24)
(actual time=5.448..22.713 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=28.50..160.75 rows=10000 width=24)
(actual time=5.317..16.680 rows=10000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=24.50..104.75 rows=1 width=24)
(actual time=5.647..5.667 rows=0 loops=1)

Those examples are a bit simplistic, but the improvements are fairly
clear I think.

limitations & open issues
=========================

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make selection of statistics much more
complicated, because we have to pick "matching" statistics on both sides
of the join. So it seems like overkill, and most joins have very few
conditions anyway.

2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported

t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))

t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))

I'm not entirely sure these restrictions can be relaxed. It's not that
difficult to evaluate these cases when matching items between the MCV
lists, similarly to how we evaluate bitmaps for baserel estimates.

But I'm not sure what to do about the part not covered by the MCV lists.
The eqjoinsel() approach uses ndistinct estimates for that, but that
only works for AND clauses, I think. How would that work for OR?

Similarly, I'm not sure we can do much for non-equality conditions, but
those are currently estimated as default selectivity in selfuncs.c.

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

explain analyze
select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those. I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit of
an artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Estimate-joins-using-extended-statistics-20211006.patch (text/x-patch; +1885 -2)
#5 Zhihong Yu
zyu@yugabyte.com
In reply to: Tomas Vondra (#4)
Re: using extended statistics to improve join estimates

On Wed, Oct 6, 2021 at 12:33 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:


Hi,

+       conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+       /* if the new statistics covers more conditions, use it */
+       if (list_length(conditions2) > list_length(conditions1))
+       {
+           mcv = stat;

It seems conditions2 is calculated using mcv, so I wonder why mcv is
replaced by stat (for conditions1, whose length is shorter)?

Cheers

#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Zhihong Yu (#5)
Re: using extended statistics to improve join estimates

On 10/6/21 23:03, Zhihong Yu wrote:

Hi,

+       conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+       /* if the new statistics covers more conditions, use it */
+       if (list_length(conditions2) > list_length(conditions1))
+       {
+           mcv = stat;

It seems conditions2 is calculated using mcv, I wonder why mcv is
replaced by stat (for conditions1 whose length is shorter) ?

Yeah, that's wrong - it should be the other way around, i.e.

if (list_length(conditions1) > list_length(conditions2))

There's no test with multiple candidate statistics yet, so this went
unnoticed :-/

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Tomas Vondra (#4)
Re: using extended statistics to improve join estimates

Hi Tomas:

This is the exact patch I want, thanks for the patch!

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

explain analyze
select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those.

Actually, I can't understand how this works even for a simpler example.
Let's say we query like this (using ONLY t2's columns to join t3):

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well for JoinRel(t1, t2) and JoinRel(t2, t3). But when it
comes to JoinRel(t1, t2, t3), we don't maintain the MCV on the join rel,
so it is hard to make this work. Here I see your solution is splitting it
into (t1, t2) and (t2, t3) and estimating those. But how does this help
to estimate the size of JoinRel(t1, t2, t3)?

I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

I guess we can keep the MCVs on the joinrel for these matches. Take the
above query I provided for example, and suppose the MCV data is as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below

(1, 2, 1, 2) -> 0.1 * 0.2
(1, 3, 1, 3) -> 0.2 * 0.1

And record the total MCV frequency as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1).

With this design, the number of MCV items on the joinrel would be no
larger than on either baserel.

And since we handle the equi-join as well, we can even record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;

As for when we should maintain the JoinRel's MCV data: rather than
maintaining it right after the JoinRel size is estimated, we could
compute it only when it is needed. For example, in

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3) since nothing else
needs it at all. However, I haven't checked the code closely enough to
see whether lazily computing the MCV on a joinrel is convenient to do.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit
artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.

Overall, thanks for the feature, and I expect more cases to handle will
come up during the discussion. To make the review process more efficient,
I suggest we split the patch into smaller ones and review/commit them
separately once we have roughly finalized the design. For example:

Patch 1 -- require both sides to have extended statistics.
Patch 2 -- require one side to have extended statistics and the other
side to have a per-column MCV.
Patch 3 -- handle the case like WHERE t1.a = t2.a AND t1.b = Const.
Patch 4 -- handle the case of 3+ table joins.
Patch 5 -- support outer joins.

I think we can do this if we are sure that each individual patch would work in
some cases and would not make any other case worse. If you agree with this,
I can do that splitting work during my review process.

--
Best Regards
Andy Fan (https://www.aliyun.com/)

#8 Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#4)
Re: using extended statistics to improve join estimates

Your regression tests include two errors, which appear to be accidental, and
fixing the error shows that this case is being estimated poorly.

+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
...
+ERROR:  statistics object "join_stats_2" already exists

--
Justin

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#8)
Re: using extended statistics to improve join estimates

On 11/22/21 02:23, Justin Pryzby wrote:

Your regression tests include two errors, which appear to be accidental, and
fixing the error shows that this case is being estimated poorly.

+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
...
+ERROR:  statistics object "join_stats_2" already exists

D'oh, what a silly mistake ...

You're right that fixing the DROP STATISTICS results in a worse estimate,
but that's actually expected, for a fairly simple reason.
has expressions on both sides, and dropping the statistics means we
don't have any MCV for the join_test_2 side. So the optimizer ends up
not using the regular estimates, as if there were no extended stats.

A couple lines later the script creates an extended statistics on that
expression alone, which fixes this. An expression index would do the
trick too.

Attached is a patch fixing the test and also the issue reported by
Zhihong Yu some time ago.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Estimate-joins-using-extended-statistics-20211213.patch (text/x-patch; +1886 -2)
#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andy Fan (#7)
Re: using extended statistics to improve join estimates

On 11/6/21 11:03, Andy Fan wrote:

Hi Tomas:

This is the exact patch I want, thanks for the patch!

Good to hear.

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

explain analyze
select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those.

Actually I can't understand how this works even for a simpler example.
Let's say we query like this (using ONLY t2's columns to join t3):

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2) and JoinRel(t2, t3). But when it
comes to JoinRel(t1, t2, t3), we don't maintain an MCV on the join rel,
so it is hard to make it work. Here I see your solution is to split it
into (t1, t2) and (t2, t3) and estimate those. But how does this help
to estimate the size of JoinRel(t1, t2, t3)?

Yeah, this is really confusing. The crucial thing to keep in mind is
this works with clauses before running setrefs.c, so the clauses
reference the original relations - *not* the join relation. Otherwise
even the regular estimation would not work, because where would it get
the per-column MCV lists etc.

Let's use a simple case with join clauses referencing just a single
attribute for each pair of relations. And let's talk about how many join
pairs it'll extract:

t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but the join clause references just 2 rels, so
1 join pair (t1,t3)

Now a more complicated case, with more complex join clause

t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but this time the join clause references all
three rels, so we have two join pairs (t1,t3) and (t2,t3) and we can use
all the statistics.
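The pair extraction described above can be sketched as a toy model (plain
Python for illustration, not the patch's actual C data structures; the
function and clause representation are hypothetical):

```python
# Hypothetical sketch of the pair-extraction logic described above:
# each (equality) join clause references exactly two base relations,
# and clauses are grouped by that pair, so
#   t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)
# yields the pairs (t1,t2), (t1,t3) and (t2,t3).

def extract_join_pairs(clauses):
    """Group clauses by the (sorted) pair of relations they reference."""
    pairs = {}
    for left_rel, right_rel, clause in clauses:
        key = tuple(sorted((left_rel, right_rel)))
        pairs.setdefault(key, []).append(clause)
    return pairs

clauses = [
    ("t1", "t2", "t1.a = t2.a"),
    ("t1", "t3", "t1.b = t3.b"),
    ("t2", "t3", "t2.c = t3.c"),
]
pairs = extract_join_pairs(clauses)
# Three join pairs, each estimated independently:
#   ('t1','t2'), ('t1','t3'), ('t2','t3')
```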

I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

I guess we can keep the MCVs on joinrel for these matches. Take the above
query I provided for example, and suppose the MCV data as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below

(1, 2, 1, 2) -> 0.1 * 0.2
(1, 3, 1, 3) -> 0.2 * 0.1

And recording the total MCV frequency as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1)
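The arithmetic above can be sketched as a small model (plain Python for
illustration only, not the patch's C code), using the example MCV lists:

```python
# Illustrative model of the MCV combination described above: match the
# items of the two MCV lists on the join columns and multiply their
# frequencies, much like eqjoinsel_inner does for per-column MCVs.

mcv_t1 = {(1, 2): 0.1, (1, 3): 0.2, (2, 3): 0.5, (2, 8): 0.1}
mcv_t2 = {(1, 2): 0.2, (1, 3): 0.1, (2, 4): 0.2, (2, 10): 0.1}

# Join clause t1.a = t2.a AND t1.b = t2.b: items match when equal.
combined = {
    item: f1 * mcv_t2[item]
    for item, f1 in mcv_t1.items()
    if item in mcv_t2
}
# Matched items: (1, 2) -> 0.1 * 0.2 and (1, 3) -> 0.2 * 0.1

# Total MCV frequency covered on each side, needed to estimate the
# remainder not represented by the MCV lists:
total_t1 = sum(mcv_t1.values())   # ~0.9
total_t2 = sum(mcv_t2.values())   # ~0.6
```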

Right, that's about the joint distribution of the whole join.

With this design, the nitems of the MCV on the joinrel would be fewer
than on either baserel.

Actually, I think the number of items can grow, because the matches may
duplicate some items. For example in your example with (t1.a = t2.a),
the first (1,2) item in t1 matches (1,2) and (1,3) in t2. And the same
for (1,3) in t1. So that's 4 combinations. Of course, we could aggregate
the MCV by ignoring columns not used in the query.
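This growth is easy to see with a tiny sketch (plain Python, illustrative
only), matching the example MCVs on just the first column:

```python
# Sketch of the cross-matching described above: with only the clause
# t1.a = t2.a, the item (1, 2) in t1 matches both (1, 2) and (1, 3) in
# t2 (and likewise for (1, 3) in t1), so the combined list can hold
# more items than either input MCV list.

mcv_t1 = {(1, 2): 0.1, (1, 3): 0.2}
mcv_t2 = {(1, 2): 0.2, (1, 3): 0.1}

combined = [
    ((a1, b1, a2, b2), f1 * f2)
    for (a1, b1), f1 in mcv_t1.items()
    for (a2, b2), f2 in mcv_t2.items()
    if a1 == a2                      # join clause: t1.a = t2.a only
]
# 4 combinations built from just 2 items on each side
```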

and since we handle the eqjoin as well, we can even record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;

As for when we should maintain the JoinRel's MCV data: rather than
maintaining it right after the JoinRel size is estimated, we could
compute it lazily, only when it is needed. For example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3) since nothing else
needs it at all. However, I haven't checked the code closely enough to
see whether lazily computing the MCV on the joinrel is convenient to do.

I'm not sure I understand what you're proposing here.

However, I think that estimating it for pairs has two advantages:

1) Combining MCVs for k relations requires k for loops. Processing 2
relations at a time limits the amount of CPU we need. Of course, this
assumes the joins are independent, which may or may not be true.

2) It seems fairly easy to combine different types of statistics
(regular, extended, ...), and also consider the part not represented by
MCV. It seems much harder when joining more than 2 relations.

I'm also worried about amplification of errors - I suspect attempting to
build the joint MCV for the whole join relation may produce significant
estimation errors.

Furthermore, I think joins with clauses referencing more than just two
relations are fairly uncommon. And we can always improve the feature in
this direction in the future.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit of
an artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.

Overall, thanks for the feature and I expect there are more cases to
handle during the discussion. To make the review process more efficient,
I suggest that we split the patch into smaller ones and review/commit
them separately once we have roughly finalized the design. For example:

Patch 1 -- require both sides to have extended statistics.
Patch 2 -- require one side to have extended statistics and the other
side to have a per-column MCV.
Patch 3 -- handle cases like WHERE t1.a = t2.a AND t1.b = Const;
Patch 4 -- handle 3+ table joins.
Patch 5 -- support outer joins.

I think we can do this if we are sure that each individual patch would work in
some cases and would not make any other case worse. If you agree with this,
I can do that splitting work during my review process.

I'll consider splitting it like this, but I'm not sure it makes the main
patch that much smaller.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#9)
Re: using extended statistics to improve join estimates

Hi,

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Estimate-joins-using-extended-statistics-20220101.patch (text/x-patch, +1886 -2)
#12Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#11)
Re: using extended statistics to improve join estimates

On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:

https://cirrus-ci.com/task/6067262669979648?logs=gcc_warning#L505
[18:52:15.132] time make -s -j${BUILD_JOBS} world-bin
[18:52:22.697] mcv.c: In function ‘mcv_combine_simple’:
[18:52:22.697] mcv.c:2787:7: error: ‘reverse’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[18:52:22.697] 2787 | if (reverse)
[18:52:22.697] | ^
[18:52:22.697] mcv.c:2766:27: error: ‘index’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[18:52:22.697] 2766 | if (mcv->items[i].isnull[index])
[18:52:22.697] | ^

Greetings,

Andres Freund

#13Julien Rouhaud
rjuju123@gmail.com
In reply to: Andres Freund (#12)
Re: using extended statistics to improve join estimates

Hi,

On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:

On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:

Also the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3055.log
=== Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
=== applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
patching file src/backend/optimizer/path/clausesel.c
patching file src/backend/statistics/extended_stats.c
Hunk #1 FAILED at 30.
Hunk #2 succeeded at 102 (offset 1 line).
Hunk #3 succeeded at 2619 (offset 9 lines).
1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej

#14Justin Pryzby
pryzby@telsasoft.com
In reply to: Julien Rouhaud (#13)
Re: using extended statistics to improve join estimates

On Wed, Jan 19, 2022 at 06:18:09PM +0800, Julien Rouhaud wrote:

On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:

On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:

Also the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3055.log
=== Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
=== applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
patching file src/backend/optimizer/path/clausesel.c
patching file src/backend/statistics/extended_stats.c
Hunk #1 FAILED at 30.
Hunk #2 succeeded at 102 (offset 1 line).
Hunk #3 succeeded at 2619 (offset 9 lines).
1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej

Rebased over 269b532ae and muted compiler warnings.

Tomas - is this patch viable for pg15, or should it move to the next CF?

In case it's useful, I ran this on cirrus with my branch for code coverage.
https://cirrus-ci.com/task/5816731397521408
https://api.cirrus-ci.com/v1/artifact/task/5816731397521408/coverage/coverage/00-index.html

statext_find_matching_mcv() has poor coverage.
statext_clauselist_join_selectivity() has poor coverage for the "stats2" case.

In mcv.c: mcv_combine_extended() and mcv_combine_simple() have poor coverage
for the "else if" cases (does it matter?)

Not related to this patch:
build_attnums_array() isn't being hit.

Same at statext_is_compatible_clause_internal()
1538 0 : *exprs = lappend(*exprs, clause);

statext_mcv_[de]serialize() aren't being hit for cstrings.

--
Justin

#15Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#14)
Re: using extended statistics to improve join estimates

On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:

Rebased over 269b532ae and muted compiler warnings.

And attached.

Attachments:

0001-Estimate-joins-using-extended-statistics.patch (text/x-diff, +1889 -2)
#16Andy Fan
zhihui.fan1213@gmail.com
In reply to: Justin Pryzby (#15)
Re: using extended statistics to improve join estimates

On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:

Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the patch at the code level, I
want to explain my understanding of this patch first.

Before this patch, we already use MCV information for eqjoinsel; it
works by combining the MCVs on both sides to figure out the mcv_freq
and then treating the rest equally. But this doesn't work for MCVs in
extended statistics, and this patch fills that gap. Besides that, since
extended statistics mean more than one column is involved, if 1+
columns are Const based on the RestrictInfo, we can use that
information to filter the MCVs we are interested in, which is really cool.

I did some more testing, all of it with inner joins so far; all of it
works amazingly well, and I am surprised this patch didn't draw more
attention. I will test more after I go through the code.

As for the code level, I reviewed it in a top-down manner and am almost
40% complete. Here are some findings, just FYI. For efficiency, I
provide each piece of feedback as an individual commit; after all, I
want to make sure my comments are practical, and coding and testing is
a good way to achieve that. I tried to make each commit as small as
possible so that you can reject or accept them conveniently.

0001 is your patch; I just rebased it against current master. 0006 is
not very relevant to the current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

--
Best Regards
Andy Fan

Attachments:

v1-0001-Estimate-joins-using-extended-statistics.patch (text/x-diff, +1890 -2)
v1-0002-Remove-estimiatedcluases-and-varRelid-arguments.patch (text/x-diff, +21 -24)
v1-0003-Remove-SpecialJoinInfo-sjinfo-argument.patch (text/x-diff, +9 -14)
v1-0004-Remove-joinType-argument.patch (text/x-diff, +2 -7)
v1-0005-use-the-pre-calculated-RestrictInfo-left-right_re.patch (text/x-diff, +6 -30)
v1-0006-Fast-path-for-general-clauselist_selectivity.patch (text/x-diff, +3 -1)
v1-0007-bms_is_empty-is-more-effective-than-bms_num_membe.patch (text/x-diff, +1 -2)
v1-0008-a-branch-of-updates-around-JoinPairInfo.patch (text/x-diff, +44 -47)
#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andy Fan (#16)
Re: using extended statistics to improve join estimates

On 4/2/24 10:23, Andy Fan wrote:

On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:

Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the patch at the code level, I
want to explain my understanding of this patch first.

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

Before this patch, we already use MCV information for eqjoinsel; it
works by combining the MCVs on both sides to figure out the mcv_freq
and then treating the rest equally. But this doesn't work for MCVs in
extended statistics, and this patch fills that gap. Besides that, since
extended statistics mean more than one column is involved, if 1+
columns are Const based on the RestrictInfo, we can use that
information to filter the MCVs we are interested in, which is really cool.

Yes, I think that's an accurate description of what the patch does.

I did some more testing, all of it with inner joins so far; all of it
works amazingly well, and I am surprised this patch didn't draw more
attention. I will test more after I go through the code.

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and
forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible
solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.

As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've
been thinking mostly about "star join" queries, where a big "fact"
table joins to small dimensions. And in that case the fact table may
have an MCV, but the dimensions certainly don't have any (because the
join happens on a PK).

But maybe that's the wrong way to think about it - it was clearly
useful to consider the case with (per-attribute) MCVs on both sides as
worth special handling. So why not do that for multi-column MCVs, right?

As for the code level, I reviewed it in a top-down manner and am almost
40% complete. Here are some findings, just FYI. For efficiency, I
provide each piece of feedback as an individual commit; after all, I
want to make sure my comments are practical, and coding and testing is
a good way to achieve that. I tried to make each commit as small as
possible so that you can reject or accept them conveniently.

0001 is your patch; I just rebased it against current master. 0006 is
not very relevant to the current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Andy Fan
zhihui.fan1213@gmail.com
In reply to: Tomas Vondra (#17)
Re: using extended statistics to improve join estimates

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 4/2/24 10:23, Andy Fan wrote:

On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:

Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the patch at the code level, I
want to explain my understanding of this patch first.

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

OK, I'll try to move it forward.

Before this patch, we already use MCV information for eqjoinsel; it
works by combining the MCVs on both sides to figure out the mcv_freq
and then treating the rest equally. But this doesn't work for MCVs in
extended statistics, and this patch fills that gap. Besides that, since
extended statistics mean more than one column is involved, if 1+
columns are Const based on the RestrictInfo, we can use that
information to filter the MCVs we are interested in, which is really cool.

Yes, I think that's an accurate description of what the patch does.

Great to know that:)

I did some more testing, all of it with inner joins so far; all of it
works amazingly well, and I am surprised this patch didn't draw more
attention.

I think it didn't go forward for a bunch of reasons:

..

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.

As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've
been thinking mostly about "star join" queries, where a big "fact"
table joins to small dimensions. And in that case the fact table may
have an MCV, but the dimensions certainly don't have any (because the
join happens on a PK).

But maybe that's the wrong way to think about it - it was clearly
useful to consider the case with (per-attribute) MCVs on both sides as
worth special handling. So why not do that for multi-column MCVs, right?

Yes, that's what my current understanding is.

There are some cases where there are 2+ join clauses between two tables
AND the row estimate is bad AND the plan is not the best one. In such
situations, I'd expect this patch to be helpful. The case currently in
hand is on PG11, where there is no MCV information for extended
statistics, so I can't even verify manually whether the patch would
help. When I see such cases again on a newer version of PG, I can check
manually whether the row estimates improve.

As for the code level, I reviewed it in a top-down manner and am almost
40% complete. Here are some findings, just FYI. For efficiency, I
provide each piece of feedback as an individual commit; after all, I
want to make sure my comments are practical, and coding and testing is
a good way to achieve that. I tried to make each commit as small as
possible so that you can reject or accept them conveniently.

0001 is your patch; I just rebased it against current master. 0006 is
not very relevant to the current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before that.

--
Best Regards
Andy Fan

#19Justin Pryzby
pryzby@telsasoft.com
In reply to: Andy Fan (#16)
Re: using extended statistics to improve join estimates

On Tue, Apr 02, 2024 at 04:23:45PM +0800, Andy Fan wrote:

0001 is your patch, I just rebase them against the current master. 0006
is not much relevant with current patch, and I think it can be committed
individually if you are OK with that.

Your 002 should also remove listidx to avoid warning
../src/backend/statistics/extended_stats.c:2879:8: error: variable 'listidx' set but not used [-Werror,-Wunused-but-set-variable]

Subject: [PATCH v1 2/8] Remove estimiatedcluases and varRelid arguments

@@ -2939,15 +2939,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
/* needs to happen before skipping any clauses */
listidx++;

- /* Skip clauses that were already estimated. */
- if (bms_is_member(listidx, estimatedclauses))
- continue;
-

Your 007 could instead test if relids == NULL:

Subject: [PATCH v1 7/8] bms_is_empty is more effective than bms_num_members(b)
-       if (bms_num_members(relids) == 0)
+       if (bms_is_empty(relids))

typos:
001: s/heuristict/heuristics/
002: s/grantee/guarantee/
002: s/estimiatedcluases/estimatedclauses/

It'd be nice to fix/silence these warnings from 001:

|../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be used uninitialized [-Wmaybe-uninitialized]
| 3151 | if (var->varno != relid)
| | ^
|../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was declared here
| 3104 | int relid;
| | ^~~~~
|[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
|../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
|../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ shadows a previous local [-Wshadow=compatible-local]

FYI, I also ran the patch with a $large number of reports without
observing any errors or crashes.

I'll try to look harder at the next patch revision.

--
Justin

#20Andy Fan
zhihui.fan1213@gmail.com
In reply to: Andy Fan (#18)
Re: using extended statistics to improve join estimates

Hello Tomas!

As for the code level, I reviewed it in a top-down manner and am almost
40% complete. Here are some findings, just FYI. For efficiency, I
provide each piece of feedback as an individual commit; after all, I
want to make sure my comments are practical, and coding and testing is
a good way to achieve that. I tried to make each commit as small as
possible so that you can reject or accept them conveniently.

0001 is your patch; I just rebased it against current master. 0006 is
not very relevant to the current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before that.

I have completed my code-level review and modifications. These
individual commits and messages will probably be helpful for discussion.

--
Best Regards
Andy Fan

Attachments:

ext_stats_on_join.tar (application/x-tar, +2290 -302)
#21Andy Fan
zhihui.fan1213@gmail.com
In reply to: Justin Pryzby (#19)
#22Justin Pryzby
pryzby@telsasoft.com
In reply to: Andy Fan (#21)
#23Andy Fan
zhihui.fan1213@gmail.com
In reply to: Justin Pryzby (#22)
#24Andrei Lepikhov
lepihov@gmail.com
In reply to: Tomas Vondra (#17)
#25Andy Fan
zhihui.fan1213@gmail.com
In reply to: Andrei Lepikhov (#24)
#26Andrei Lepikhov
lepihov@gmail.com
In reply to: Andy Fan (#25)
#27Andrei Lepikhov
lepihov@gmail.com
In reply to: Andrei Lepikhov (#26)
#28Andy Fan
zhihui.fan1213@gmail.com
In reply to: Andrei Lepikhov (#26)
#29Andrei Lepikhov
lepihov@gmail.com
In reply to: Andy Fan (#28)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrei Lepikhov (#29)
#31Andrei Lepikhov
lepihov@gmail.com
In reply to: Tomas Vondra (#30)
#32Andrei Lepikhov
lepihov@gmail.com
In reply to: Andrei Lepikhov (#31)
#33Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Tomas Vondra (#30)