improving GROUP BY estimation

Started by Tomas Vondra, almost 10 years ago, 34 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com
1 attachment(s)

Hi,

while working on using correlation statistics when estimating GROUP BY
(or generally anything that uses estimate_num_groups), I've noticed that
we apply the selectivity in a rather naive way.

After estimating the number of groups in the whole table - either by reading
pg_stats.n_distinct for a single column, or by multiplying (and doing some
additional magic) for a group of columns - we do this:

reldistinct *= rel->rows / rel->tuples;

which essentially applies the selectivity to the ndistinct estimate.

So when you have a table with 1000 distinct values in a column, and we
read 10% of the table, we expect to get 100 distinct values.

But consider, for example, a table with 100M rows where the column has
n_distinct=10000 and we're reading 1% of the table (i.e. 1M rows). The
current code will estimate the GROUP BY cardinality to be 100 (1% of
n_distinct), but in reality we're likely to see nearly all 10000
distinct values.

Example:

CREATE TABLE t (a INT, b FLOAT);
INSERT INTO t SELECT (100000 * random())::int, random()
FROM generate_series(1,10000000) s(i);
ANALYZE t;

SELECT n_distinct FROM pg_stats WHERE attname = 'a';

n_distinct
------------
98882

-- 1% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.01 GROUP BY a;

QUERY PLAN
-------------------------------------------------------------------------
HashAggregate (cost=179514.33..179523.45 rows=912 width=4)
(actual time=916.224..955.136 rows=63492 loops=1)
Group Key: a
-> Seq Scan on t (cost=0.00..179053.25 rows=92216 width=4)
(actual time=0.103..830.104 rows=100268 loops=1)
Filter: (b < '0.01'::double precision)
Rows Removed by Filter: 9899732
Planning time: 1.237 ms
Execution time: 982.600 ms
(7 rows)

-- 5% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.05 GROUP BY a;

QUERY PLAN
-------------------------------------------------------------------------
HashAggregate (cost=180296.06..180345.22 rows=4916 width=4)
(actual time=1379.519..1440.005 rows=99348 loops=1)
Group Key: a
-> Seq Scan on t (cost=0.00..179053.25 rows=497123 width=4)
(actual time=0.096..1000.494 rows=500320 loops=1)
Filter: (b < '0.05'::double precision)
Rows Removed by Filter: 9499680
Planning time: 0.129 ms
Execution time: 1480.986 ms
(7 rows)

This gets especially bad when reading only a small fraction of a
huge table - it's easy to construct cases with arbitrarily large errors.
And it's worth noting that the error is almost exclusively a massive
under-estimate, i.e. in the "wrong" direction, as HashAggregate is still
vulnerable to OOM (and again, the larger the table, the worse it gets).

So I think a more appropriate way to estimate this would be to find the
expected number of distinct values when sampling with replacement, as
explained for example in [1].

Applied to the code, it means changing the line from

reldistinct *= rel->rows / rel->tuples;

to

reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
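
To make the difference concrete, here is a small standalone sketch (an
illustration only, not part of the patch) that plugs the numbers from the
100M-row example above into both the old and the new formula; the variable
names are local to the sketch:

/*
 * Illustration: old vs. new estimate for a table with 100M rows,
 * n_distinct = 10000, and a filter selecting 1% of the table (1M rows).
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  tuples = 100000000.0;   /* rows in the table */
    double  rows = 1000000.0;       /* rows surviving the filter (1%) */
    double  reldistinct = 10000.0;  /* per-column n_distinct */

    double  old_estimate = reldistinct * (rows / tuples);
    double  new_estimate = reldistinct *
        (1 - powl((reldistinct - 1) / reldistinct, rows));

    printf("old: %.0f  new: %.0f\n", old_estimate, new_estimate);
    /* prints roughly: old: 100  new: 10000 */
    return 0;
}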

With this change, the estimates for the examples look like this:

QUERY PLAN
-------------------------------------------------------------------------
HashAggregate (cost=179283.79..179883.48 rows=59969 width=4)
(actual time=897.071..934.764 rows=63492 loops=1)
Group Key: a
-> Seq Scan on t (cost=0.00..179053.25 rows=92216 width=4)
(actual time=0.104..820.276 rows=100268 loops=1)
Filter: (b < '0.01'::double precision)
Rows Removed by Filter: 9899732
Planning time: 1.261 ms
Execution time: 962.812 ms
(7 rows)

and

QUERY PLAN
-------------------------------------------------------------------------
Group (cost=232886.15..235371.76 rows=98234 width=4)
(actual time=1519.545..2104.447 rows=99348 loops=1)
Group Key: a
-> Sort (cost=232886.15..234128.95 rows=497123 width=4)
(actual time=1519.540..1838.575 rows=500320 loops=1)
Sort Key: a
Sort Method: external merge Disk: 6824kB
-> Seq Scan on t (cost=0.00..179053.25 rows=497123 width=4)
(actual time=0.090..1044.099 rows=500320 loops=1)
Filter: (b < '0.05'::double precision)
Rows Removed by Filter: 9499680
Planning time: 0.133 ms
Execution time: 2147.340 ms
(10 rows)

So much better. Clearly, there are cases where this will over-estimate
the cardinality - for example when the values are somehow correlated.

But I'd argue over-estimating is better because of the OOM issues in
Hash Aggregate.

regards

[1]: http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

estimate-num-groups-v1.patch (binary/octet-stream)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..82a7f7f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				reldistinct = clamp;
 
 			/*
-			 * Multiply by restriction selectivity.
+			 * Estimate number of distinct values expected in given number of rows.
 			 */
-			reldistinct *= rel->rows / rel->tuples;
+			reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
 
 			/*
 			 * Update estimate of total distinct groups.
#2 Mark Dilger
hornschnorter@gmail.com
In reply to: Tomas Vondra (#1)
1 attachment(s)
Re: improving GROUP BY estimation

On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

<snip>

So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values are somehow correlated.

I applied your patch, which caused a few regression tests to fail. Attached
is a patch that includes the necessary changes to the expected test results.

It is not hard to write test cases where your patched version overestimates
the number of rows by a factor very similar to the one by which the old code
underestimates them. My very first test, which was not specifically designed
to demonstrate this, happens to be one such example:

CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=169250.21..169258.71 rows=850 width=4)
Group Key: a
-> Seq Scan on t (cost=0.00..169247.71 rows=1000 width=4)
Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
count
-------
32
(1 row)

So, it estimates 850 rows where only 32 are returned. Without applying your patch,
it estimates just 1 row where 32 are returned. That's an overestimate of roughly 26 times,
rather than an underestimate of 32 times.

As a patch review, I'd say that your patch does what you claim it does, and it applies
cleanly, and passes the regression tests with my included modifications. I think there
needs to be some discussion on the list about whether the patch is a good idea.

Mark Dilger

Attachments:

estimate-num-groups-v2.txt (text/plain)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..82a7f7f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				reldistinct = clamp;
 
 			/*
-			 * Multiply by restriction selectivity.
+			 * Estimate number of distinct values expected in given number of rows.
 			 */
-			reldistinct *= rel->rows / rel->tuples;
+			reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
 
 			/*
 			 * Update estimate of total distinct groups.
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 59d7877..d9dd5ca 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3951,17 +3951,17 @@ select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
               QUERY PLAN               
 ---------------------------------------
- Merge Left Join
-   Merge Cond: (d.a = s.id)
-   ->  Sort
-         Sort Key: d.a
-         ->  Seq Scan on d
+ Merge Right Join
+   Merge Cond: (s.id = d.a)
    ->  Sort
          Sort Key: s.id
          ->  Subquery Scan on s
                ->  HashAggregate
                      Group Key: b.id
                      ->  Seq Scan on b
+   ->  Sort
+         Sort Key: d.a
+         ->  Seq Scan on d
 (11 rows)
 
 -- similarly, but keying off a DISTINCT clause
@@ -3970,17 +3970,17 @@ select d.* from d left join (select distinct * from b) s
   on d.a = s.id;
                  QUERY PLAN                  
 ---------------------------------------------
- Merge Left Join
-   Merge Cond: (d.a = s.id)
-   ->  Sort
-         Sort Key: d.a
-         ->  Seq Scan on d
+ Merge Right Join
+   Merge Cond: (s.id = d.a)
    ->  Sort
          Sort Key: s.id
          ->  Subquery Scan on s
                ->  HashAggregate
                      Group Key: b.id, b.c_id
                      ->  Seq Scan on b
+   ->  Sort
+         Sort Key: d.a
+         ->  Seq Scan on d
 (11 rows)
 
 -- check join removal works when uniqueness of the join condition is enforced
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index de64ca7..0fc93d9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -807,27 +807,24 @@ select * from int4_tbl where
 explain (verbose, costs off)
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
-                              QUERY PLAN                              
-----------------------------------------------------------------------
- Hash Join
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join
    Output: o.f1
    Hash Cond: (o.f1 = "ANY_subquery".f1)
    ->  Seq Scan on public.int4_tbl o
          Output: o.f1
    ->  Hash
          Output: "ANY_subquery".f1, "ANY_subquery".g
-         ->  HashAggregate
+         ->  Subquery Scan on "ANY_subquery"
                Output: "ANY_subquery".f1, "ANY_subquery".g
-               Group Key: "ANY_subquery".f1, "ANY_subquery".g
-               ->  Subquery Scan on "ANY_subquery"
-                     Output: "ANY_subquery".f1, "ANY_subquery".g
-                     Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
-                     ->  HashAggregate
-                           Output: i.f1, (generate_series(1, 2) / 10)
-                           Group Key: i.f1
-                           ->  Seq Scan on public.int4_tbl i
-                                 Output: i.f1
-(18 rows)
+               Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+               ->  HashAggregate
+                     Output: i.f1, (generate_series(1, 2) / 10)
+                     Group Key: i.f1
+                     ->  Seq Scan on public.int4_tbl i
+                           Output: i.f1
+(15 rows)
 
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 016571b..f2e297e 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -263,16 +263,16 @@ ORDER BY 1;
 SELECT q2 FROM int8_tbl INTERSECT SELECT q1 FROM int8_tbl;
         q2        
 ------------------
- 4567890123456789
               123
+ 4567890123456789
 (2 rows)
 
 SELECT q2 FROM int8_tbl INTERSECT ALL SELECT q1 FROM int8_tbl;
         q2        
 ------------------
+              123
  4567890123456789
  4567890123456789
-              123
 (3 rows)
 
 SELECT q2 FROM int8_tbl EXCEPT SELECT q1 FROM int8_tbl ORDER BY 1;
@@ -305,16 +305,16 @@ SELECT q1 FROM int8_tbl EXCEPT SELECT q2 FROM int8_tbl;
 SELECT q1 FROM int8_tbl EXCEPT ALL SELECT q2 FROM int8_tbl;
         q1        
 ------------------
- 4567890123456789
               123
+ 4567890123456789
 (2 rows)
 
 SELECT q1 FROM int8_tbl EXCEPT ALL SELECT DISTINCT q2 FROM int8_tbl;
         q1        
 ------------------
+              123
  4567890123456789
  4567890123456789
-              123
 (3 rows)
 
 SELECT q1 FROM int8_tbl EXCEPT ALL SELECT q1 FROM int8_tbl FOR NO KEY UPDATE;
@@ -343,8 +343,8 @@ SELECT f1 FROM float8_tbl EXCEPT SELECT f1 FROM int4_tbl ORDER BY 1;
 SELECT q1 FROM int8_tbl INTERSECT SELECT q2 FROM int8_tbl UNION ALL SELECT q2 FROM int8_tbl;
         q1         
 -------------------
-  4567890123456789
                123
+  4567890123456789
                456
   4567890123456789
                123
@@ -355,15 +355,15 @@ SELECT q1 FROM int8_tbl INTERSECT SELECT q2 FROM int8_tbl UNION ALL SELECT q2 FR
 SELECT q1 FROM int8_tbl INTERSECT (((SELECT q2 FROM int8_tbl UNION ALL SELECT q2 FROM int8_tbl)));
         q1        
 ------------------
- 4567890123456789
               123
+ 4567890123456789
 (2 rows)
 
 (((SELECT q1 FROM int8_tbl INTERSECT SELECT q2 FROM int8_tbl))) UNION ALL SELECT q2 FROM int8_tbl;
         q1         
 -------------------
-  4567890123456789
                123
+  4567890123456789
                456
   4567890123456789
                123
@@ -419,8 +419,8 @@ HINT:  There is a column named "q2" in table "*SELECT* 2", but it cannot be refe
 SELECT q1 FROM int8_tbl EXCEPT (((SELECT q2 FROM int8_tbl ORDER BY q2 LIMIT 1)));
         q1        
 ------------------
- 4567890123456789
               123
+ 4567890123456789
 (2 rows)
 
 --
#3 Mark Dilger
hornschnorter@gmail.com
In reply to: Mark Dilger (#2)
Re: improving GROUP BY estimation

On Feb 25, 2016, at 3:16 PM, Mark Dilger <hornschnorter@gmail.com> wrote:

On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

<snip>

So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values are somehow correlated.

I applied your patch, which caused a few regression tests to fail. Attached
is a patch that includes the necessary changes to the expected test results.

It is not hard to write test cases where your patched version overestimates
the number of rows by a very similar factor as the old code underestimates
them. My very first test, which was not specifically designed to demonstrate
this, happens to be one such example:

CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=169250.21..169258.71 rows=850 width=4)
Group Key: a
-> Seq Scan on t (cost=0.00..169247.71 rows=1000 width=4)
Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
count
-------
32
(1 row)

So, it estimates 850 rows where only 32 are returned . Without applying your patch,
it estimates just 1 row where 32 are returned. That's an overestimate of roughly 26 times,
rather than an underestimate of 32 times.

As a patch review, I'd say that your patch does what you claim it does, and it applies
cleanly, and passes the regression tests with my included modifications. I think there
needs to be some discussion on the list about whether the patch is a good idea.

Mark Dilger

<estimate-num-groups-v2.txt>

Assuming the goal is to minimize the degree to which the estimates are inaccurate, I
get better results from a kind of averaging of the old values from the current code base
and the new values from Tomas's patch. I tried this, and at least for the few examples
we've been throwing around, the results were not as wildly inaccurate in the
worst-case examples as with either of the other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
                Assert(rel->reloptkind == RELOPT_BASEREL);
                if (rel->tuples > 0)
                {
+                       double          old_factor;             /* The way it is currently done on master */
+                       double          new_factor;             /* The way Tomas Vondra proposes doing it */
                        /*
                         * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
                         * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
                        /*
                         * Multiply by restriction selectivity.
                         */
-                       reldistinct *= rel->rows / rel->tuples;
+                       old_factor = rel->rows / rel->tuples;           /* old way */
+                       new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));  /* Tomas's way */
+
+                       reldistinct *= sqrt(old_factor * new_factor);   /* average of old way and new way, sort of */

/*
* Update estimate of total distinct groups.
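
For illustration, here is a small standalone sketch (not part of the patch)
of what this combined factor gives for Tomas's original 1% example; the
inputs (10M tuples, roughly 100k selected rows, ndistinct = 98882) are
approximations taken from that example:

/*
 * Illustration: the old factor, the new factor, and their geometric
 * mean, applied to ndistinct for the 1% example upthread.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  tuples = 10000000.0;    /* rows in the table */
    double  rows = 100000.0;        /* rows surviving the filter (~1%) */
    double  reldistinct = 98882.0;  /* per-column n_distinct */

    double  old_factor = rows / tuples;
    double  new_factor = 1 - powl((reldistinct - 1) / reldistinct, rows);
    double  combined = sqrt(old_factor * new_factor);

    printf("old: %.0f  new: %.0f  combined: %.0f\n",
           reldistinct * old_factor,
           reldistinct * new_factor,
           reldistinct * combined);
    /* prints approximately: old: 989  new: 62914  combined: 7887 */
    return 0;
}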


#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#2)
Re: improving GROUP BY estimation

Hi,

On 02/26/2016 12:16 AM, Mark Dilger wrote:

It is not hard to write test cases where your patched version overestimates
the number of rows by a very similar factor as the old code underestimates
them. My very first test, which was not specifically designed to demonstrate
this, happens to be one such example:

CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=169250.21..169258.71 rows=850 width=4)
Group Key: a
-> Seq Scan on t (cost=0.00..169247.71 rows=1000 width=4)
Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
count
-------
32
(1 row)

So, it estimates 850 rows where only 32 are returned . Without
applying your patch, it estimates just 1 row where 32 are returned.
That's an overestimate of roughly 26 times, rather than an
underestimate of 32 times.

The culprit here is that the two columns are not independent, but are
(rather strongly) correlated due to the way you've generated the data.

In cases like this (with correlated columns), it's mostly just luck how
accurate an estimate you get, no matter which of the estimators you use.
It's pretty easy to generate arbitrary errors by breaking the
independence assumption, and that's not limited to this particular place
in the planner. It won't change until we add some sort of smartness about
dependencies between columns.

I think over-estimates are a bit more acceptable in this case, e.g.
because of the HashAggregate OOM risk. Also, I've seen too many nested
loop cascades due to under-estimates recently, but that's a bit unrelated.

As a patch review, I'd say that your patch does what you claim it
does, and it applies cleanly, and passes the regression tests with
my included modifications. I think there needs to be some discussion
on the list about whether the patch is a good idea.

Yes, that's why I haven't bothered with fixing the regression tests in
the patch, actually.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#5 Mark Dilger
hornschnorter@gmail.com
In reply to: Tomas Vondra (#4)
Re: improving GROUP BY estimation

On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi,

On 02/26/2016 12:16 AM, Mark Dilger wrote:

It is not hard to write test cases where your patched version overestimates
the number of rows by a very similar factor as the old code underestimates
them. My very first test, which was not specifically designed to demonstrate
this, happens to be one such example:

CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=169250.21..169258.71 rows=850 width=4)
Group Key: a
-> Seq Scan on t (cost=0.00..169247.71 rows=1000 width=4)
Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
count
-------
32
(1 row)

So, it estimates 850 rows where only 32 are returned . Without
applying your patch, it estimates just 1 row where 32 are returned.
That's an overestimate of roughly 26 times, rather than an
underestimate of 32 times.

The culprit here is that the two columns are not independent, but are (rather strongly) correlated due to the way you've generated the data.

Yes, that was intentional. Your formula is exactly correct, so far as I can tell,
for completely uncorrelated data. I don't have any tables with completely
uncorrelated data, and was therefore interested in what happens when the
data is correlated and your patch is applied.

BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base. I'm just playing Devil's Advocate
here, to see if there is room for any improvement.

In cases like this (with correlated columns), it's mostly just luck how accurate estimates you get, no matter which of the estimators you use. It's pretty easy to generate arbitrary errors by breaking the independence assumption, and it's not just this particular place of the planner. And it won't change until we add some sort of smartness about dependencies between columns.

I think over-estimates are a bit more acceptable in this case, e.g. because of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades due to under-estimates recently, but that's a bit unrelated.

I have seen similar problems in systems I maintain, hence my interest
in your patch.

As a patch review, I'd say that your patch does what you claim it
does, and it applies cleanly, and passes the regression tests with
my included modifications. I think there needs to be some discussion
on the list about whether the patch is a good idea.

Yes, that's why I haven't bothered with fixing the regression tests in the patch, actually.

Right, but for the group as a whole, I thought it might generate some
feedback if people saw what else changed in the regression results.
If you look, the changes have to do with plans chosen and row estimates --
exactly the sort of stuff you would expect to change.

Thanks for the patch submission. I hope my effort to review it is on the
whole more helpful than harmful.

Mark Dilger


#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#5)
Re: improving GROUP BY estimation

Hi,

On 02/26/2016 04:32 AM, Mark Dilger wrote:

On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

...

The culprit here is that the two columns are not independent, but
are (rather strongly) correlated due to the way you've generated
the data.

Yes, that was intentional. Your formula is exactly correct, so far as
i can tell, for completely uncorrelated data. I don't have any tables
with completely uncorrelated data, and was therefore interested in
what happens when the data is correlated and your patch is applied.

BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base. I'm just playing Devil's Advocate
here, to see if there is room for any improvement.

Sure, that's how I understood it. I just wanted to point out the
correlation, as that might not have been obvious to others.

Thanks for the patch submission. I hope my effort to review it is on
the whole more helpful than harmful.

Thanks for the feedback!

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#7 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#6)
Re: improving GROUP BY estimation

Hi, Tomas!

I've assigned to review this patch.

I've checked the estimate-num-groups-v2.txt version by Mark Dilger.
It applies to head cleanly and passes the corrected regression tests.

About correlated/uncorrelated cases: I think the optimizer currently mostly
assumes all distributions to be independent, and I think we should follow
that assumption in this case too, until we have a fundamentally better
option (like your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List
*groupExprs, double input_rows,
reldistinct = clamp;

  /*
- * Multiply by restriction selectivity.
+ * Estimate number of distinct values expected in given number of rows.
  */
- reldistinct *= rel->rows / rel->tuples;
+ reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

/*
* Update estimate of total distinct groups.

I think we need way more explanation in the comments here (possibly with
external links). For now, it looks like a formula that appears from nowhere.

Also, note that rel->tuples is gone from the formula's parameters, so we
actually don't care how many tuples are in the table. This is because the
formula assumes that the same tuple could be selected multiple times. For
low numbers this can lead to some errors. Consider selecting 2 out of 3
distinct tuples: if you can't select the same tuple twice, you always end
up with 2 distinct tuples, but according to the formula you will select
5/3 on average. I think this error at small numbers is negligible,
especially because getting rid of it would require way more computation.
But it is worth mentioning in the comments.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#7)
Re: improving GROUP BY estimation

Hi,

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:

Hi, Tomas!

I've assigned to review this patch.

I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly, passes corrected regression tests.

About correlated/uncorrelated cases. I think now optimizer mostly assumes
all distributions to be independent. I think we should follow this
assumption in this case also until we have fundamentally better option (like
your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows,
reldistinct = clamp;

/*
-* Multiply by restriction selectivity.
+* Estimate number of distinct values expected in given number of rows.
*/
-reldistinct *= rel->rows / rel->tuples;
+reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

/*
* Update estimate of total distinct groups.

I think we need way more explanation in comments here (possibly with
external links). For now, it looks like formula which appears from nowhere.

I thought the details (particularly the link to stackexchange, discussing the
formula) would be enough, but let me elaborate.

The current formula

reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you select
1% of the table, the estimate says we'll see 1% of ndistinct values. But
clearly that's naive, because for example when you have 10k distinct values and
you select 10M rows, you should expect nearly all of them in the sample.

And that's what the formula does - it gives you the expected number of distinct
values, when selecting 'k' values from 'd' distinct values with replacements:

count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform distribution)
and that the probabilities do not change (that's the "with replacement").

I guess we could relax those assumptions and for example use the MCV statistics
to further improve the estimate, and also relax the 'with replacement' but
that'd make the formula way more complicated.

[1]: http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

Also, note that rel->tuples gone away from formula parameters. So, we
actually don't care about how may tuples are in the table. This is because
this formula assumes that same tuple could be selected multiple times. For
low numbers this can lead to some errors. Consider selecting 2 from 3
distinct tuples. If you can't select same tuple twice then you always
selecting 2 distinct tuples. But according to formula you will select 5/3 in
average. I think this error in small numbers is negligible, especially
because getting rid of this error would require way more computations. But
it worth mentioning in comments though.

Well, yeah. That's the consequence of 'with replacement' assumption.

I guess we could handle this somehow, though. For example we can compute the
minimum possible number of distinct values like this:

average_rows_per_value = (tuples / ndistinct);
min_estimate = ceil(rows / average_rows_per_value);

and use that as a minimum for the estimate. In the example you've given this
would say

average_rows_per_value = (3 / 3) = 1;
min_estimate = ceil(2 / 1) = 2;

So it serves as a correction for the 'with replacement' assumption. With only 2
distinct values we'd get this:

average_rows_per_value = (3 / 2) = 1.5;
min_estimate = ceil(2 / 1.5) = 2;

Of course, it's just a heuristic, so it may fail in some other cases.
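
Here is a minimal standalone sketch of how such a lower bound might be
combined with the new formula (an illustration only, not an actual patch;
the function and variable names are mine):

#include <math.h>

/*
 * Illustration: expected number of distinct values when picking 'rows'
 * rows out of 'tuples', with 'ndistinct' equally likely values, clamped
 * from below by the minimum possible number of distinct values.
 */
static double
estimate_distinct(double tuples, double rows, double ndistinct)
{
    /* "with replacement" estimate */
    double  estimate = ndistinct *
        (1 - powl((ndistinct - 1) / ndistinct, rows));

    /* each value occupies tuples/ndistinct rows on average, so at least
     * ceil(rows / (tuples/ndistinct)) distinct values must be present */
    double  avg_rows_per_value = tuples / ndistinct;
    double  min_estimate = ceil(rows / avg_rows_per_value);

    if (estimate < min_estimate)
        estimate = min_estimate;

    return estimate;
}

For the 2-of-3 example above this returns 2 instead of 5/3.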

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#9 Mark Dilger
hornschnorter@gmail.com
In reply to: Tomas Vondra (#8)
Re: improving GROUP BY estimation

On Mar 3, 2016, at 8:35 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi,

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:

Hi, Tomas!

I've assigned to review this patch.

I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly, passes corrected regression tests.

About correlated/uncorrelated cases. I think now optimizer mostly assumes
all distributions to be independent. I think we should follow this
assumption in this case also until we have fundamentally better option (like
your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows,
reldistinct = clamp;

/*
-* Multiply by restriction selectivity.
+* Estimate number of distinct values expected in given number of rows.
*/
-reldistinct *= rel->rows / rel->tuples;
+reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

/*
* Update estimate of total distinct groups.

I think we need way more explanation in comments here (possibly with
external links). For now, it looks like formula which appears from nowhere.

I thought the details (particularly the link to stackexchange, discussing the formula) would be enough, but let me elaborate.

The current formula

reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you select 1% of the table, the estimate says we'll see 1% of ndistinct values. But clearly that's naive, because for example when you have 10k distinct values and you select 10M rows, you should expect nearly all of them in the sample.

The current formula assumes total dependence between the
columns. You can create tables where this is true, and the
formula gives precisely the right estimate. I'm not claiming this
is common, or that it should be the default assumption, but it
is one end of the spectrum of possible correct estimates.

And that's what the formula does - it gives you the expected number of distinct values, when selecting 'k' values from 'd' distinct values with replacements:

count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform distribution) and that the probabilities do not change (that's the "with replacement").

The new formula assumes total independence between the
columns. You can likewise create tables where this is true,
and you did so upthread, and for those tables it gives precisely
the right estimate. This is the other end of the spectrum of
possible correct estimates.

In the absence of multivariate statistics, either because you
haven't committed that work yet, or because the multivariate
data the planner needs hasn't been collected, choosing an
estimate exactly in the middle of the spectrum (whatever
that would mean mathematically) would minimize the
maximum possible error in the estimate.

I have the sense that my point of view is in the minority, because
the expectation is that the problems caused by independence
assumptions will only be fixed when multivariate statistics are
implemented and available, and so we should just keep the
independence assumptions everywhere until that time. I
tend to disagree with that, on the grounds that even when that
work is finished, we'll never have complete coverage of every
possible set of columns and their degree of dependence.

Perhaps I am badly mistaken.

Can somebody more familiar with the issues than I am please
disabuse me of my wrongheadedness?

Mark Dilger


#10 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#8)
Re: improving GROUP BY estimation

On Thu, Mar 3, 2016 at 7:35 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:

I've assigned to review this patch.

I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly, passes corrected regression tests.

About correlated/uncorrelated cases. I think now optimizer mostly assumes
all distributions to be independent. I think we should follow this
assumption in this case also until we have fundamentally better option
(like
your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List
*groupExprs,
double input_rows,
reldistinct = clamp;

/*
-* Multiply by restriction selectivity.
+* Estimate number of distinct values expected in given number of rows.
*/
-reldistinct *= rel->rows / rel->tuples;
+reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

/*
* Update estimate of total distinct groups.

I think we need way more explanation in comments here (possibly with
external links). For now, it looks like formula which appears from
nowhere.

I thought the details (particularly the link to stackexchange, discussing
the formula) would be enough, but let me elaborate.

The current formula

reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you
select 1% of the table, the estimate says we'll see 1% of ndistinct values.
But clearly that's naive, because for example when you have 10k distinct
values and you select 10M rows, you should expect nearly all of them in the
sample.

And that's what the formula does - it gives you the expected number of
distinct values, when selecting 'k' values from 'd' distinct values with
replacements:

count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform
distribution) and that the probabilities do not change (that's the "with
replacement").

I guess we could relax those assumptions and for example use the MCV
statistics to further improve the estimate, and also relax the 'with
replacement' but that'd make the formula way more complicated.

[1]
http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

Your explanation in the first mail was good enough, and now it's even
better. But what I actually mean is that we need to include some brief
explanation in the source code (either comments or a README). It would be
good if someone who wants to understand this could do so without searching
the mailing list archives or git history.

Also, note that rel->tuples gone away from formula parameters. So, we
actually don't care about how may tuples are in the table. This is because
this formula assumes that same tuple could be selected multiple times. For
low numbers this can lead to some errors. Consider selecting 2 from 3
distinct tuples. If you can't select same tuple twice then you always
selecting 2 distinct tuples. But according to formula you will select 5/3
in
average. I think this error in small numbers is negligible, especially
because getting rid of this error would require way more computations. But
it worth mentioning in comments though.

Well, yeah. That's the consequence of 'with replacement' assumption.

I guess we could handle this somehow, though. For example we can compute
the minimum possible number of distinct values like this:

average_rows_per_value = (tuples / ndistinct);
min_estimate = ceil(rows / average_rows_per_value);

and use that as a minimum for the estimate. In the example you've given
this would say

average_rows_per_value = (3 / 3) = 1;
min_estimate = ceil(2 / 1) = 2;

So it serves as a correction for the 'with replacement' assumption. With
only 2 distinct values we'd get this:

average_rows_per_value = (3 / 2) = 1.5;
min_estimate = ceil(2 / 1.5) = 2;

Of course, it's just a heuristics, so this may fail in some other cases.

I'm not sure we actually need it. Even in the worst case the error doesn't
seem to be very big. But feel free to add this heuristic; it looks OK to me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#9)
Re: improving GROUP BY estimation

Hi,

On 03/03/2016 06:40 PM, Mark Dilger wrote:

On Mar 3, 2016, at 8:35 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi,

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:

Hi, Tomas!

I've assigned to review this patch.

I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly, passes corrected regression tests.

About correlated/uncorrelated cases. I think now optimizer mostly assumes
all distributions to be independent. I think we should follow this
assumption in this case also until we have fundamentally better option (like
your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows,
reldistinct = clamp;

/*
-* Multiply by restriction selectivity.
+* Estimate number of distinct values expected in given number of rows.
*/
-reldistinct *= rel->rows / rel->tuples;
+reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

/*
* Update estimate of total distinct groups.

I think we need way more explanation in comments here (possibly with
external links). For now, it looks like formula which appears from nowhere.

I thought the details (particularly the link to stackexchange, discussing
the formula) would be enough, but let me elaborate.

The current formula

reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you
select 1% of the table, the estimate says we'll see 1% of ndistinct
values. But clearly that's naive, because for example when you have 10k
distinct values and you select 10M rows, you should expect nearly all of
them in the sample.

The current formula assumes total dependence between the columns. You can
create tables where this is true, and the formula gives precisely the right
estimate. I'm not claiming this is common, or that it should be the default
assumption, but it is one end of the spectrum of possible correct
estimates.

I'm not really sure what you mean by dependence between columns, as both the
old and the new formula only work with the total cardinality and selectivity,
and have absolutely no idea about multiple columns.

In case you mean dependence between the columns in the GROUP BY and the columns
used to compute the selectivity (i.e. referenced in WHERE), then perhaps you
could say it like that, although I think there's no such explicit assumption.

And that's what the formula does - it gives you the expected number of
distinct values, when selecting 'k' values from 'd' distinct values with
replacements:

count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform
distribution) and that the probabilities do not change (that's the "with
replacement").

The new formula assumes total independence between the columns. You can
likewise create tables where this is true, and you did so upthread, and for
those tables it gives precisely the right estimate. This is the other end of
the spectrum of possible correct estimates.

In the absence of multivariate statistics, either because you haven't
committed that work yet, or because the multivariate data the planner needs
hasn't been collected, choosing an estimate exactly in the middle of the
spectrum (whatever that would mean mathematically) would minimize the
maximum possible error in the estimate.

FWIW the current version of multivariate statistics can't really help in this
case. Perhaps it will help in the future, but it's way more complicated than it
might seem, as it requires transferring information from the WHERE clause to the
GROUP BY clause.

I have the sense that my point of view is in the minority, because the
expectation is the problems caused by independence assumptions will only be
fixed when multivariate statistics are implemented and available, and so we
should just keep the independence assumptions everywhere until that time. I
tend to disagree with that, on the grounds that even when that work is
finished, we'll never have complete coverage of every possible set of
columns and what their degree of dependence is.

Well, in a sense you're right that those estimates work best in different
situations. The problem with constructing an estimator the way you propose
(simply returning an average) is that in both extreme cases (perfect
dependence or independence) one of the two estimates is really bad. Moreover,
the new formula typically produces higher values than the old one.

Consider for example this:

CREATE TABLE t AS SELECT (10000 * random())::int a,
(10000 * random())::int b
FROM generate_series(1,1000000) s(i);

and let's see estimates for queries like this:

SELECT a FROM t WHERE b < $1 GROUP BY a;

Then for different values of $1 we get this:

        |   10 |   50 |  100 |  500 |  1000 |  5000
--------------------------------------------------------
actual  |  919 | 3829 | 6244 | 9944 | 10001 | 10001
current |   10 |   50 |  102 |  516 |  1018 |  4996
new     |  973 | 4001 | 6382 | 9897 |  9951 |  9951
average |  491 | 2025 | 3205 | 5206 |  5484 |  7473

and for a table with perfectly dependent columns:

CREATE TABLE t AS SELECT mod(i,10000) a,
mod(i,10000) b
FROM generate_series(1,1000000) s(i);

we get this:

        |   10 |   50 |  100 |  500 |  1000 |  5000
--------------------------------------------------------
actual  |   10 |   50 |  100 |  500 |  1000 |  5000
current |   10 |   53 |  105 |  508 |  1016 |  5014
new     |  880 | 4105 | 6472 | 9955 | 10018 | 10018
average |  445 | 2079 | 3288 | 5231 |  5517 |  7516

So yes, each estimator works great for exactly the opposite case. But notice
that the results of the new formula are typically much higher than those of
the old one, sometimes by two orders of magnitude (and it shouldn't be
difficult to construct examples with much larger differences).

The table also includes the 'average' estimator you propose, but it's rather
obvious that the result is always much closer to the new value, simply because

(small number) + (huge number)
------------------------------
2

is always much closer to the huge number. We're usually quite happy when the
estimates are within the same order of magnitude, so whether it's K or K/2
makes pretty much no difference.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#12 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#11)
Re: improving GROUP BY estimation

On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

So yes, each estimator works great for exactly the opposite cases. But
notice that typically, the results of the new formula is much higher than
the old one, sometimes by two orders of magnitude (and it shouldn't be
difficult to construct examples of much higher differences).

The table also includes the 'average' estimator you propose, but it's
rather obvious that the result is always much closer to the new value,
simply because

(small number) + (huge number)
------------------------------
2

is always much closer to the huge number. We're usually quite happy when
the estimates are within the same order of magnitude, so whether it's K or
K/2 makes pretty much no difference.

I believe that Mark means the geometric average, i.e. sqrt((small number) *
(huge number)).

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#13 Mark Dilger
hornschnorter@gmail.com
In reply to: Alexander Korotkov (#12)
Re: improving GROUP BY estimation

On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
So yes, each estimator works great for exactly the opposite cases. But notice that typically, the results of the new formula is much higher than the old one, sometimes by two orders of magnitude (and it shouldn't be difficult to construct examples of much higher differences).

The table also includes the 'average' estimator you propose, but it's rather obvious that the result is always much closer to the new value, simply because

(small number) + (huge number)
------------------------------
2

is always much closer to the huge number. We're usually quite happy when the estimates are within the same order of magnitude, so whether it's K or K/2 makes pretty much no difference.

I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).

Yes, that is what I proposed upthread. I'm not wedded to that, though.
In general, I am with Tomas on this one, believing that his estimate
will be much better than the current estimate. But I believe the *best*
estimate will be somewhere between his and the current one, and I'm
fishing for any decent way of coming up with a weighted average that
is closer to his than to the current, but not simply equal to his proposal.

The reason I want the formula to be closer to Tomas's than to the
current is that I think that on average, across all tables, across all
databases, in practice it will be closer to the right estimate than the
current formula. That's just my intuition, and so I can't defend it.
But if my intuition is right, the best formula we can adopt would be one
that is moderated from his by a little bit, and in the direction of the
estimate that the current code generates.

I could easily lose this debate merely for lack of a principled basis
for saying how far toward the current estimate the new estimate should
be adjusted. The geometric average is one suggestion, but I don't have
a principled argument for it.

Like I said above, I'm fishing for a decent formula here.

Mark Dilger


#14 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#13)
Re: improving GROUP BY estimation

On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote:

On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
So yes, each estimator works great for exactly the opposite cases. But notice that typically, the results of the new formula is much higher than the old one, sometimes by two orders of magnitude (and it shouldn't be difficult to construct examples of much higher differences).

The table also includes the 'average' estimator you propose, but it's rather obvious that the result is always much closer to the new value, simply because

(small number) + (huge number)
------------------------------
2

is always much closer to the huge number. We're usually quite happy
when the estimates are within the same order of magnitude, so whether
it's K or K/2 makes pretty much no difference.

I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).

Ah, OK. I hadn't realized you meant the geometric mean. With that it looks
like this:

1) independent

10 50 100 500 1000 5000
------------------------------------------------------------------
actual 919 3829 6244 9944 10001 10001
current 10 50 102 516 1018 4996
new 973 4001 6382 9897 9951 9951
average 491 2025 3205 5206 5484 7473
geom. mean 99 447 807 2260 3183 7051

2) dependent

10 50 100 500 1000 5000
------------------------------------------------------------------
actual 10 50 100 500 1000 5000
current 10 53 105 508 1016 5014
new 880 4105 6472 9955 10018 10018
average 445 2079 3288 5231 5517 7516
geom. mean 94 466 824 2249 3190 7087

To better illustrate the differences, we may use the "ratio error"
defined as

err(a,b) = MAX(a/b, b/a)

where 'a' is the actual value and 'b' is the estimate. That gives us:

1) independent

10 50 100 500 1000 5000
------------------------------------------------------------------
current 91.90 76.58 61.22 19.27 9.82 2.00
new 1.06 1.04 1.02 1.00 1.01 1.01
average 1.87 1.89 1.95 1.91 1.82 1.34
geom. mean 9.32 8.56 7.74 4.40 3.14 1.42

2) dependent

10 50 100 500 1000 5000
------------------------------------------------------------------
current 1.00 1.06 1.05 1.02 1.02 1.00
new 88.00 82.10 64.72 19.91 10.02 2.00
average 44.50 41.58 32.88 10.46 5.52 1.50
geom. mean 9.38 9.33 8.24 4.50 3.19 1.42

So the geometric mean seems to be working better than plain average. But
I don't think we can simply conclude we should use the geometric mean of
the estimates, as it assumes the risks of over- and under-estimating the
cardinality are the same. Because they aren't.

Yes, that is what I proposed upthread. I'm not wedded to that, though.
In general, I am with Tomas on this one, believing that his estimate
will be much better than the current estimate. But I believe the *best*
estimate will be somewhere between his and the current one, and I'm
fishing for any decent way of coming up with a weighted average that
is closer to his than to the current, but not simply equal to his proposal.

The reason I want the formula to be closer to Tomas's than to the
current is that I think that on average, across all tables, across all
databases, in practice it will be closer to the right estimate than the
current formula. That's just my intuition, and so I can't defend it.
But if my intuition is right, the best formula we can adopt would be one
that is moderated from his by a little bit, and in the direction of the
estimate that the current code generates.

I think looking merely at what fraction of data sets (or queries) uses
dependent GROUP BY and WHERE clauses is not sufficient.

The general behavior of the two GROUP BY estimators is that the current
one tends to under-estimate, while the new one tends to over-estimate.
(It's not difficult to construct counter-examples by using more complex
dependencies between the columns etc. but let's ignore that.)

The risk associated with over-estimation is that we may switch from
HashAggregate to GroupAggregate, and generally select paths better
suited for large number of rows. Not great, because the plan may not be
optimal and thus run much slower - but that usually only happens on
small tables, because on large tables we would probably end up using
those same paths anyway.

With under-estimation, the main risks are much graver, as we may end up
using HashAggregate only to get killed by OOM. When we're lucky enough not
to hit that, in my experience this often leads to a cascade of NestedLoop
nodes processing orders of magnitude more tuples than expected. Awful.

So I think the risk is asymmetric, and that's why I like the new
estimator more, because it tends to over-estimate, but in a very
reasonable way.

I could easily lose this debate merely for lack of a principled basis
for saying how far toward the current estimate the new estimate should
be adjusted. The geometric average is one suggestion, but I don't have
a principled argument for it.

Like I said above, I'm fishing for a decent formula here.

Thanks for the valuable feedback!

Mark Dilger

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#15 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#14)
Re: improving GROUP BY estimation

On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

The risk associated with over-estimation is that we may switch from
HashAggregate to GroupAggregate, and generally select paths better
suited for large number of rows. Not great, because the plan may not be
optimal and thus run much slower - but that usually only happens on
small tables, because on large tables we would probably end up using
those same paths anyway.

With under-estimation, the main risks are much graver, as we may end up
using HashAggregate only to get killed by OOM. When we're lucky not to
hit that, my experience this often leads to a cascade of NestedLoop
nodes processing orders of magnitude more tuples than expected. Awful.

So I think the risk is asymmetric, and that's why I like the new
estimator more, because it tends to over-estimate, but in a very
reasonable way.

Agreed. Under-estimating is worse than over-estimating.

-           reldistinct *= rel->rows / rel->tuples;
+           reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

Looking at this, I agree that this new formula from [1]
(http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers)
is generally better than the old one in most (but not all) cases.

One problem with it is that it no longer takes into account
rel->tuples, which means that when returning all the tuples from the
table, it no longer gives an estimate equal to the total number of
distinct values in the table. In fact it tends to underestimate when
returning nearly all the rows from the table.

The old formula is clearly not a good general-purpose estimate.
However, there is one case where it does return an accurate estimate
-- when all the values in the underlying table are distinct. In this
case the new formula consistently produces underestimates, while the
old formula works well. For example:

CREATE TABLE t AS SELECT (100000000 * random())::int a,
i::int b
FROM generate_series(1,1000000) s(i);
ANALYSE t;

EXPLAIN ANALYSE
SELECT a FROM t GROUP BY a;

So there are clearly around 1 million distinct values for "a", but the
new formula gives an estimate of around 630,000. That's not a massive
underestimate, but it's sufficiently obvious and visible that it would
be good to do better if we can.
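
To see where the ~630,000 figure comes from, here is a tiny standalone C
sketch (just an illustration, not code from the patch) that evaluates the
v1 formula n * (1 - (1 - 1/n)^p) at round numbers n = p = 1,000,000 rather
than the exact ANALYZE estimates; the result approaches n * (1 - 1/e):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1000000.0;       /* distinct values in the table */
    double p = 1000000.0;       /* rows selected (here: all of them) */

    /* v1 formula: n * (1 - (1 - 1/n)^p) */
    double estimate = n * (1.0 - pow(1.0 - 1.0 / n, p));

    printf("%.0f\n", estimate); /* prints ~632121, i.e. roughly 630k */
    return 0;
}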

I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));

This comes from [2], which gives a general formula for selection
without replacement, and then gives the above as an approximation to
the uniform distribution case. It's not really made clear in that
paper, but that approximation is actually the "with replacement"
approximation, but starting from different initial assumptions to give
a formula that has better boundary conditions and special-case
behaviour, as described below.

For the record, here's a quick analysis of how the 2 formulae come about:

Assume there are:
N rows in the table
n distinct values (n <= N)
p rows are chosen at random (p <= N)

so the problem is to estimate how many distinct values there will be
in the p rows chosen. Both approaches work by first estimating the
probability that some particular value x is *not* chosen.

The approach from [1] assumes that each of the p rows chosen has a
probability of 1/n of being equal to x. So the probability that x is
not the first value chosen is

P(not x, 1) = 1 - 1/n

Then, if the value chosen first is replaced, the probability that x is
not the second value chosen is the same. The "with replacement"
approximation makes each choice an independent probability, so the
overall probability that x is not in any of the p rows chosen is just
the product of the p individual probabilities, which is just

P(not x, p) = (1 - 1/n)^p

Then of course the probability that x *is* chosen at least once in the p rows is

P(x, p) = 1 - (1 - 1/n)^p

Finally, since there are n possible values of x, and they're all
equally likely in the table, the expected number of distinct values is

D(p) = n * (1 - (1 - 1/n)^p)

The flaw in this approach is that for large p the "with replacement"
approximation gets worse and worse, and the probabilities P(x, p)
systematically under-estimate the actual probability that x is chosen,
which increases as more values not equal to x are chosen. By the time
the last value is chosen P(x, p=N) should actually be 1.

The approach from [2] starts from a different assumption (the uniform
distribution case) -- for any given value x, assume that the table
contains N/n instances of x (ignoring the fact that that might not be
an integer).
Note that with this assumption there is guaranteed to be at least one
instance of x, which is not the case with the above approach.

Now consider the first instance of x in the table. If we choose p rows
from the table, the probability that that first instance of x is not
chosen is

P(not x[1], p) = 1 - p / N

Making the same "with replacement" simplification, the probability
that the second instance of x is not chosen is the same, and they're
independent probabilities again. So, if there are N/n instances of x,
the overall probability that none of them is chosen is

P(not x, p) = (1 - p / N)^(N/n)

So then the formula for the expected number of distinct values becomes

D(p) = n * (1 - (1 - p / N)^(N/n))

This alternate formula has a few nice properties:

1). D(p=0) = 0

2). D(p=N) = n (choosing all the rows results in all the distinct
values being chosen)

3). When n=N, D(p) = p (when all the values in the table are distinct,
so are all the values chosen). This case matches the current formula.

The first formula for D(p) lacks properties (2) and (3).
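
Just to sanity-check those three properties numerically, here is a short
standalone C sketch (illustration only, with arbitrary example values for
N, n and p):

#include <math.h>
#include <stdio.h>

/* D(p) = n * (1 - (1 - p/N)^(N/n)), the alternate formula above */
static double distinct_est(double N, double n, double p)
{
    return n * (1.0 - pow(1.0 - p / N, N / n));
}

int main(void)
{
    double N = 1000000.0;

    /* property 1: selecting no rows yields no distinct values */
    printf("%g\n", distinct_est(N, 500000.0, 0.0));     /* 0 */

    /* property 2: selecting all rows yields all n distinct values */
    printf("%g\n", distinct_est(N, 500000.0, N));       /* 500000 */

    /* property 3: when n = N, the estimate equals the rows selected */
    printf("%g\n", distinct_est(N, N, 123456.0));       /* 123456 */

    return 0;
}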

The reason this second formula generally works better is that the
"with replacement" approximation is only being made for up to (N/n)
selections, rather than up to N selections. So it remains a good
approximation as long as (N/n) is small compared to N, i.e., as long
as n is fairly large. In fact even for n=1 or 2, it still works
adequately if the results are rounded to the nearest integer.

The following snippet can be used in gnuplot to visualise the results
for various values of n and N. I've included the exact "without
replacement" formula too, by rewriting it in terms of the gamma
function:

N=1000000.0; n=500000.0; plot [p=0:N] [0:n] p*n/N, n*(1-(1-1/n)**p),
n*(1-(1-p/N)**(N/n)), n*(1 - exp(lgamma(N-N/n+1) + lgamma(N-p+1) -
lgamma(N+1) - lgamma(N-N/n-p+1)))

For most values of N and n, the approximation from [2] is almost
indistinguishable from the exact formula, whereas the formula from [1]
tends to underestimate when selecting a large number of distinct
values (e.g., try setting n=900000 in the plot above).
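
For anyone without gnuplot at hand, the same comparison can be reproduced
with a short standalone C program (an illustration using the same lgamma
trick; N and n are just example values matching the plot above):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 1000000.0;       /* rows in the table */
    double n = 900000.0;        /* distinct values */

    for (double p = 100000.0; p <= N; p += 100000.0)
    {
        double d1 = n * (1.0 - pow(1.0 - 1.0 / n, p));
        double d2 = n * (1.0 - pow(1.0 - p / N, N / n));

        /* exact "without replacement" value via the gamma function;
         * once p > N - N/n the value x is always chosen, so the
         * probability of missing it is simply zero */
        double miss = (p <= N - N / n) ?
            exp(lgamma(N - N / n + 1) + lgamma(N - p + 1) -
                lgamma(N + 1) - lgamma(N - N / n - p + 1)) : 0.0;
        double exact = n * (1.0 - miss);

        printf("p=%7.0f  [1]=%8.0f  [2]=%8.0f  exact=%8.0f\n",
               p, d1, d2, exact);
    }
    return 0;
}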

Regards,
Dean

[1]: http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers
[2]: http://www.adellera.it/investigations/distinct_balls/distinct_balls.pdf


#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#15)
Re: improving GROUP BY estimation

Hi,

On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:

On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

The risk associated with over-estimation is that we may switch from
HashAggregate to GroupAggregate, and generally select paths better
suited for a large number of rows. Not great, because the plan may not
be optimal and thus run much slower - but that usually only happens on
small tables, because on large tables we would probably end up using
those same paths anyway.

With under-estimation, the main risks are much graver, as we may end
up using HashAggregate only to get killed by OOM. When we're lucky not
to hit that, in my experience this often leads to a cascade of
NestedLoop nodes processing orders of magnitude more tuples than
expected. Awful.

So I think the risk is asymmetric, and that's why I like the new
estimator more, because it tends to over-estimate, but in a very
reasonable way.

Agreed. Under-estimating is worse than over-estimating.

-           reldistinct *= rel->rows / rel->tuples;
+           reldistinct *= (1 - powl((reldistinct - 1) / reldistinct,
rel->rows));

Looking at this, I agree that this new formula from [1] is generally
better than the old one in most (but not all) cases.

One problem with it is that it no longer takes into account
rel->tuples, which means that when returning all the tuples from the
table, it no longer gives an estimate equal to the total number of
distinct values in the table. In fact it tends to underestimate when
returning nearly all the rows from the table.

Yes, that's a good point.

The old formula is clearly not a good general-purpose estimate.
However, there is one case where it does return an accurate estimate
-- when all the values in the underlying table are distinct. In this
case the new formula consistently produces underestimates, while the
old formula works well. For example:

CREATE TABLE t AS SELECT (100000000 * random())::int a,
                         i::int b
                    FROM generate_series(1,1000000) s(i);
ANALYSE t;

EXPLAIN ANALYSE
SELECT a FROM t GROUP BY a;

So there are clearly around 1 million distinct values for "a", but
the new formula gives an estimate of around 630,000. That's not a
massive underestimate, but it's sufficiently obvious and visible that
it would be good to do better if we can.

I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples /
reldistinct));

This comes from [2], which gives a general formula for selection
without replacement, and then gives the above as an approximation to
the uniform distribution case. It's not really made clear in that
paper, but that approximation is actually the "with replacement"
approximation, but starting from different initial assumptions to
give a formula that has better boundary conditions and special-case
behaviour, as described below.

...

For most values of N and n, the approximation from [2] is almost
indistinguishable from the exact formula, whereas the formula from
[1] tends to underestimate when selecting a large number of distinct
values (e.g., try setting n=900000 in the plot above).

Yes, I agree that formula you propose seems to behave better.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#16)
1 attachment(s)
Re: improving GROUP BY estimation

On 03/13/2016 11:09 PM, Tomas Vondra wrote:

Hi,

On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:

On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

...

I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples /
reldistinct));

Attached is a v3 of the patch using this formula instead of the original
one. Interestingly, that apparently reduces the number of regression
tests that get broken to a single one.

I'm not sure whether we need to provide a link to the PDF the formula
comes from - perhaps we should?

I've also repeated the tests for the two tables (dependent and
independent columns), comparing the actual number of groups and
different estimates, and the results look like this (v3 is the formula
used in this patch):

1) independent

          |    10 |    50 |   100 |   500 |  1000 |  5000
----------------------------------------------------------
 actual   |   919 |  3829 |  6244 |  9944 | 10001 | 10001
 current  |    10 |    50 |   102 |   516 |  1018 |  4996
 new (v1) |   973 |  4001 |  6382 |  9897 |  9951 |  9951
 new (v3) |  1117 |  3852 |  6229 |  9943 | 10004 | 10004

2) dependent

          |    10 |    50 |   100 |   500 |  1000 |  5000
----------------------------------------------------------
 actual   |    10 |    50 |   100 |   500 |  1000 |  5000
 current  |    10 |    53 |   105 |   508 |  1016 |  5014
 new (v1) |   880 |  4105 |  6472 |  9955 | 10018 | 10018
 new (v3) |   807 |  3680 |  6050 |  9916 |  9983 |  9983

I only collected numbers for the new estimator; the other numbers are
just a copy from the previous message. So there might be minor
differences due to slightly different ndistinct estimates etc.

Anyway, the numbers are obviously quite close to the formula from v1 of
the patch, plus the formula gives better estimates when scanning nearly
all rows.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

estimate-num-groups-v3.patch (binary/octet-stream)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d396ef1..3630469 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3439,9 +3439,12 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				reldistinct = clamp;
 
 			/*
-			 * Multiply by restriction selectivity.
+			 * Estimate the number of groups observed in the expected number of
+			 * rows, using a formula for selection without replacement, assuming
+			 * uniform distribution.
 			 */
-			reldistinct *= rel->rows / rel->tuples;
+			reldistinct *= (1 - powl(1 - rel->rows / rel->tuples,
+									 rel->tuples / reldistinct));
 
 			/*
 			 * Update estimate of total distinct groups.
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index de64ca7..0fc93d9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -807,27 +807,24 @@ select * from int4_tbl where
 explain (verbose, costs off)
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
-                              QUERY PLAN                              
-----------------------------------------------------------------------
- Hash Join
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join
    Output: o.f1
    Hash Cond: (o.f1 = "ANY_subquery".f1)
    ->  Seq Scan on public.int4_tbl o
          Output: o.f1
    ->  Hash
          Output: "ANY_subquery".f1, "ANY_subquery".g
-         ->  HashAggregate
+         ->  Subquery Scan on "ANY_subquery"
                Output: "ANY_subquery".f1, "ANY_subquery".g
-               Group Key: "ANY_subquery".f1, "ANY_subquery".g
-               ->  Subquery Scan on "ANY_subquery"
-                     Output: "ANY_subquery".f1, "ANY_subquery".g
-                     Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
-                     ->  HashAggregate
-                           Output: i.f1, (generate_series(1, 2) / 10)
-                           Group Key: i.f1
-                           ->  Seq Scan on public.int4_tbl i
-                                 Output: i.f1
-(18 rows)
+               Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+               ->  HashAggregate
+                     Output: i.f1, (generate_series(1, 2) / 10)
+                     Group Key: i.f1
+                     ->  Seq Scan on public.int4_tbl i
+                           Output: i.f1
+(15 rows)
 
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
#18Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#17)
Re: improving GROUP BY estimation

On 18 March 2016 at 00:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:

I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples /
reldistinct));

Attached is a v3 of the patch using this formula instead of the original
one. Interestingly, that apparently reduces the number of regression tests
that get broken to a single one.

Cool.

I'm not sure whether we need to provide a link to the PDF the formula comes
from - perhaps we should?

Probably a better URL to give is
http://www.adellera.it/investigations/distinct_balls/ which has a link
to the PDF version of the paper and also some supporting material.

However, while that paper is in general very clear, I don't think it
gives a very clear explanation of that particular formula, and it
doesn't explain what it represents. It merely says that if "i" can be
ignored "for some reason (e.g. i << Nr)", then that formula is an
approximation to the exact "without replacement" formula, which is the
subject of that paper.

But actually, that formula is the exact formula for the expected
number of distinct values when selecting *with replacement* from a
collection where each of the distinct values occurs an equal number of
times. So I think we should say that.

Perhaps something along the lines of:

/*
* Update the estimate based on the restriction selectivity.
*
* This uses the formula for the expected number of distinct values
* when selecting with replacement from a collection where each of
* the distinct values occurs an equal number of times (a uniform
* distribution of values). This is a very close approximation to
* the more correct method of selecting without replacement when
* the number of distinct values is quite large --- for example,
* see http://www.adellera.it/investigations/distinct_balls/.
* Additionally, it also turns out to work very well even when the
* number of distinct values is small.
*/

Regards,
Dean


#19Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Dean Rasheed (#18)
Re: improving GROUP BY estimation

Hi, Dean!

On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed <dean.a.rasheed@gmail.com>
wrote:

Probably a better URL to give is
http://www.adellera.it/investigations/distinct_balls/ which has a link
to the PDF version of the paper and also some supporting material.

However, while that paper is in general very clear, I don't think it
gives a very clear explanation of that particular formula, and it
doesn't explain what it represents. It merely says that if "i" can be
ignored "for some reason (e.g. i << Nr)", then that formula is an
approximation to the exact "without replacement" formula, which is the
subject of that paper.

But actually, that formula is the exact formula for the expected
number of distinct values when selecting *with replacement* from a
collection where each of the distinct values occurs an equal number of
times. So I think we should say that.

Perhaps something along the lines of:

/*
* Update the estimate based on the restriction selectivity.
*
* This uses the formula for the expected number of distinct values
* when selecting with replacement from a collection where each of
* the distinct values occurs an equal number of times (a uniform
* distribution of values). This is a very close approximation to
* the more correct method of selecting without replacement when
* the number of distinct values is quite large --- for example,
* see http://www.adellera.it/investigations/distinct_balls/.
* Additionally, it also turns out to work very well even when the
* number of distinct values is small.
*/

+1
Thank you for work on this patch. The formula you propose and explanation
look great!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#20Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#19)
Re: improving GROUP BY estimation

Hi, Tomas!

On Mon, Mar 21, 2016 at 2:39 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed <dean.a.rasheed@gmail.com>
wrote:

Probably a better URL to give is
http://www.adellera.it/investigations/distinct_balls/ which has a link
to the PDF version of the paper and also some supporting material.

However, while that paper is in general very clear, I don't think it
gives a very clear explanation of that particular formula, and it
doesn't explain what it represents. It merely says that if "i" can be
ignored "for some reason (e.g. i << Nr)", then that formula is an
approximation to the exact "without replacement" formula, which is the
subject of that paper.

But actually, that formula is the exact formula for the expected
number of distinct values when selecting *with replacement* from a
collection where each of the distinct values occurs an equal number of
times. So I think we should say that.

Perhaps something along the lines of:

/*
* Update the estimate based on the restriction selectivity.
*
* This uses the formula for the expected number of distinct values
* when selecting with replacement from a collection where each of
* the distinct values occurs an equal number of times (a uniform
* distribution of values). This is a very close approximation to
* the more correct method of selecting without replacement when
* the number of distinct values is quite large --- for example,
* see http://www.adellera.it/investigations/distinct_balls/.
* Additionally, it also turns out to work very well even when the
* number of distinct values is small.
*/

+1
Thank you for work on this patch. The formula you propose and explanation
look great!

I think you should send a revision of the patch including the comments
proposed by Dean Rasheed.
I'm switching patch status to waiting on author in commitfest.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#21David Steele
david@pgmasters.net
In reply to: Alexander Korotkov (#20)
Re: improving GROUP BY estimation

Hi Tomas,

On 3/22/16 10:40 AM, Alexander Korotkov wrote:

I think you should send a revision of the patch including the comments
proposed by Dean Rasheed.
I'm switching patch status to waiting on author in commitfest.

We're getting pretty close to the end of the CF. Do you know when you
will have a new patch ready?

Thanks,
--
-David
david@pgmasters.net


#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#20)
1 attachment(s)
Re: improving GROUP BY estimation

Hi,

On 03/22/2016 03:40 PM, Alexander Korotkov wrote:

I think you should send a revision of the patch including the comments
proposed by Dean Rasheed.
I'm switching patch status to waiting on author in commitfest.

Attached is v4 of the patch - the only difference w.r.t. v3 is that I've
used the comment as proposed by Dean.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

estimate-num-groups-v4.patch (binary/octet-stream)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index a6555e9..00755da 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3439,9 +3439,20 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				reldistinct = clamp;
 
 			/*
-			 * Multiply by restriction selectivity.
+			 * Update the estimate based on the restriction selectivity.
+			 *
+			 * This uses the formula for the expected number of distinct values
+			 * when selecting with replacement from a collection where each of
+			 * the distinct values occurs an equal number of times (a uniform
+			 * distribution of values). This is a very close approximation to
+			 * the more correct method of selecting without replacement when
+			 * the number of distinct values is quite large --- for example,
+			 * see http://www.adellera.it/investigations/distinct_balls/.
+			 * Additionally, it also turns out to work very well even when the
+			 * number of distinct values is small.
 			 */
-			reldistinct *= rel->rows / rel->tuples;
+			reldistinct *= (1 - powl(1 - rel->rows / rel->tuples,
+									 rel->tuples / reldistinct));
 
 			/*
 			 * Update estimate of total distinct groups.
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index de64ca7..0fc93d9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -807,27 +807,24 @@ select * from int4_tbl where
 explain (verbose, costs off)
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
-                              QUERY PLAN                              
-----------------------------------------------------------------------
- Hash Join
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join
    Output: o.f1
    Hash Cond: (o.f1 = "ANY_subquery".f1)
    ->  Seq Scan on public.int4_tbl o
          Output: o.f1
    ->  Hash
          Output: "ANY_subquery".f1, "ANY_subquery".g
-         ->  HashAggregate
+         ->  Subquery Scan on "ANY_subquery"
                Output: "ANY_subquery".f1, "ANY_subquery".g
-               Group Key: "ANY_subquery".f1, "ANY_subquery".g
-               ->  Subquery Scan on "ANY_subquery"
-                     Output: "ANY_subquery".f1, "ANY_subquery".g
-                     Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
-                     ->  HashAggregate
-                           Output: i.f1, (generate_series(1, 2) / 10)
-                           Group Key: i.f1
-                           ->  Seq Scan on public.int4_tbl i
-                                 Output: i.f1
-(18 rows)
+               Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+               ->  HashAggregate
+                     Output: i.f1, (generate_series(1, 2) / 10)
+                     Group Key: i.f1
+                     ->  Seq Scan on public.int4_tbl i
+                           Output: i.f1
+(15 rows)
 
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
#23Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#22)
Re: improving GROUP BY estimation

On 30 March 2016 at 14:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Attached is v4 of the patch

Thanks, I think this is good to go, except that I think we need to use
pow() rather than powl() because AIUI powl() is new to C99, and so
won't necessarily be available on all supported platforms. I don't
think we need worry about loss of precision, since that would only be
an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or
so, and it seems unlikely we'll get anywhere near that any time soon.

I think this is a good, well thought-out change, so unless anyone
objects I'll push it (probably this weekend).

Regards,
Dean


#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#23)
Re: improving GROUP BY estimation

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 30 March 2016 at 14:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Attached is v4 of the patch

Thanks, I think this is good to go, except that I think we need to use
pow() rather than powl() because AIUI powl() is new to C99, and so
won't necessarily be available on all supported platforms. I don't
think we need worry about loss of precision, since that would only be
an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or
so, and it seems unlikely we'll get anywhere near that any time soon.

I took a quick look. I concur with using pow() not powl(); the latter
is not in SUS v2 which is our baseline portability expectation, and in
fact there is *noplace* where we expect long double to work. Moreover,
I don't believe that any of the estimates we're working with are so
accurate that a double-width power result would be a useful improvement.

Also, I wonder if it'd be a good idea to provide a guard against division
by zero --- we know rel->tuples > 0 at this point, but I'm less sure that
reldistinct can't be zero. In the same vein, I'm worried about the first
argument of pow() being slightly negative due to roundoff error, leading
to a NaN result.
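
For illustration, the NaN hazard is easy to reproduce in a trivial
standalone check (not code from the patch; the value of the base is just
a stand-in for a roundoff error):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* a base that should be 0 but went slightly negative via roundoff,
     * combined with a non-integer exponent, makes pow() return NaN */
    double base = -1e-17;

    printf("%g\n", pow(base, 2.5));     /* nan */
    printf("%g\n", pow(base, 3.0));     /* fine: integer exponent */
    return 0;
}

In the patch the exponent (rel->tuples / reldistinct) will generally not
be an integer, so a slightly negative base really would produce NaN.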

Maybe we should also consider clamping the final reldistinct estimate to
an integer with clamp_row_est(). The existing code doesn't do that but
it seems like a good idea on general principles.

Another minor gripe is the use of a random URL as justification. This
code will still be around when that URL exists nowhere but the Wayback
Machine. Can't we find a more formal citation to use?

regards, tom lane


#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#24)
Re: improving GROUP BY estimation

Tom Lane wrote:

Another minor gripe is the use of a random URL as justification. This
code will still be around when that URL exists nowhere but the Wayback
Machine. Can't we find a more formal citation to use?

The article text refers to this 1977 S. B. Yao paper "Approximating
block accesses in database organizations" which doesn't appear to be
available online, except behind ACM's paywall at
http://dl.acm.org/citation.cfm?id=359475

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#25)
Re: improving GROUP BY estimation

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

Another minor gripe is the use of a random URL as justification. This
code will still be around when that URL exists nowhere but the Wayback
Machine. Can't we find a more formal citation to use?

The article text refers to this 1977 S. B. Yao paper "Approximating
block accesses in database organizations" which doesn't appear to be
available online, except behind ACM's paywall at
http://dl.acm.org/citation.cfm?id=359475

Well, a CACM citation is perfectly fine by my lights (especially one
that's that far back and therefore certainly patent-free ...)

Let's use something like this:

See "Approximating block accesses in database organizations", S. B. Yao,
Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

regards, tom lane


#27Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#24)
Re: improving GROUP BY estimation

On 31 March 2016 at 20:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, I wonder if it'd be a good idea to provide a guard against division
by zero --- we know rel->tuples > 0 at this point, but I'm less sure that
reldistinct can't be zero. In the same vein, I'm worried about the first
argument of pow() being slightly negative due to roundoff error, leading
to a NaN result.

Yeah, that makes sense. In fact, if we only apply the adjustment when
reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first
argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it
is guaranteed to be non-negative. If rel->rows >= rel->tuples (not
sure if it can be greater), then we just want the original
reldistinct.

Maybe we should also consider clamping the final reldistinct estimate to
an integer with clamp_row_est(). The existing code doesn't do that but
it seems like a good idea on general principles.

OK, that seems sensible.

Regards,
Dean


#28Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#26)
Re: improving GROUP BY estimation

On 31 March 2016 at 21:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

Another minor gripe is the use of a random URL as justification. This
code will still be around when that URL exists nowhere but the Wayback
Machine. Can't we find a more formal citation to use?

The article text refers to this 1977 S. B. Yao paper "Approximating
block accesses in database organizations" which doesn't appear to be
available online, except behind ACM's paywall at
http://dl.acm.org/citation.cfm?id=359475

Well, a CACM citation is perfectly fine by my lights (especially one
that's that far back and therefore certainly patent-free ...)

Let's use something like this:

See "Approximating block accesses in database organizations", S. B. Yao,
Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

Sounds good.

Regards,
Dean


#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#26)
Re: improving GROUP BY estimation

Tom Lane wrote:

The article text refers to this 1977 S. B. Yao paper "Approximating
block accesses in database organizations" which doesn't appear to be
available online, except behind ACM's paywall at
http://dl.acm.org/citation.cfm?id=359475

Well, a CACM citation is perfectly fine by my lights (especially one
that's that far back and therefore certainly patent-free ...)

Let's use something like this:

See "Approximating block accesses in database organizations", S. B. Yao,
Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

That sounds nice all right, but I'm not sure it's actually helpful,
because the article text is not available anywhere. I doubt most people
will spend 15 bucks to buy that paper ... so we don't actually know
whether the paper supports the chosen formula :-) unless you have a
CACM subscription and can verify it.

I think it's good to have the ACM reference anyway, for posterity, but
it'd be good to (additionally) have something that people can read.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#28)
Re: improving GROUP BY estimation

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 31 March 2016 at 21:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Let's use something like this:
See "Approximating block accesses in database organizations", S. B. Yao,
Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

Actually, having now looked at both the Dellera paper and the Yao one,
what the latter shows seems to be equivalent to Dellera's equation (2)
(the first one in his section 2.2). But what the code is actually
using is the approximation in the second equation in 2.2, and that
one is certainly not in Yao, although it's not hard to see how you
get from the first to the second if you assume i << Nr.

I think it'd be worth writing out equation (2), attributing it to the Yao
paper, and then saying something like "Alberto Dellera points out that if
Nd is large, so that all the values of i are much less than Nr, then this
formula can be approximated by [the formula used in the code]. It turns
out that that formula also works well even when Nd is small."

regards, tom lane


#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#27)
Re: improving GROUP BY estimation

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

Yeah, that makes sense. In fact, if we only apply the adjustment when
reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first
argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it
is guaranteed to be non-negative. If rel->rows >= rel->tuples (not
sure if it can be greater), then we just want the original
reldistinct.

Works for me.

regards, tom lane


#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#29)
Re: improving GROUP BY estimation

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

See "Approximating block accesses in database organizations", S. B. Yao,
Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

That sounds nice all right, but I'm not sure it's actually helpful,
because the article text is not available anywhere. I doubt most people
will spend 15 bucks to buy that paper ... so we don't actually know
whether the paper supports the chosen formula :-) unless you have a
CACM subscription and can verify it.

Well, I do and I did, see my previous response.

I think it's good to have the ACM reference anyway, for posterity, but
it'd be good to (additionally) have something that people can read.

I'm just concerned about what happens when the Dellera paper stops being
available. I don't mind including that URL as a backup to the written-out
argument I just suggested.

regards, tom lane


#33Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#32)
1 attachment(s)
Re: improving GROUP BY estimation

On 31 March 2016 at 22:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm just concerned about what happens when the Dellera paper stops being
available. I don't mind including that URL as a backup to the written-out
argument I just suggested.

Here's an updated patch with references to both papers, and a more
detailed justification for the formula, along with the other changes
discussed. Note that although equation (2) in the Dell'Era paper looks
different from the Yao formula, it's actually the same.

Regards,
Dean

Attachments:

estimate-num-groups-v5.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index a6555e9..99f5f7c
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3439,9 +3439,51 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				reldistinct = clamp;
 
 			/*
-			 * Multiply by restriction selectivity.
+			 * Update the estimate based on the restriction selectivity,
+			 * guarding against division by zero when reldistinct is zero.
+			 * Also skip this if we know that we are returning all rows.
 			 */
-			reldistinct *= rel->rows / rel->tuples;
+			if (reldistinct > 0 && rel->rows < rel->tuples)
+			{
+				/*
+				 * Given a table containing N rows with n distinct values in a
+				 * uniform distribution, if we select p rows at random then
+				 * the expected number of distinct values selected is
+				 *
+				 * n * (1 - product((N-N/n-i)/(N-i), i=0..p-1))
+				 *
+				 * = n * (1 - (N-N/n)! / (N-N/n-p)! * (N-p)! / N!)
+				 *
+				 * See "Approximating block accesses in database
+				 * organizations", S. B. Yao, Communications of the ACM,
+				 * Volume 20 Issue 4, April 1977 Pages 260-261.
+				 *
+				 * Alternatively, re-arranging the terms from the factorials,
+				 * this may be written as
+				 *
+				 * n * (1 - product((N-p-i)/(N-i), i=0..N/n-1))
+				 *
+				 * This form of the formula is more efficient to compute in
+				 * the common case where p is larger than N/n.  Additionally,
+				 * as pointed out by Dell'Era, if i << N for all terms in the
+				 * product, it can be approximated by
+				 *
+				 * n * (1 - ((N-p)/N)^(N/n))
+				 *
+				 * See "Expected distinct values when selecting from a bag
+				 * without replacement", Alberto Dell'Era,
+				 * http://www.adellera.it/investigations/distinct_balls/.
+				 *
+				 * The condition i << N is equivalent to n >> 1, so this is a
+				 * good approximation when the number of distinct values in
+				 * the table is large.  It turns out that this formula also
+				 * works well even when n is small.
+				 */
+				reldistinct *=
+					(1 - pow((rel->tuples - rel->rows) / rel->tuples,
+							 rel->tuples / reldistinct));
+			}
+			reldistinct = clamp_row_est(reldistinct);
 
 			/*
 			 * Update estimate of total distinct groups.
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
new file mode 100644
index de64ca7..0fc93d9
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -807,27 +807,24 @@ select * from int4_tbl where
 explain (verbose, costs off)
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
-                              QUERY PLAN                              
-----------------------------------------------------------------------
- Hash Join
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join
    Output: o.f1
    Hash Cond: (o.f1 = "ANY_subquery".f1)
    ->  Seq Scan on public.int4_tbl o
          Output: o.f1
    ->  Hash
          Output: "ANY_subquery".f1, "ANY_subquery".g
-         ->  HashAggregate
+         ->  Subquery Scan on "ANY_subquery"
                Output: "ANY_subquery".f1, "ANY_subquery".g
-               Group Key: "ANY_subquery".f1, "ANY_subquery".g
-               ->  Subquery Scan on "ANY_subquery"
-                     Output: "ANY_subquery".f1, "ANY_subquery".g
-                     Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
-                     ->  HashAggregate
-                           Output: i.f1, (generate_series(1, 2) / 10)
-                           Group Key: i.f1
-                           ->  Seq Scan on public.int4_tbl i
-                                 Output: i.f1
-(18 rows)
+               Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+               ->  HashAggregate
+                     Output: i.f1, (generate_series(1, 2) / 10)
+                     Group Key: i.f1
+                     ->  Seq Scan on public.int4_tbl i
+                           Output: i.f1
+(15 rows)
 
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#33)
Re: improving GROUP BY estimation

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

Here's an updated patch with references to both papers, and a more
detailed justification for the formula, along with the other changes
discussed. Note that although equation (2) in the Dell'Era paper looks
different from the Yao formula, it's actually the same.

Looks good to me.

regards, tom lane
