pgsql: Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations than individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collecting n-distinct coefficients
on user-specified sets of columns in a single table. This is useful
for estimating the number of distinct groups in GROUP BY and DISTINCT
clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
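As a quick sketch of the new DDL (syntax as in the released PostgreSQL 10 documentation, which may differ slightly from this commit's initial grammar; table, column, and statistics names here are invented):

```sql
-- Two columns whose values are correlated (b is derived from a), so
-- per-column n-distinct estimates multiply into a large overestimate.
CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 100, (i % 100) / 10 FROM generate_series(1, 10000) i;

-- Declare the statistics object; ANALYZE fills in the n-distinct
-- coefficients, stored using the new pg_ndistinct pseudo-type.
CREATE STATISTICS s_t_ab (ndistinct) ON a, b FROM t;
ANALYZE t;

-- The planner can now use the multivariate estimate for this grouping:
EXPLAIN SELECT count(*) FROM t GROUP BY a, b;

-- Companion DDL added by the same commit:
ALTER STATISTICS s_t_ab RENAME TO s_ab;
DROP STATISTICS s_ab;
```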
(n-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
/messages/by-id/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: /messages/by-id/543AFA15.4080608@fuzzy.cz
/messages/by-id/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
Branch
------
master
Details
-------
http://git.postgresql.org/pg/commitdiff/7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b
Modified Files
--------------
doc/src/sgml/catalogs.sgml | 97 ++++
doc/src/sgml/func.sgml | 36 +-
doc/src/sgml/ref/allfiles.sgml | 3 +
doc/src/sgml/ref/alter_statistics.sgml | 115 ++++
doc/src/sgml/ref/alter_table.sgml | 9 +-
doc/src/sgml/ref/comment.sgml | 6 +-
doc/src/sgml/ref/create_statistics.sgml | 155 ++++++
doc/src/sgml/ref/drop_statistics.sgml | 98 ++++
doc/src/sgml/reference.sgml | 3 +
src/backend/Makefile | 2 +-
src/backend/catalog/Makefile | 1 +
src/backend/catalog/aclchk.c | 35 ++
src/backend/catalog/dependency.c | 9 +
src/backend/catalog/heap.c | 74 +++
src/backend/catalog/namespace.c | 56 ++
src/backend/catalog/objectaddress.c | 55 ++
src/backend/catalog/pg_shdepend.c | 2 +
src/backend/catalog/system_views.sql | 10 +
src/backend/commands/Makefile | 6 +-
src/backend/commands/alter.c | 8 +
src/backend/commands/analyze.c | 23 +-
src/backend/commands/dropcmds.c | 7 +
src/backend/commands/event_trigger.c | 3 +
src/backend/commands/statscmds.c | 296 ++++++++++
src/backend/nodes/copyfuncs.c | 17 +
src/backend/nodes/equalfuncs.c | 15 +
src/backend/nodes/outfuncs.c | 31 ++
src/backend/optimizer/util/plancat.c | 67 ++-
src/backend/parser/gram.y | 62 ++-
src/backend/statistics/Makefile | 17 +
src/backend/statistics/README | 34 ++
src/backend/statistics/extended_stats.c | 389 +++++++++++++
src/backend/statistics/mvdistinct.c | 671 +++++++++++++++++++++++
src/backend/tcop/utility.c | 12 +
src/backend/utils/adt/ruleutils.c | 81 +++
src/backend/utils/adt/selfuncs.c | 181 +++++-
src/backend/utils/cache/relcache.c | 79 +++
src/backend/utils/cache/syscache.c | 23 +
src/bin/pg_dump/common.c | 4 +
src/bin/pg_dump/pg_backup_archiver.c | 3 +-
src/bin/pg_dump/pg_dump.c | 153 ++++++
src/bin/pg_dump/pg_dump.h | 9 +
src/bin/pg_dump/pg_dump_sort.c | 12 +-
src/bin/psql/describe.c | 51 ++
src/include/catalog/catversion.h | 2 +-
src/include/catalog/dependency.h | 1 +
src/include/catalog/heap.h | 1 +
src/include/catalog/indexing.h | 7 +
src/include/catalog/namespace.h | 2 +
src/include/catalog/pg_cast.h | 4 +
src/include/catalog/pg_proc.h | 11 +
src/include/catalog/pg_statistic_ext.h | 75 +++
src/include/catalog/pg_type.h | 4 +
src/include/catalog/toasting.h | 1 +
src/include/commands/defrem.h | 4 +
src/include/nodes/nodes.h | 2 +
src/include/nodes/parsenodes.h | 15 +
src/include/nodes/relation.h | 19 +
src/include/statistics/extended_stats_internal.h | 64 +++
src/include/statistics/statistics.h | 47 ++
src/include/utils/acl.h | 2 +
src/include/utils/rel.h | 4 +
src/include/utils/relcache.h | 1 +
src/include/utils/syscache.h | 2 +
src/test/regress/expected/alter_generic.out | 45 +-
src/test/regress/expected/object_address.out | 7 +-
src/test/regress/expected/opr_sanity.out | 3 +-
src/test/regress/expected/rules.out | 8 +
src/test/regress/expected/sanity_check.out | 1 +
src/test/regress/expected/stats_ext.out | 155 ++++++
src/test/regress/expected/type_sanity.out | 13 +-
src/test/regress/parallel_schedule | 2 +-
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/alter_generic.sql | 31 ++
src/test/regress/sql/object_address.sql | 4 +-
src/test/regress/sql/stats_ext.sql | 102 ++++
src/test/regress/sql/type_sanity.sql | 2 +-
77 files changed, 3599 insertions(+), 63 deletions(-)
--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers
On Fri, Mar 24, 2017 at 1:16 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Implement multivariate n-distinct coefficients
dromedary and arapaima have failures like this, which seems likely
related to this commit:
EXPLAIN
SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
QUERY PLAN
---------------------------------------------------------------------
! HashAggregate (cost=225.00..235.00 rows=1000 width=16)
Group Key: a, d
! -> Seq Scan on ndistinct (cost=0.00..150.00 rows=10000 width=8)
(3 rows)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas wrote:
> On Fri, Mar 24, 2017 at 1:16 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Implement multivariate n-distinct coefficients
> dromedary and arapaima have failures like this, which seems likely
> related to this commit:
> EXPLAIN
> SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
> QUERY PLAN
> ---------------------------------------------------------------------
> ! HashAggregate (cost=225.00..235.00 rows=1000 width=16)
>   Group Key: a, d
> ! -> Seq Scan on ndistinct (cost=0.00..150.00 rows=10000 width=8)
> (3 rows)
Yes.  What seems to be going on here is that both arapaima and
dromedary are 32-bit machines; all the 64-bit ones are passing (except
for prion, which showed a real relcache bug that I already stomped).
Now, the difference is that the total cost of the seqscan on those
machines is 150 instead of 155.  Tomas suggests that this happens
because MAXALIGN is different, leading to tuples being packed
differently: the expected cost (on our 64-bit laptops) is 155, and the
cost we get on 32-bit arches is 150 -- a difference of 5 pages.  We
insert 10,000 rows into the table; 4 extra bytes per tuple amounts to
40 kB, which is exactly 5 pages.
I'll push an alternate expected file for this test, which we think is
the simplest fix.
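The page arithmetic can be sanity-checked directly (assuming the default 8 kB block size):

```sql
-- 10,000 tuples, 4 extra alignment bytes each, against 8192-byte pages:
SELECT 10000 * 4          AS extra_bytes,  -- 40000 bytes, i.e. ~40 kB
       10000 * 4 / 8192.0 AS extra_pages;  -- ~4.9, matching the observed
                                           -- 5-page cost difference
```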
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Robert Haas wrote:
> > dromedary and arapaima have failures like this, which seems likely
> > related to this commit:
> > EXPLAIN
> > SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
> > QUERY PLAN
> > ---------------------------------------------------------------------
> > ! HashAggregate (cost=225.00..235.00 rows=1000 width=16)
> >   Group Key: a, d
> > ! -> Seq Scan on ndistinct (cost=0.00..150.00 rows=10000 width=8)
> > (3 rows)
> Yes.  What seems to be going on here is that both arapaima and
> dromedary are 32-bit machines; all the 64-bit ones are passing (except
> for prion, which showed a real relcache bug that I already stomped).
> Now, the difference is that the total cost of the seqscan on those
> machines is 150 instead of 155.  Tomas suggests that this happens
> because MAXALIGN is different, leading to tuples being packed
> differently: the expected cost (on our 64-bit laptops) is 155, and the
> cost we get on 32-bit arches is 150 -- a difference of 5 pages.  We
> insert 10,000 rows into the table; 4 extra bytes per tuple amounts to
> 40 kB, which is exactly 5 pages.
> I'll push an alternate expected file for this test, which we think is
> the simplest fix.
Why not use COSTS OFF? Or I'll put that even more strongly: all the
existing regression tests use COSTS OFF, exactly to avoid this sort of
machine-dependent output. There had better be a really damn good
reason not to use it here.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane wrote:
> Why not use COSTS OFF?  Or I'll put that even more strongly: all the
> existing regression tests use COSTS OFF, exactly to avoid this sort of
> machine-dependent output.  There had better be a really damn good
> reason not to use it here.
If we use COSTS OFF, the test is completely pointless, as the plans look
identical regardless of whether the multivariate stats are being used or
not.
If we had a ROWS option to EXPLAIN that showed the estimated number of
rows but not the cost, that would be an option.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
> > Why not use COSTS OFF?  Or I'll put that even more strongly: all the
> > existing regression tests use COSTS OFF, exactly to avoid this sort of
> > machine-dependent output.  There had better be a really damn good
> > reason not to use it here.
> If we use COSTS OFF, the test is completely pointless, as the plans look
> identical regardless of whether the multivariate stats are being used or
> not.

Well, I think you are going to find that the exact costs are far too
fragile to have in the regression test output.  Just because you wish
you could test them this way doesn't mean you can.

> If we had a ROWS option to EXPLAIN that showed the estimated number of
> rows but not the cost, that would be an option.

Unlikely to be any better.  All these numbers are subject to lots of
noise, e.g. due to auto-analyze happening at unexpected times, random
sampling during analyze, etc.  If you try to constrain the test case
enough that none of that happens, I wonder how useful it will really be.
What I would suggest is devising a test case whereby you actually
get a different plan shape now than you did before. That shouldn't
be too terribly hard, or else what was the point?
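A sketch of what such a plan-shape test might look like (all names invented; whether the strategy actually flips depends on work_mem and row counts): with only per-column statistics the planner multiplies n-distinct estimates and can expect far more groups than exist, which may push it toward Sort + GroupAggregate; an ndistinct statistics object corrects the group count, and EXPLAIN (COSTS OFF) would then show a machine-independent plan-shape difference.

```sql
-- Hypothetical test: a and b always move together, so there are only
-- 1,000 real groups, but per-column estimates suggest up to
-- 1,000 * 1,000 groups.
CREATE TABLE grp (a int, b int);
INSERT INTO grp SELECT i % 1000, i % 1000 FROM generate_series(1, 100000) i;
ANALYZE grp;
-- Overestimated group count; may choose GroupAggregate:
EXPLAIN (COSTS OFF) SELECT count(*) FROM grp GROUP BY a, b;

CREATE STATISTICS grp_nd (ndistinct) ON a, b FROM grp;
ANALYZE grp;
-- Corrected group count; may now choose HashAggregate:
EXPLAIN (COSTS OFF) SELECT count(*) FROM grp GROUP BY a, b;
```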
regards, tom lane