pgsql: Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations than individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collecting n-distinct coefficients
on user-specified sets of columns in a single table. This is useful
for estimating the number of distinct groups in GROUP BY and DISTINCT
clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
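As a quick sketch of the new DDL (syntax as in the released PostgreSQL 10 documentation, which may differ slightly from this commit's initial grammar; table, column, and statistics names here are invented):

```sql
-- Two columns whose values are correlated (b is derived from a), so
-- per-column n-distinct estimates multiply into a large overestimate.
CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 100, (i % 100) / 10 FROM generate_series(1, 10000) i;

-- Declare the statistics object; ANALYZE fills in the n-distinct
-- coefficients, stored using the new pg_ndistinct pseudo-type.
CREATE STATISTICS s_t_ab (ndistinct) ON a, b FROM t;
ANALYZE t;

-- The planner can now use the multivariate estimate for this grouping:
EXPLAIN SELECT count(*) FROM t GROUP BY a, b;

-- Companion DDL added by the same commit:
ALTER STATISTICS s_t_ab RENAME TO s_ab;
DROP STATISTICS s_ab;
```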
(n-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
/messages/by-id/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: /messages/by-id/543AFA15.4080608@fuzzy.cz
/messages/by-id/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
Branch
------
master
Details
-------
http://git.postgresql.org/pg/commitdiff/7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b
Modified Files
--------------
doc/src/sgml/catalogs.sgml | 97 ++++
doc/src/sgml/func.sgml | 36 +-
doc/src/sgml/ref/allfiles.sgml | 3 +
doc/src/sgml/ref/alter_statistics.sgml | 115 ++++
doc/src/sgml/ref/alter_table.sgml | 9 +-
doc/src/sgml/ref/comment.sgml | 6 +-
doc/src/sgml/ref/create_statistics.sgml | 155 ++++++
doc/src/sgml/ref/drop_statistics.sgml | 98 ++++
doc/src/sgml/reference.sgml | 3 +
src/backend/Makefile | 2 +-
src/backend/catalog/Makefile | 1 +
src/backend/catalog/aclchk.c | 35 ++
src/backend/catalog/dependency.c | 9 +
src/backend/catalog/heap.c | 74 +++
src/backend/catalog/namespace.c | 56 ++
src/backend/catalog/objectaddress.c | 55 ++
src/backend/catalog/pg_shdepend.c | 2 +
src/backend/catalog/system_views.sql | 10 +
src/backend/commands/Makefile | 6 +-
src/backend/commands/alter.c | 8 +
src/backend/commands/analyze.c | 23 +-
src/backend/commands/dropcmds.c | 7 +
src/backend/commands/event_trigger.c | 3 +
src/backend/commands/statscmds.c | 296 ++++++++++
src/backend/nodes/copyfuncs.c | 17 +
src/backend/nodes/equalfuncs.c | 15 +
src/backend/nodes/outfuncs.c | 31 ++
src/backend/optimizer/util/plancat.c | 67 ++-
src/backend/parser/gram.y | 62 ++-
src/backend/statistics/Makefile | 17 +
src/backend/statistics/README | 34 ++
src/backend/statistics/extended_stats.c | 389 +++++++++++++
src/backend/statistics/mvdistinct.c | 671 +++++++++++++++++++++++
src/backend/tcop/utility.c | 12 +
src/backend/utils/adt/ruleutils.c | 81 +++
src/backend/utils/adt/selfuncs.c | 181 +++++-
src/backend/utils/cache/relcache.c | 79 +++
src/backend/utils/cache/syscache.c | 23 +
src/bin/pg_dump/common.c | 4 +
src/bin/pg_dump/pg_backup_archiver.c | 3 +-
src/bin/pg_dump/pg_dump.c | 153 ++++++
src/bin/pg_dump/pg_dump.h | 9 +
src/bin/pg_dump/pg_dump_sort.c | 12 +-
src/bin/psql/describe.c | 51 ++
src/include/catalog/catversion.h | 2 +-
src/include/catalog/dependency.h | 1 +
src/include/catalog/heap.h | 1 +
src/include/catalog/indexing.h | 7 +
src/include/catalog/namespace.h | 2 +
src/include/catalog/pg_cast.h | 4 +
src/include/catalog/pg_proc.h | 11 +
src/include/catalog/pg_statistic_ext.h | 75 +++
src/include/catalog/pg_type.h | 4 +
src/include/catalog/toasting.h | 1 +
src/include/commands/defrem.h | 4 +
src/include/nodes/nodes.h | 2 +
src/include/nodes/parsenodes.h | 15 +
src/include/nodes/relation.h | 19 +
src/include/statistics/extended_stats_internal.h | 64 +++
src/include/statistics/statistics.h | 47 ++
src/include/utils/acl.h | 2 +
src/include/utils/rel.h | 4 +
src/include/utils/relcache.h | 1 +
src/include/utils/syscache.h | 2 +
src/test/regress/expected/alter_generic.out | 45 +-
src/test/regress/expected/object_address.out | 7 +-
src/test/regress/expected/opr_sanity.out | 3 +-
src/test/regress/expected/rules.out | 8 +
src/test/regress/expected/sanity_check.out | 1 +
src/test/regress/expected/stats_ext.out | 155 ++++++
src/test/regress/expected/type_sanity.out | 13 +-
src/test/regress/parallel_schedule | 2 +-
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/alter_generic.sql | 31 ++
src/test/regress/sql/object_address.sql | 4 +-
src/test/regress/sql/stats_ext.sql | 102 ++++
src/test/regress/sql/type_sanity.sql | 2 +-
77 files changed, 3599 insertions(+), 63 deletions(-)
--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers
On Fri, Mar 24, 2017 at 1:16 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Implement multivariate n-distinct coefficients
dromedary and arapaima have failures like this, which seems likely
related to this commit:
EXPLAIN
SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
QUERY PLAN
---------------------------------------------------------------------
! HashAggregate (cost=225.00..235.00 rows=1000 width=16)
Group Key: a, d
! -> Seq Scan on ndistinct (cost=0.00..150.00 rows=10000 width=8)
(3 rows)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas wrote:
> On Fri, Mar 24, 2017 at 1:16 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Implement multivariate n-distinct coefficients
> dromedary and arapaima have failures like this, which seems likely
> related to this commit:
> EXPLAIN
> SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
> QUERY PLAN
> ---------------------------------------------------------------------
> ! HashAggregate (cost=225.00..235.00 rows=1000 width=16)
>   Group Key: a, d
> ! -> Seq Scan on ndistinct (cost=0.00..150.00 rows=10000 width=8)
> (3 rows)
Yes.  What seems to be going on here is that both arapaima and
dromedary are 32-bit machines; all the 64-bit ones are passing (except
for prion, which showed a real relcache bug that I already stomped).
Now, the difference is that the total cost of the seqscan on those
machines is 150 instead of 155.  Tomas suggests that this happens
because MAXALIGN is different, leading to tuples being packed
differently: the expected cost (on our 64-bit laptops) is 155, and the
cost we get on 32-bit arches is 150 -- a difference of 5 pages.  We
insert 10,000 rows into the table; 4 extra bytes per tuple amounts to
40 kB, which is exactly 5 pages.
I'll push an alternate expected file for this test, which we think is
the simplest fix.
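The page arithmetic can be sanity-checked directly (assuming the default 8 kB block size):

```sql
-- 10,000 tuples, 4 extra alignment bytes each, against 8192-byte pages:
SELECT 10000 * 4          AS extra_bytes,  -- 40000 bytes, i.e. ~40 kB
       10000 * 4 / 8192.0 AS extra_pages;  -- ~4.9, matching the observed
                                           -- 5-page cost difference
```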
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Robert Haas wrote:
> > dromedary and arapaima have failures like this, which seems likely
> > related to this commit:
> > EXPLAIN
> > SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
> > QUERY PLAN
> > ---------------------------------------------------------------------
> > ! HashAggregate (cost=225.00..235.00 rows=1000 width=16)
> >   Group Key: a, d
> > ! -> Seq Scan on ndistinct (cost=0.00..150.00 rows=10000 width=8)
> > (3 rows)
> Yes.  What seems to be going on here is that both arapaima and
> dromedary are 32-bit machines; all the 64-bit ones are passing (except
> for prion, which showed a real relcache bug that I already stomped).
> Now, the difference is that the total cost of the seqscan on those
> machines is 150 instead of 155.  Tomas suggests that this happens
> because MAXALIGN is different, leading to tuples being packed
> differently: the expected cost (on our 64-bit laptops) is 155, and the
> cost we get on 32-bit arches is 150 -- a difference of 5 pages.  We
> insert 10,000 rows into the table; 4 extra bytes per tuple amounts to
> 40 kB, which is exactly 5 pages.
> I'll push an alternate expected file for this test, which we think is
> the simplest fix.
Why not use COSTS OFF? Or I'll put that even more strongly: all the
existing regression tests use COSTS OFF, exactly to avoid this sort of
machine-dependent output. There had better be a really damn good
reason not to use it here.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane wrote:
> Why not use COSTS OFF?  Or I'll put that even more strongly: all the
> existing regression tests use COSTS OFF, exactly to avoid this sort of
> machine-dependent output.  There had better be a really damn good
> reason not to use it here.
If we use COSTS OFF, the test is completely pointless, as the plans look
identical regardless of whether the multivariate stats are being used or
not.
If we had a ROWS option to EXPLAIN that showed the estimated number of
rows but not the cost, that would be an option.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
> > Why not use COSTS OFF?  Or I'll put that even more strongly: all the
> > existing regression tests use COSTS OFF, exactly to avoid this sort of
> > machine-dependent output.  There had better be a really damn good
> > reason not to use it here.
> If we use COSTS OFF, the test is completely pointless, as the plans look
> identical regardless of whether the multivariate stats are being used or
> not.

Well, I think you are going to find that the exact costs are far too
fragile to have in the regression test output.  Just because you wish
you could test them this way doesn't mean you can.

> If we had a ROWS option to EXPLAIN that showed the estimated number of
> rows but not the cost, that would be an option.

Unlikely to be any better.  All these numbers are subject to lots of
noise, e.g. due to auto-analyze happening at unexpected times, random
sampling during analyze, etc.  If you try to constrain the test case
enough that none of that happens, I wonder how useful it will really be.
What I would suggest is devising a test case whereby you actually
get a different plan shape now than you did before. That shouldn't
be too terribly hard, or else what was the point?
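A sketch of what such a plan-shape test might look like (all names invented; whether the strategy actually flips depends on work_mem and row counts): with only per-column statistics the planner multiplies n-distinct estimates and can expect far more groups than exist, which may push it toward Sort + GroupAggregate; an ndistinct statistics object corrects the group count, and EXPLAIN (COSTS OFF) would then show a machine-independent plan-shape difference.

```sql
-- Hypothetical test: a and b always move together, so there are only
-- 1,000 real groups, but per-column estimates suggest up to
-- 1,000 * 1,000 groups.
CREATE TABLE grp (a int, b int);
INSERT INTO grp SELECT i % 1000, i % 1000 FROM generate_series(1, 100000) i;
ANALYZE grp;
-- Overestimated group count; may choose GroupAggregate:
EXPLAIN (COSTS OFF) SELECT count(*) FROM grp GROUP BY a, b;

CREATE STATISTICS grp_nd (ndistinct) ON a, b FROM grp;
ANALYZE grp;
-- Corrected group count; may now choose HashAggregate:
EXPLAIN (COSTS OFF) SELECT count(*) FROM grp GROUP BY a, b;
```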
regards, tom lane