WIP: multivariate statistics / proof of concept
Hi,
attached is a WIP patch implementing multivariate statistics. The code
certainly is not "ready" - parts of it look as if written by a rogue
chimp who got bored of attempts to type the complete works of William
Shakespeare, and decided to try something different.
I also cut some corners to make it work, and those limitations need to
be fixed before the eventual commit (those are not difficult problems,
but were not necessary for a proof-of-concept patch).
It however seems to be working sufficiently well at this point, enough
to get some useful feedback. So here we go.
I expect to be busy over the next two weeks because of travel, so sorry
for somewhat delayed responses. If you happen to attend pgconf.eu next
week (Oct 20-24), we can of course discuss this patch in person.
Goals and basics
----------------
The goal of this patch is to allow users to define multivariate
statistics (i.e. statistics on multiple columns), and to improve
estimation when the columns are correlated.
Take for example a table like this:
CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test SELECT i/10000, i/10000, i/10000
FROM generate_series(1,1000000) s(i);
ANALYZE test;
and do a query like this:
SELECT * FROM test WHERE (a = 10) AND (b = 10) AND (c = 10);
which is estimated like this:
QUERY PLAN
---------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=1 width=12)
Filter: ((a = 10) AND (b = 10) AND (c = 10))
Planning time: 0.142 ms
(3 rows)
The query of course returns 10,000 rows, but the planner assumes the
columns are independent and thus multiplies the selectivities. And 1/100
for each column means 1/1,000,000 in total, which is 1 row.
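To spell out that arithmetic, here is a rough sketch of the independence assumption (illustrative only, not actual planner code):

```python
# Sketch of the planner's independence assumption for the example above.
# Each column has ~100 distinct values, so each equality clause gets
# selectivity ~1/100; the planner multiplies the three selectivities.
n_rows = 1_000_000
per_column_sel = 1 / 100

combined_sel = per_column_sel ** 3        # (a = 10) AND (b = 10) AND (c = 10)
estimated_rows = n_rows * combined_sel    # 1,000,000 * 1/1,000,000 = 1 row

# In reality the three columns are identical, so the true selectivity of
# the whole condition is just 1/100:
actual_rows = n_rows * per_column_sel     # 10,000 rows
```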
This example is of course somewhat artificial, but the problem is far
from uncommon, especially in denormalized datasets (e.g. star schema).
If you ever got an index scan instead of a sequential scan due to poor
estimate, resulting in a query running for hours instead of seconds, you
know the pain.
The patch allows you to do this:
ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;
which then results in this estimate:
QUERY PLAN
------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=9667 width=12)
Filter: ((a = 10) AND (b = 10) AND (c = 10))
Planning time: 0.110 ms
(3 rows)
This however is not free - both building such statistics (during
ANALYZE) and using it (during planning) costs some cycles. Even if we
optimize the hell out of it, it won't be entirely free.
One of the design goals of this patch is not to make ANALYZE or
planning more expensive unless you add such statistics.
Those who add such statistics probably decided that the price is worth
the improved estimates, and lower risk of inefficient plans. If the
planning takes a few more milliseconds, it's probably worth it if you
risk queries running for minutes or hours because of misestimates.
It also does not guarantee that the estimates will always be better.
There will be misestimates, although rather in the other direction (the
independence assumption usually leads to underestimates, while this may
lead to overestimates). However, based on my experience from writing the
patch, I believe it's possible to reasonably limit the extent of such
errors (just like with single-column histograms, it's related to the
bucket size).
Of course, there will be cases when the old approach is lucky by
accident - there's not much we can do to beat luck. And we can't rely on
it either.
Design overview
---------------
The patch adds a new system catalog, called pg_mv_statistic, which is
used to keep track of requested statistics. There's also a pg_mv_stats
view, showing some basic info about the stats (not all the data).
There are three kinds of statistics:
- list of most common combinations of values (MCV list)
- multi-dimensional histogram
- associative rules
The first two are extensions of the single-column stats we already have.
The MCV list is a trivial extension to multiple dimensions, just
tracking combinations and frequencies. The histogram is more complex -
the structure is quite simple (multi-dimensional rectangles) but there's
a lot of ways to build it. But even the current naive and simple
implementation seems to work quite well.
The last kind (associative rules) is an attempt to track "implications"
between columns. It is however an experiment and it's not really used in
the patch so I'll ignore it for now.
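To illustrate how a multivariate MCV list can improve a conjunction estimate, here is a hypothetical sketch (the function and data are made up for illustration; this is not the patch's code):

```python
# Hypothetical sketch: estimating a conjunction of equality clauses from a
# multivariate MCV list, falling back to the independence assumption when
# the combination of values is not tracked.

def estimate(mcv, per_column_sel, values):
    """mcv maps tuples of values to their frequency; values is the queried
    tuple; per_column_sel are the single-column selectivities."""
    if values in mcv:
        return mcv[values]
    # combination not in the MCV list: fall back to multiplying the
    # single-column selectivities (the independence assumption)
    sel = 1.0
    for s in per_column_sel:
        sel *= s
    return sel

mcv = {(10, 10, 10): 0.01}          # 1% of rows have a=10, b=10, c=10
per_column_sel = [0.01, 0.01, 0.01]

estimate(mcv, per_column_sel, (10, 10, 10))   # 0.01 -> ~10,000 rows of 1M
estimate(mcv, per_column_sel, (1, 2, 3))      # ~1e-6 -> ~1 row
```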
I'm not going to explain all the implementation details here - if you
want to learn more, the best way is probably by reading the changes in
those files (probably in this order):
src/include/utils/mvstats.h
src/backend/commands/analyze.c
src/backend/optimizer/path/clausesel.c
I tried to explain the ideas thoroughly in the comments, along with a
lot of TODO/FIXME items related to limitations, explained in the next
section.
Limitations
-----------
As I mentioned, the current patch has a number of practical limitations,
most importantly:
(a) only data types passed by value (no varlena types)
(b) only data types with sort (to be able to build histogram)
(c) no NULL values supported
(d) not handling DROP COLUMN or DROP TABLE and such
(e) limited to stats on 8 columns (max)
(f) optimizer uses single stats per table
(g) limited list of compatible WHERE clauses
(h) incomplete ADD STATISTICS syntax
The first three limitations are really a shortcut to a working patch, and
fixing them should not be difficult.
The limited number of columns is really just a sanity check. It's
possible to increase it, but I doubt stats on more columns will be
practical because of excessive size or poor accuracy.
A better approach is to support combining multiple stats, defined on
various subsets of columns. This is not implemented at the moment, but
it's certainly on the roadmap. Currently the "smallest" stats covering
the most columns is selected.
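A hypothetical sketch of that "pick one statistics object" heuristic (names and the exact tie-breaking rule are my illustration, not the patch's code):

```python
# Hypothetical sketch: among the stats defined on this table, pick the one
# covering the most of the queried columns; break ties by preferring the
# "smallest" stats (fewest total columns).

def choose_stats(stats, query_columns):
    q = set(query_columns)
    best = None
    for cols in stats:                 # each stats object is a set of columns
        covered = len(q & set(cols))
        if covered < 2:
            continue                   # multivariate stats need >= 2 columns
        key = (covered, -len(cols))    # most covered, then fewest columns
        if best is None or key > best[0]:
            best = (key, cols)
    return best[1] if best else None

stats = [{'a', 'b'}, {'a', 'b', 'c', 'd'}, {'a', 'b', 'c'}]
choose_stats(stats, ['a', 'b', 'c'])   # -> {'a', 'b', 'c'}
```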
Regarding the compatible WHERE clauses, the patch currently handles
conditions of the form
column OPERATOR constant
where the operator is one of the comparison operators (=, <, >, <=, >=). In
the future it's possible to add support for more conditions, e.g.
"column IS NULL" or "column OPERATOR column".
The last point is really just "unfinished implementation" - the syntax I
propose is this:
ALTER TABLE ... ADD STATISTICS (options) ON (columns)
where the options influence the MCV list and histogram size, etc. The
options are recognized and may give you an idea of what they might do,
but they're not really used at the moment (except for being stored in
the pg_mv_statistic catalog).
Examples
--------
Let's see a few examples of how to define the stats, and what difference
in estimates it makes:
CREATE TABLE test (a INT, b INT, c INT);
-- same value in all columns
INSERT INTO test SELECT mod(i,100), mod(i,100), mod(i,100)
FROM generate_series(1,1000000) s(i);
ANALYZE test;
=============== no multivariate stats ============================
SELECT * FROM test WHERE a = 10 AND b = 10;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..20406.00 rows=101 width=12)
(actual time=0.007..60.902 rows=10000 loops=1)
Filter: ((a = 10) AND (b = 10))
Rows Removed by Filter: 990000
Planning time: 0.119 ms
Execution time: 61.164 ms
(5 rows)
SELECT * FROM test WHERE a = 10 AND b = 10 AND c = 10;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=1 width=12)
(actual time=0.010..56.780 rows=10000 loops=1)
Filter: ((a = 10) AND (b = 10) AND (c = 10))
Rows Removed by Filter: 990000
Planning time: 0.061 ms
Execution time: 56.994 ms
(5 rows)
=============== with multivariate stats ===========================
ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;
SELECT * FROM test WHERE a = 10 AND b = 10;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..20406.00 rows=10767 width=12)
(actual time=0.007..58.981 rows=10000 loops=1)
Filter: ((a = 10) AND (b = 10))
Rows Removed by Filter: 990000
Planning time: 0.114 ms
Execution time: 59.214 ms
(5 rows)
SELECT * FROM test WHERE a = 10 AND b = 10 AND c = 10;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=10767 width=12)
(actual time=0.008..61.838 rows=10000 loops=1)
Filter: ((a = 10) AND (b = 10) AND (c = 10))
Rows Removed by Filter: 990000
Planning time: 0.088 ms
Execution time: 62.057 ms
(5 rows)
OK, that was a rather significant improvement, but it's also a trivial
dataset. Let's see something more complicated - the following table has
correlated columns with distributions skewed towards 0.
CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test SELECT r*MOD(i,50),
pow(r,2)*MOD(i,100),
pow(r,4)*MOD(i,500)
FROM (SELECT random() AS r, i
FROM generate_series(1,1000000) s(i)) foo;
ANALYZE test;
SELECT * FROM test WHERE a = 0 AND b = 0;
=============== no multivariate stats ============================
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..20406.00 rows=9024 width=12)
(actual time=0.007..62.969 rows=49503 loops=1)
Filter: ((a = 0) AND (b = 0))
Rows Removed by Filter: 950497
Planning time: 0.057 ms
Execution time: 64.098 ms
(5 rows)
SELECT * FROM test WHERE a = 0 AND b = 0 AND c = 0;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=2126 width=12)
(actual time=0.008..63.862 rows=40770 loops=1)
Filter: ((a = 0) AND (b = 0) AND (c = 0))
Rows Removed by Filter: 959230
Planning time: 0.060 ms
Execution time: 64.794 ms
(5 rows)
=============== with multivariate stats ============================
ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;
db=> SELECT * FROM pg_mv_stats;
schemaname | public
tablename | test
attnums | 1 2 3
mcvbytes | 25904
mcvinfo | nitems=809
histbytes | 568240
histinfo | nbuckets=13772
SELECT * FROM test WHERE a = 0 AND b = 0;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..20406.00 rows=47717 width=12)
(actual time=0.007..61.782 rows=49503 loops=1)
Filter: ((a = 0) AND (b = 0))
Rows Removed by Filter: 950497
Planning time: 3.181 ms
Execution time: 62.859 ms
(5 rows)
SELECT * FROM test WHERE a = 0 AND b = 0 AND c = 0;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=40567 width=12)
(actual time=0.009..66.685 rows=40770 loops=1)
Filter: ((a = 0) AND (b = 0) AND (c = 0))
Rows Removed by Filter: 959230
Planning time: 0.188 ms
Execution time: 67.593 ms
(5 rows)
regards
Tomas
Attachment: multivar-stats-v1.patch (text/x-diff)
Tomas Vondra wrote:
attached is a WIP patch implementing multivariate statistics.
I think that is pretty useful.
Oracle has an identical feature called "extended statistics".
That's probably an entirely different thing, but it would be very
nice to have statistics to estimate the correlation between columns
of different tables, to improve the estimate for the number of rows
in a join.
Yours,
Laurenz Albe
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi!
On 13.10.2014 09:36, Albe Laurenz wrote:
Tomas Vondra wrote:
attached is a WIP patch implementing multivariate statistics.
I think that is pretty useful.
Oracle has an identical feature called "extended statistics".
That's probably an entirely different thing, but it would be very
nice to have statistics to estimate the correlation between columns
of different tables, to improve the estimate for the number of rows
in a join.
I don't have a clear idea of how that should work, but from the quick
look at how join selectivity estimation is implemented, I believe two
things might be possible:
(a) using conditional probabilities
Say we have a join "ta JOIN tb ON (ta.x = tb.y)"
Currently, the selectivity is derived from stats on the two keys.
Essentially probabilities P(x), P(y), represented by the MCV lists.
But if there are additional WHERE conditions on the tables, and we
have suitable multivariate stats, it's possible to use conditional
probabilities.
E.g. if the query actually uses
... ta JOIN tb ON (ta.x = tb.y) WHERE ta.z = 10
and we have stats on (ta.x, ta.z), we can use P(x|z=10) instead.
If the two columns are correlated, this might be much different.
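The difference between the marginal and the conditional estimate can be sketched with a toy joint distribution (made-up numbers, purely illustrative):

```python
# Hypothetical sketch: deriving P(x | z = 10) from a joint distribution on
# (x, z), instead of the marginal P(x) the planner uses today.

joint = {                      # P(x, z) for a toy distribution
    (1, 10): 0.30, (2, 10): 0.10,
    (1, 20): 0.05, (2, 20): 0.55,
}

def conditional(joint, z_value):
    """Return P(x | z = z_value) for every x in the joint distribution."""
    p_z = sum(p for (x, z), p in joint.items() if z == z_value)
    return {x: p / p_z for (x, z), p in joint.items() if z == z_value}

# marginal: P(x = 1) = 0.30 + 0.05 = 0.35
marginal_x1 = sum(p for (x, z), p in joint.items() if x == 1)

# conditional: P(x = 1 | z = 10) = 0.30 / 0.40 = 0.75, quite different
conditional(joint, 10)[1]
```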
(b) using this for multi-column conditions
If the join condition involves multiple columns, e.g.
ON (ta.x = tb.y AND ta.p = tb.q)
and we happen to have stats on (ta.x,ta.p) and (tb.y,tb.q), we may
use this to compute the cardinality (pretty much as we do today).
But I haven't really worked on this so far, I suspect there are various
subtle issues and I certainly don't plan to address this in the first
phase of the patch.
Tomas
On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
Hi,
attached is a WIP patch implementing multivariate statistics. The code
certainly is not "ready" - parts of it look as if written by a rogue
chimp who got bored of attempts to type the complete works of William
Shakespeare, and decided to try something different.
I'm really glad you're working on this. I had been thinking of looking into
doing this myself.
The last point is really just "unfinished implementation" - the syntax I
propose is this:
ALTER TABLE ... ADD STATISTICS (options) ON (columns)
where the options influence the MCV list and histogram size, etc. The
options are recognized and may give you an idea of what it might do, but
it's not really used at the moment (except for storing in the
pg_mv_statistic catalog).
I've not really gotten around to looking at the patch yet, but I'm also
wondering if it would be simple to allow functional statistics too.
The pg_mv_statistic name seems to indicate multi columns, but how about
stats on date(datetime_column), or perhaps any non-volatile function. This
would help to solve the problem highlighted here
/messages/by-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
. Without giving it too much thought, perhaps any expression that can be
indexed should be allowed to have stats? Would that be really difficult to
implement in comparison to what you've already done with the patch so far?
I'm quite interested in reviewing your work on this, but it appears that
some of your changes are not C89:
src\backend\commands\analyze.c(3774): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
[D:\Postgres\a\postgres.vcxproj]
The compiler I'm using is a bit too stupid to understand the C99 syntax.
I guess you'd need to palloc() these arrays instead in order to comply with
the project standards.
http://www.postgresql.org/docs/devel/static/install-requirements.html
I'm going to sign myself up to review this, so probably my first feedback
would be the compiling problem.
Regards
David Rowley
On 29 October 2014 10:41, David Rowley wrote:
I've not really gotten around to looking at the patch yet, but I'm also
wondering if it would be simple include allowing functional statistics
too.
The pg_mv_statistic name seems to indicate multi columns, but how about
stats on date(datetime_column), or perhaps any non-volatile function. This
would help to solve the problem highlighted here
/messages/by-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
. Without giving it too much thought, perhaps any expression that can be
indexed should be allowed to have stats? Would that be really difficult to
implement in comparison to what you've already done with the patch so far?
I don't know, but it seems mostly orthogonal to what the patch aims to do.
If we add collecting statistics on expressions (on a single column), then I'd
expect it to be reasonably simple to add this to the multi-column case.
There are features like join stats or range type stats, that are probably
more directly related to the patch (but out of scope for the initial
version).
I'm quite interested in reviewing your work on this, but it appears that
some of your changes are not C89:
src\backend\commands\analyze.c(3774): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' : unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
[D:\Postgres\a\postgres.vcxproj]
The compiler I'm using is a bit too stupid to understand the C99 syntax.
I guess you'd need to palloc() these arrays instead in order to comply
with the project standards.
http://www.postgresql.org/docs/devel/static/install-requirements.html
I'm going to sign myself up to review this, so probably my first feedback
would be the compiling problem.
I'll look into that. The thing is I don't have access to MSVC, so it's a bit
difficult to spot / fix those issues :-(
regards
Tomas
On 29/10/14 10:41, David Rowley wrote:
On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz
The last point is really just "unfinished implementation" - the syntax I
propose is this:
ALTER TABLE ... ADD STATISTICS (options) ON (columns)
where the options influence the MCV list and histogram size, etc. The
options are recognized and may give you an idea of what it might do, but
it's not really used at the moment (except for storing in the
pg_mv_statistic catalog).
I've not really gotten around to looking at the patch yet, but I'm also
wondering if it would be simple include allowing functional statistics
too. The pg_mv_statistic name seems to indicate multi columns, but how
about stats on date(datetime_column), or perhaps any non-volatile
function. This would help to solve the problem highlighted here
/messages/by-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
. Without giving it too much thought, perhaps any expression that can be
indexed should be allowed to have stats? Would that be really difficult
to implement in comparison to what you've already done with the patch so
far?
I would not over-complicate requirements for the first version of this,
I think it's already complicated enough.
A quick look at the patch suggests that it mainly needs discussion about
the design and particular implementation choices; there is a fair amount
of TODOs and FIXMEs. I'd like to look at it too, but I doubt I'll have
time to do an in-depth review in this CF.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 29 October 2014 12:31, Petr Jelinek wrote:
On 29/10/14 10:41, David Rowley wrote:
On Mon, Oct 13, 2014 at 11:00 AM, Tomas Vondra <tv@fuzzy.cz
The last point is really just "unfinished implementation" - the
syntax I
propose is this:
ALTER TABLE ... ADD STATISTICS (options) ON (columns)
where the options influence the MCV list and histogram size, etc.
The
options are recognized and may give you an idea of what it might do,
but
it's not really used at the moment (except for storing in the
pg_mv_statistic catalog).
I've not really gotten around to looking at the patch yet, but I'm also
wondering if it would be simple include allowing functional statistics
too. The pg_mv_statistic name seems to indicate multi columns, but how
about stats on date(datetime_column), or perhaps any non-volatile
function. This would help to solve the problem highlighted here
/messages/by-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
. Without giving it too much thought, perhaps any expression that can be
indexed should be allowed to have stats? Would that be really difficult
to implement in comparison to what you've already done with the patch so
far?
I would not over-complicate requirements for the first version of this,
I think it's already complicated enough.
My thoughts, exactly. I'm not willing to put more features into the
initial version of the patch. Actually, I'm thinking about ripping out
some experimental features (particularly "hashed MCV" and "associative
rules").
Quick look at the patch suggests that it mainly needs discussion about
design and particular implementation choices, there is fair amount of
TODOs and FIXMEs. I'd like to look at it too but I doubt that I'll have
time to do in depth review in this CF.
Yes. I think it's a bit premature to discuss the code thoroughly at this
point - I'd like to discuss the general approach to the feature (i.e.
minimizing the impact on those not using it, etc.).
The most interesting part of the code are probably the comments,
explaining the design in more detail, known shortcomings and possible ways
to address them.
regards
Tomas
On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 29 October 2014 12:31, Petr Jelinek wrote:
I've not really gotten around to looking at the patch yet, but I'm also
wondering if it would be simple include allowing functional statistics
too. The pg_mv_statistic name seems to indicate multi columns, but how
about stats on date(datetime_column), or perhaps any non-volatile
function. This would help to solve the problem highlighted here
/messages/by-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
. Without giving it too much thought, perhaps any expression that can be
indexed should be allowed to have stats? Would that be really difficult
to implement in comparison to what you've already done with the patch so
far?
I would not over-complicate requirements for the first version of this,
I think it's already complicated enough.
My thoughts, exactly. I'm not willing to put more features into the
initial version of the patch. Actually, I'm thinking about ripping out
some experimental features (particularly "hashed MCV" and "associative
rules").
That's fair, but I didn't really mean to imply that you should go work on
that too and that it should be part of this patch..
I was thinking more along the lines that I don't really agree with the
table name for the new stats, and that at some later date someone will
want to add expression stats, so we'd probably better come up with a
design that would be friendly towards that. At this time I can only
think that the name of the table might not suit expression stats well;
I'd hate to see someone have to invent a 3rd table to support these when
we could likely come up with something that could be extended later and
still make sense both today and in the future.
I was just looking at how expression indexes are stored in pg_index and I
see that if it's an expression index that the expression is stored in
the indexprs column which is of type pg_node_tree, so quite possibly at
some point in the future the new stats table could just have an extra
column added, and for today, we'd just need to come up with a future proof
name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and
source files something along those lines instead?
Regards
David Rowley
On Thu, Oct 30, 2014 at 12:21 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 29 October 2014 10:41, David Rowley wrote:
I'm quite interested in reviewing your work on this, but it appears that
some of your changes are not C89:
src\backend\commands\analyze.c(3774): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2133: 'indexes' : unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' :unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2133: 'keys' : unknown size
[D:\Postgres\a\postgres.vcxproj]
I'll look into that. The thing is I don't have access to MSVC, so it's a
bit
difficult to spot / fix those issues :-(
It should be a pretty simple fix, just use the files and line numbers from
the above. It's just a problem that in those 3 places you're declaring an
array of a variable size, which is not allowed in C89. The thing to do
instead would just be to palloc() the size you need and then pfree() it
when you're done.
Regards
David Rowley
On 30 October 2014 10:17, David Rowley wrote:
On Thu, Oct 30, 2014 at 12:48 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 29 October 2014 12:31, Petr Jelinek wrote:
I've not really gotten around to looking at the patch yet, but I'm
also
wondering if it would be simple include allowing functional
statistics
too. The pg_mv_statistic name seems to indicate multi columns, but
how
about stats on date(datetime_column), or perhaps any non-volatile
function. This would help to solve the problem highlighted here
/messages/by-id/CAApHDvp2vH=7O-gp-zAf7aWy+A-WHWVg7h3Vc6=5pf9Uf34DhQ@mail.gmail.com
. Without giving it too much thought, perhaps any expression that can
be
indexed should be allowed to have stats? Would that be really
difficult
to implement in comparison to what you've already done with the patch
so
far?
I would not over-complicate requirements for the first version of
this,
I think it's already complicated enough.
My thoughts, exactly. I'm not willing to put more features into the
initial version of the patch. Actually, I'm thinking about ripping out
some experimental features (particularly "hashed MCV" and "associative
rules").
That's fair, but I didn't really mean to imply that you should go work on
that too and that it should be part of this patch..
I was thinking more along the lines of that I don't really agree with the
table name for the new stats and that at some later date someone will want
to add expression stats and we'd probably better come up design that would
be friendly towards that. At this time I can only think that the name of
the table might not suit well to expression stats, I'd hate to see someone
have to invent a 3rd table to support these when we could likely come up
with something that could be extended later and still make sense both
today
and in the future.
I was just looking at how expression indexes are stored in pg_index and I
see that if it's an expression index that the expression is stored in
the indexprs column which is of type pg_node_tree, so quite possibly at
some point in the future the new stats table could just have an extra
column added, and for today, we'd just need to come up with a future proof
name... Perhaps pg_statistic_ext or pg_statisticx, and name functions and
source files something along those lines instead?
Ah, OK. I don't think the catalog name "pg_mv_statistic" is in any way
inappropriate for this purpose, though. IMHO "multivariate" does not
mean "only columns" or "no expressions", it simply describes that the
approximated density function has multiple input variables, be it
attributes or expressions.
But maybe there's a better name.
Tomas
On 30.10.2014 10:23, David Rowley wrote:
On Thu, Oct 30, 2014 at 12:21 AM, Tomas Vondra <tv@fuzzy.cz
<mailto:tv@fuzzy.cz>> wrote:
On 29 October 2014 10:41, David Rowley wrote:
I'm quite interested in reviewing your work on this, but it
appears that
some of your changes are not C89:
src\backend\commands\analyze.c(3774): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(3774): error C2133: 'indexes' :unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4302): error C2133: 'ndistincts' :unknown
size [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2057: expected constant
expression [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2466: cannot allocate an
array of constant size 0 [D:\Postgres\a\postgres.vcxproj]
src\backend\commands\analyze.c(4775): error C2133: 'keys' :unknown size
[D:\Postgres\a\postgres.vcxproj]
I'll look into that. The thing is I don't have access to MSVC, so
it's a bit difficult to spot / fix those issues :-(
It should be a pretty simple fix, just use the files and line
numbers from the above. It's just a problem that in those 3 places
you're declaring an array of a variable size, which is not allowed in
C89. The thing to do instead would just be to palloc() the size you
need and the pfree() it when you're done.
Attached is a patch that should fix these issues.
The bad news is there are a few installcheck failures (and there were in
the previous patch, but I hadn't noticed for some reason). Apparently,
there's some mixup in how the patch handles Var->varno in some cases,
causing issues with a handful of regression tests.
The problem is that is_mv_compatible (checking whether the condition is
compatible with multivariate stats) does this
if (! ((varRelid == 0) || (varRelid == var->varno)))
    return false;

/* Also skip special varno values, and system attributes ... */
if ((IS_SPECIAL_VARNO(var->varno)) ||
    (! AttrNumberIsForUserDefinedAttr(var->varattno)))
    return false;
assuming that after this, varno represents an index into the range
table, and passes it out to the caller.
And the caller (collect_mv_attnums) does this:
RelOptInfo *rel = find_base_rel(root, varno);
which fails with errors like these:
ERROR: no relation entry for relid 0
ERROR: no relation entry for relid 1880
or whatever. What's even stranger is this:
regression=# SELECT table_name, is_updatable, is_insertable_into
regression-# FROM information_schema.views
regression-# WHERE table_name = 'rw_view1';
ERROR: no relation entry for relid 0
regression=# SELECT table_name, is_updatable, is_insertable_into
regression-# FROM information_schema.views
regression-# ;
regression=# SELECT table_name, is_updatable, is_insertable_into
regression-# FROM information_schema.views
regression-# WHERE table_name = 'rw_view1';
table_name | is_updatable | is_insertable_into
------------+--------------+--------------------
(0 rows)
regression=# explain SELECT table_name, is_updatable, is_insertable_into
FROM information_schema.views
WHERE table_name = 'rw_view1';
ERROR: no relation entry for relid 0
So, the query fails. After removing the WHERE clause it works, and this
somehow fixes the original query (with the WHERE clause). Nevertheless,
I still can't do explain on the query.
Clearly, I'm doing something wrong. I suspect it's caused either by
conditions involving function calls, or the fact that the view is a join
of multiple tables. But what?
For simple queries (single table, ...) it seems to be working fine.
regards
Tomas
Attachments:
multivar-stats-v2.patch (text/x-diff)
On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
It however seems to be working sufficiently well at this point, enough
to get some useful feedback. So here we go.
This looks interesting and useful.
What I'd like to check before a detailed review is that this has
sufficient applicability to be useful.
My understanding is that Q9 and Q18 of TPC-H have poor plans as a
result of multi-column stats errors.
Could you look at those queries and confirm that this patch can
produce better plans for them?
If so, I will work with you to review this patch.
One aspect of the patch that seems to be missing is a user declaration
of correlation, just as we have for setting n_distinct. It seems like
an even easier place to start to just let the user specify the stats
declaratively. That way we can split the patch into two parts. First,
allow multi column stats that are user declared. Then add user stats
collected by ANALYZE. The first part is possibly contentious and thus
a good initial focus. The second part will have lots of discussion, so
good to skip for a first version.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 13 November 2014, 12:31, Simon Riggs wrote:
On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
It however seems to be working sufficiently well at this point, enough
to get some useful feedback. So here we go.

This looks interesting and useful.

What I'd like to check before a detailed review is that this has
sufficient applicability to be useful.

My understanding is that Q9 and Q18 of TPC-H have poor plans as a
result of multi-column stats errors.

Could you look at those queries and confirm that this patch can
produce better plans for them?
Sure. I planned to do such verification/demonstration anyway, after
discussing the overall approach.
I planned to give it a try on TPC-DS, but I can start with the TPC-H
queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
come from column correlation though - if it's due to some other issues
(e.g. conditions that are difficult to estimate), this patch can't do
anything with them. But it's a good start.
If so, I will work with you to review this patch.
Thanks!
One aspect of the patch that seems to be missing is a user declaration
of correlation, just as we have for setting n_distinct. It seems like
an even easier place to start to just let the user specify the stats
declaratively. That way we can split the patch into two parts. First,
allow multi column stats that are user declared. Then add user stats
collected by ANALYZE. The first part is possibly contentious and thus
a good initial focus. The second part will have lots of discussion, so
good to skip for a first version.
I'm not a big fan of this approach, for a number of reasons.
Firstly, it only works for "simple" parameters that are trivial to specify
(say, Pearson's correlation coefficient), and the patch does not work with
those at all - it only works with histograms, MCV lists (and might work
with associative rules in the future). And we certainly can't ask users to
specify multivariate histograms - because it's very difficult to do, and
also because complex stats are more susceptible to get stale after adding
new data to the table.
Secondly, even if we add such "simple" parameters to the patch, we have to
come up with a way to apply those parameters to the estimates. The
problem is that as the parameters get simpler, it's less and less useful
to compute the stats.
Another question is whether it should support more than 2 columns ...
The only place where I think this might work are the associative rules.
It's simple to specify rules like ("ZIP code" implies "city") and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).
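The kind of sanity check described above (validate a declared rule against the data, and disable the rule when it is violated too often) could be sketched like this. This is a standalone illustration with made-up names, not code from the patch:

```c
#include <stddef.h>

/* Fraction of rows violating a declared functional dependency
 * (zip -> city): a row violates the rule if some other row has the
 * same zip but a different city. If the returned rate exceeds a
 * chosen threshold, the rule would be 'disabled'. O(n^2) brute force
 * is fine for a check run on a sample. Illustrative sketch only. */
static double
dependency_violation_rate(int n, const int *zip, const int *city)
{
    int i, j, violations = 0;

    for (i = 0; i < n; i++)
    {
        for (j = 0; j < n; j++)
        {
            if (zip[i] == zip[j] && city[i] != city[j])
            {
                violations++;
                break;
            }
        }
    }
    return (double) violations / n;
}
```

With a small tolerance (say a few percent), occasional ZIP/city mismatches of the kind discussed in this thread would not disable the rule, matching the idea that a handful of exceptions doesn't matter for estimation purposes.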
But maybe I got it wrong and you have something particular in mind? Can
you give an example of how it would work?
regards
Tomas
On 13.11.2014 14:11, Tomas Vondra wrote:
[...]

The only place where I think this might work are the associative rules.
It's simple to specify rules like ("ZIP code" implies "city") and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).
and even this simple example has its limits, at least in Germany ZIP
codes are not unique for rural areas, where several villages have the
same ZIP code.
I guess there are just a few examples where columns are completely
functional dependent without any exceptions.
But of course, if the user gives this information just for optimization
the statistics, some exceptions don't matter.
If this information should be used for creating different execution
plans (e.g. on column A is an index and column B is functional
dependent, one could think about using this index on A and the
dependency instead of running through the whole table to find all tuples
that fit the query on column B), exceptions are a very important issue.
But maybe I got it wrong and you have something particular in mind? Can
you give an example of how it would work?

regards
Tomas
--
Dipl.-Math. Katharina Büchse
Friedrich-Schiller-Universität Jena
Institut für Informatik
Lehrstuhl für Datenbanken und Informationssysteme
Ernst-Abbe-Platz 2
07743 Jena
Telefon 03641/946367
Webseite http://users.minet.uni-jena.de/~re89qen/
On 13 November 2014, 16:51, Katharina Büchse wrote:
On 13.11.2014 14:11, Tomas Vondra wrote:
The only place where I think this might work are the associative rules.
It's simple to specify rules like ("ZIP code" implies "city") and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).

and even this simple example has its limits, at least in Germany ZIP
codes are not unique for rural areas, where several villages have the
same ZIP code.

I guess there are just a few examples where columns are completely
functional dependent without any exceptions.
But of course, if the user gives this information just for optimization
the statistics, some exceptions don't matter.
If this information should be used for creating different execution
plans (e.g. on column A is an index and column B is functional
dependent, one could think about using this index on A and the
dependency instead of running through the whole table to find all tuples
that fit the query on column B), exceptions are a very important issue.
Yes, exactly. The aim of this patch is "only" improving estimates, not
removing conditions from the plan (e.g. checking only the ZIP code and not
the city name). That certainly can't be done solely based on approximate
statistics, and as you point out most real-world data either contain bugs
or are inherently imperfect (we have the same kind of ZIP/city
inconsistencies in Czech). That's not a big issue for estimates (assuming
only small fraction of rows violates the rule) though.
Tomas
Tomas Vondra <tv@fuzzy.cz> wrote:
On 13 November 2014, 16:51, Katharina Büchse wrote:
On 13.11.2014 14:11, Tomas Vondra wrote:
The only place where I think this might work are the associative rules.
It's simple to specify rules like ("ZIP code" implies "city") and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).

and even this simple example has its limits, at least in Germany ZIP
codes are not unique for rural areas, where several villages have the
same ZIP code.
as you point out most real-world data either contain bugs
or are inherently imperfect (we have the same kind of ZIP/city
inconsistencies in Czech).
You can have lots of fun with U.S. zip code, too. Just on the
nominally "Madison, Wisconsin" zip codes (those starting with 537),
there are several exceptions:
select zipcode, city, locationtype
from zipcode
where zipcode like '537%'
and Decommisioned = 'false'
and zipcodetype = 'STANDARD'
and locationtype in ('PRIMARY', 'ACCEPTABLE')
order by zipcode, city;
zipcode | city | locationtype
---------+-----------+--------------
53703 | MADISON | PRIMARY
53704 | MADISON | PRIMARY
53705 | MADISON | PRIMARY
53706 | MADISON | PRIMARY
53711 | FITCHBURG | ACCEPTABLE
53711 | MADISON | PRIMARY
53713 | FITCHBURG | ACCEPTABLE
53713 | MADISON | PRIMARY
53713 | MONONA | ACCEPTABLE
53714 | MADISON | PRIMARY
53714 | MONONA | ACCEPTABLE
53715 | MADISON | PRIMARY
53716 | MADISON | PRIMARY
53716 | MONONA | ACCEPTABLE
53717 | MADISON | PRIMARY
53718 | MADISON | PRIMARY
53719 | FITCHBURG | ACCEPTABLE
53719 | MADISON | PRIMARY
53725 | MADISON | PRIMARY
53726 | MADISON | PRIMARY
53744 | MADISON | PRIMARY
(21 rows)
If you eliminate the quals besides the zipcode column you get 61
rows and it gets much stranger, with legal municipalities that are
completely surrounded by Madison that the postal service would
rather you didn't use in addressing your envelopes, but they have
to deliver to anyway, and organizations inside Madison receiving
enough mail to (literally) have their own zip code -- where the
postal service allows the organization name as a deliverable
"city".
If you want to have your own fun with this data, you can download
it here:
http://federalgovernmentzipcodes.us/free-zipcode-database.csv
I was able to load it into PostgreSQL with this:
create table zipcode
(
recordnumber integer not null,
zipcode text not null,
zipcodetype text not null,
city text not null,
state text not null,
locationtype text not null,
lat double precision,
long double precision,
xaxis double precision not null,
yaxis double precision not null,
zaxis double precision not null,
worldregion text not null,
country text not null,
locationtext text,
location text,
decommisioned text not null,
taxreturnsfiled bigint,
estimatedpopulation bigint,
totalwages bigint,
notes text
);
comment on column zipcode.zipcode is 'Zipcode or military postal code(FPO/APO)';
comment on column zipcode.zipcodetype is 'Standard, PO BOX Only, Unique, Military(implies APO or FPO)';
comment on column zipcode.city is 'offical city name(s)';
comment on column zipcode.state is 'offical state, territory, or quasi-state (AA, AE, AP) abbreviation code';
comment on column zipcode.locationtype is 'Primary, Acceptable,Not Acceptable';
comment on column zipcode.lat is 'Decimal Latitude, if available';
comment on column zipcode.long is 'Decimal Longitude, if available';
comment on column zipcode.location is 'Standard Display (eg Phoenix, AZ ; Pago Pago, AS ; Melbourne, AU )';
comment on column zipcode.decommisioned is 'If Primary location, Yes implies historical Zipcode, No Implies current Zipcode; If not Primary, Yes implies Historical Placename';
comment on column zipcode.taxreturnsfiled is 'Number of Individual Tax Returns Filed in 2008';
copy zipcode from 'filepath' with (format csv, header);
alter table zipcode add primary key (recordnumber);
create unique index zipcode_city on zipcode (zipcode, city);
I bet there are all sorts of correlation possibilities with, for
example, latitude and longitude and other variables. With 81831
rows and so many correlations among the columns, it might be a
useful data set to test with.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 15.11.2014 18:49, Kevin Grittner wrote:
If you eliminate the quals besides the zipcode column you get 61
rows and it gets much stranger, with legal municipalities that are
completely surrounded by Madison that the postal service would
rather you didn't use in addressing your envelopes, but they have
to deliver to anyway, and organizations inside Madison receiving
enough mail to (literally) have their own zip code -- where the
postal service allows the organization name as a deliverable
"city".If you want to have your own fun with this data, you can download
it here:http://federalgovernmentzipcodes.us/free-zipcode-database.csv
...
I bet there are all sorts of correlation possibilities with, for
example, latitude and longitude and other variables. With 81831
rows and so many correlations among the columns, it might be a
useful data set to test with.
Thanks for the link. I've been looking for a good dataset with such
data, and this one is by far the best one.
The current version of the patch supports only data types passed by
value (i.e. no varlena types - text, ...), which means it's impossible to
build multivariate stats on some of the interesting columns (state,
city, ...).
I guess it's time to start working on removing this limitation.
Tomas
On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
Thanks for the link. I've been looking for a good dataset with such
data, and this one is by far the best one.

The current version of the patch supports only data types passed by
value (i.e. no varlena types - text, ...), which means it's impossible to
build multivariate stats on some of the interesting columns (state,
city, ...).

I guess it's time to start working on removing this limitation.
Tomas, what's your status on this patch? Are you planning to make it
more complicated than it is? For now I have switched it to a "Needs
Review" state because even your first version did not get advanced
review (that's quite big btw). I guess that we should switch it to the
next CF.
Regards,
--
Michael
On 8.12.2014 02:01, Michael Paquier wrote:
On Sun, Nov 16, 2014 at 3:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
Thanks for the link. I've been looking for a good dataset with such
data, and this one is by far the best one.

The current version of the patch supports only data types passed by
value (i.e. no varlena types - text, ...), which means it's impossible to
build multivariate stats on some of the interesting columns (state,
city, ...).

I guess it's time to start working on removing this limitation.
Tomas, what's your status on this patch? Are you planning to make it
more complicated than it is? For now I have switched it to a "Needs
Review" state because even your first version did not get advanced
review (that's quite big btw). I guess that we should switch it to the
next CF.
Hello Michael,
I agree with moving the patch to the next CF - I'm working on the patch,
but I will take a bit more time to submit a new version and I can do
that in the next CF.
regards
Tomas
On 10/13/2014 01:00 AM, Tomas Vondra wrote:
Hi,
attached is a WIP patch implementing multivariate statistics.
Great! Really glad to see you working on this.
+ * FIXME This sample sizing is mostly OK when computing stats for
+ * individual columns, but when computing multi-variate stats
+ * for multivariate stats (histograms, mcv, ...) it's rather
+ * insufficient. For small number of dimensions it works, but
+ * for complex stats it'd be nice use sample proportional to
+ * the table (say, 0.5% - 1%) instead of a fixed size.
I don't think a fraction of the table is appropriate. As long as the
sample is random, the accuracy of a sample doesn't depend much on the
size of the population. For example, if you sample 1,000 rows from a
table with 100,000 rows, or 1,000 rows from a table with 100,000,000
rows, the accuracy is pretty much the same. That doesn't change when you
go from a single variable to multiple variables.
You do need a bigger sample with multiple variables, however. My gut
feeling is that if you sample N rows for a single variable, with two
variables you need to sample N^2 rows to get the same accuracy. But it's
not proportional to the table size. (I have no proof for that, but I'm
sure there is literature on this.)
+ * Multivariate histograms
+ *
+ * Histograms are a collection of buckets, represented by n-dimensional
+ * rectangles. Each rectangle is delimited by an array of lower and
+ * upper boundaries, so that for for the i-th attribute
+ *
+ *     min[i] <= value[i] <= max[i]
+ *
+ * Each bucket tracks frequency (fraction of tuples it contains),
+ * information about the inequalities, number of distinct values in
+ * each dimension (which is used when building the histogram) etc.
+ *
+ * The boundaries may be either inclusive or exclusive, or the whole
+ * dimension may be NULL.
+ *
+ * The buckets may overlap (assuming the build algorithm keeps the
+ * frequencies additive) or may not cover the whole space (i.e. allow
+ * gaps). This entirely depends on the algorithm used to build the
+ * histogram.
That sounds pretty exotic. These buckets are quite different from the
single-dimension buckets we currently have.
The paper you reference in partition_bucket() function, M.
Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating
Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference
1988: 28-36, actually doesn't mention overlapping buckets at all. I
haven't read the code in detail, but if it implements the algorithm from
that paper, there will be no overlap.
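For concreteness, a bucket of the kind the quoted comment describes might be sketched as follows. The struct and function names are hypothetical, not the patch's actual definitions, and the sketch keeps only inclusive bounds, omitting the exclusive-bound and NULL-dimension cases:

```c
#include <stdbool.h>

/* Hypothetical n-dimensional histogram bucket: an axis-aligned
 * rectangle with per-dimension lower/upper bounds plus the fraction
 * of sampled tuples it holds. */
typedef struct MVBucketSketch
{
    int           ndimensions;
    double        frequency;  /* fraction of tuples in this bucket */
    const double *min;        /* lower bound for each dimension */
    const double *max;        /* upper bound for each dimension */
} MVBucketSketch;

/* true iff min[i] <= value[i] <= max[i] for every dimension */
static bool
bucket_contains(const MVBucketSketch *bucket, const double *value)
{
    int i;

    for (i = 0; i < bucket->ndimensions; i++)
    {
        if (value[i] < bucket->min[i] || value[i] > bucket->max[i])
            return false;
    }
    return true;
}
```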
- Heikki