Cross-column statistics revisited
I've been interested in what it would take to start tracking
cross-column statistics. A review of the mailing lists as linked from
the TODO item on the subject [1]"Allow accurate statistics to be collected on indexes with more than one column or expression indexes, perhaps using per-index statistics", http://wiki.postgresql.org/wiki/Todo suggests the following concerns:
1) What information exactly would be tracked?
2) How would it be kept from exploding in size?
3) For which combinations of columns would statistics be kept?
The major concern in #1 seemed to be that the most suitable form for
keeping most common value lists, histograms, etc. is in an array, and
at the time of the posts I read, arrays of composite types weren't
possible. This seems much less of a concern now -- perhaps in greatest
part because a test I just did against a recent 8.4devel sure makes it
look like stats on composite type columns aren't even kept. The most
straightforward approach would be to keep a simple multi-dimensional
histogram, but that leads to a discussion of #2.
#2 is an interesting problem, and possibly the most difficult from a
theoretical perspective. Provided a fair expectation that the results
would be useful, I'd like to play with various forms of statistical
regressions to compress large cross-column histograms. But I'm not
sure I want to spend the time if there are other serious issues with
the whole idea of cross-column stats. Another possibility involves not
storing a histogram at all, but instead storing an array of quantiles
for each column. In other words, if S represents the statistics target
of a column, we'd have an array A of S+1 entries of the type in
question, where the beginning and end of the array are the minimum and
maximum values in the column, and the range between any two
consecutive array elements A[n] and A[n+1] contains 1/S of the total
entries in the column. For a uniformly distributed column, these array
elements would be evenly spaced across the total range of the column;
the difference between consecutive array elements would indicate
deviations from this distribution. Unfortunately this only works if
there's a valid distance metric for the data type in question.
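A minimal Python sketch of the quantile-array idea above (illustrative only; `quantile_array` is a made-up name, not proposed code):

```python
def quantile_array(values, s):
    """Build an array A of s+1 boundaries: A[0] is the column minimum,
    A[s] is the maximum, and each range A[n]..A[n+1] holds roughly
    1/s of the column's entries."""
    ordered = sorted(values)
    n = len(ordered)
    # Pick s+1 evenly spaced order statistics from the sorted sample.
    return [ordered[min(round(k * (n - 1) / s), n - 1)] for k in range(s + 1)]

# For a uniformly distributed column the boundaries come out evenly
# spaced; bunched-up boundaries reveal where the data is dense.
print(quantile_array(list(range(100)), 4))
```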
In the archives, #3 was raised frequently as a concern -- obviously
tracking stats on all permutations of the columns in a database is a
non-starter. However there seemed to be little argument over sensible
defaults, namely that columns involved in a multi-column index would
have cross-column statistics kept automatically, and the user would be
allowed to instruct the server to keep stats on other arbitrary groups
of columns. There was some discussion about the fact that the user
would have to be smart about how (s)he used this feature, but there
are enough other potential foot guns in the system of similar
magnitude that worrying too much about that seems of little use.
I bring all this up because I'm interested in looking at how this
might be done, and perhaps doing it if 1) time permits, and 2) someone
else doesn't beat me to it. Comments, anyone? Am I missing something
obvious/serious/etc.? Thanks in advance.
- Josh / eggyknap
[1]: "Allow accurate statistics to be collected on indexes with more than one column or expression indexes, perhaps using per-index statistics", http://wiki.postgresql.org/wiki/Todo
than one column or expression indexes, perhaps using per-index
statistics", http://wiki.postgresql.org/wiki/Todo
"Joshua Tolley" <eggyknap@gmail.com> writes:
I've been interested in what it would take to start tracking
cross-column statistics. A review of the mailing lists as linked from
the TODO item on the subject [1] suggests the following concerns:
1) What information exactly would be tracked?
2) How would it be kept from exploding in size?
3) For which combinations of columns would statistics be kept?
I think then you have
4) How would we form estimates from these stats
The major concern in #1 seemed to be that the most suitable form for
keeping most common value lists, histograms, etc. is in an array, and
at the time of the posts I read, arrays of composite types weren't
possible. This seems much less of a concern now -- perhaps in greatest
part because a test I just did against a recent 8.4devel sure makes it
look like stats on composite type columns aren't even kept. The most
straightforward approach would be to keep a simple multi-dimensional
histogram, but that leads to a discussion of #2.
"multi-dimensional histogram" isn't such a simple concept, at least not to me.
Histograms aren't a bar chart of equal widths and various heights like I was
taught in school. They're actually bars of various widths arranged such that
they all have the same height.
It's not clear how to extend that concept into two dimensions. I imagine
there's research on this though. What do the GIST statistics functions store?
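One way to see the distinction is to sketch both notions in Python (illustrative only; both function names are made up):

```python
def equal_width_bins(values, k):
    """Textbook histogram: k bins of equal width, varying counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        counts[min(int((v - lo) / width), k - 1)] += 1
    return counts

def equal_height_bounds(values, k):
    """Planner-style histogram: k bins of varying width, each holding
    roughly len(values)/k entries; only the bounds need storing."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * (n - 1)) // k] for i in range(k + 1)]

data = [1, 1, 1, 2, 2, 3, 10, 50, 100, 1000]
print(equal_width_bins(data, 4))     # counts pile up in the first bin
print(equal_height_bounds(data, 4))  # bounds crowd where data is dense
```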
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
On Wed, Oct 15, 2008 at 7:51 AM, Gregory Stark <stark@enterprisedb.com> wrote:
"Joshua Tolley" <eggyknap@gmail.com> writes:
I've been interested in what it would take to start tracking
cross-column statistics. A review of the mailing lists as linked from
the TODO item on the subject [1] suggests the following concerns:
1) What information exactly would be tracked?
2) How would it be kept from exploding in size?
3) For which combinations of columns would statistics be kept?
I think then you have
4) How would we form estimates from these stats
That too, but see below.
The major concern in #1 seemed to be that the most suitable form for
keeping most common value lists, histograms, etc. is in an array, and
at the time of the posts I read, arrays of composite types weren't
possible. This seems much less of a concern now -- perhaps in greatest
part because a test I just did against a recent 8.4devel sure makes it
look like stats on composite type columns aren't even kept. The most
straightforward approach would be to keep a simple multi-dimensional
histogram, but that leads to a discussion of #2.
"multi-dimensional histogram" isn't such a simple concept, at least not to me.
Histograms aren't a bar chart of equal widths and various heights like I was
taught in school. They're actually bars of various widths arranged such that
they all have the same height.
That depends on whose definition you've got in mind. Wikipedia's
version (for whatever that's worth) defines it as variable width
columns of not-necessarily-uniform heights, where the area of each bar
is the frequency. The default behavior of the hist() function in the R
statistics package is uniform bar widths, which generates complaints
from purists. Semantics aside, it looks like pgsql's stats stuff
(which I realize I'll need to get to know lots better to undertake
something like this) uses a list of quantiles, which still only makes
sense, as I see it, when a distance measurement is available for the
data type. The multidimensional case might do better with a frequency
count, where we'd store a matrix/cube/etc. with uniformly-sized units,
probably compressed via some regression function. That said, a
multidimensional quantile list would also be possible, compressed
similarly.
Provided a frequency chart or quantile set, it seems estimates can be
made just as they are in one dimension, by finding the right value in
the quantile or frequency matrix.
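A toy Python sketch of that lookup, assuming a hypothetical two-column frequency matrix of per-cell row counts:

```python
def selectivity(freq, a_bucket, b_bucket):
    """Estimate the fraction of rows falling in one cell of a
    two-column frequency matrix (rows bucketed on both columns)."""
    total = sum(sum(row) for row in freq)
    return freq[a_bucket][b_bucket] / total

# Hypothetical 3x3 frequency matrix for columns (a, b): the mass on
# the diagonal is exactly what independent per-column stats miss.
freq = [[50, 0, 0],
        [0, 30, 0],
        [0, 0, 20]]
print(selectivity(freq, 0, 0))  # half the table sits in that one cell
```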
It's not clear how to extend that concept into two dimensions. I imagine
there's research on this though. What do the GIST statistics functions store?
No idea.
- Josh / eggyknap
On Wed, Oct 15, 2008 at 04:53:10AM -0600, Joshua Tolley wrote:
I've been interested in what it would take to start tracking
cross-column statistics. A review of the mailing lists as linked from
the TODO item on the subject [1] suggests the following concerns:
1) What information exactly would be tracked?
2) How would it be kept from exploding in size?
3) For which combinations of columns would statistics be kept?
I think you need to take a step back: how are you going to use this data?
Whatever structure you choose, the eventual goal is to take a description
of the columns (a,b) and a clause like 'a < 5' and be able to
generate an estimate of the distribution of b.
Secondly, people aren't going to ask for multi-column stats on columns
that aren't correlated in some way. So you need to work out what kinds
of correlation people are interested in and see how you can store them.
One potential use case is the (startdate,enddate) pair of columns. Here what
you want is to somehow detect that the distribution of (enddate-startdate)
is constant.
I think the real question is: what other kinds of correlation might
people be interested in representing?
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout <kleptog@svana.org> writes:
I think you need to go a step back: how are you going to use this data?
The fundamental issue as the planner sees it is not having to assume
independence of WHERE clauses. For instance, given
WHERE a < 5 AND b > 10
our current approach is to estimate the fraction of rows with a < 5
(using stats for a), likewise estimate the fraction with b > 10
(using stats for b), and then multiply these fractions together.
This is correct if a and b are independent, but can be very bad if
they aren't. So if we had joint statistics on a and b, we'd want to
somehow match that up to clauses for a and b and properly derive
the joint probability.
(I'm not certain of how to do that efficiently, even if we had the
right stats :-()
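A toy Python sketch of the problem described here, using made-up data where a and b are perfectly dependent:

```python
# b == a for every row: total dependence between the two columns.
rows = [(i, i) for i in range(100)]

frac_a = sum(1 for a, b in rows if a < 5) / len(rows)    # 0.05
frac_b = sum(1 for a, b in rows if b > 10) / len(rows)   # 0.89
independent_est = frac_a * frac_b                        # ~0.0445

# The true joint selectivity: a < 5 AND a > 10 is impossible.
actual = sum(1 for a, b in rows if a < 5 and b > 10) / len(rows)

print(independent_est, actual)  # the product badly overestimates here
```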
regards, tom lane
[sorry for top posting - damn phone]
It's pretty straightforward to do a chi-squared test on all the pairs.
But that tells you that the product is more likely to be wrong. It
doesn't tell you whether it's going to be too high or too low...
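A sketch of such a pairwise chi-squared independence test in Python (illustrative; a real implementation would also need degrees of freedom and a significance threshold):

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a 2-D contingency table of
    bucket counts. Large values suggest the columns aren't independent;
    the statistic alone says nothing about the direction of the error."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # independence model
            stat += (observed - expected) ** 2 / expected
    return stat

print(chi_squared([[25, 25], [25, 25]]))  # independent: statistic 0
print(chi_squared([[50, 0], [0, 50]]))    # dependent: large statistic
```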
greg
On 16 Oct 2008, at 07:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think the real question is: what other kinds of correlation might
people be interested in representing?
Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?
I suspect that a lot of the correlations people care about are
extreme. For example, it's fairly common for me to have a table where
column B is only used at all for certain values of column A. Like,
atm_machine_id is usually or always NULL unless transaction_type is
ATM, or something. So a clause of the form transaction_type = 'ATM'
and atm_machine_id < 10000 looks more selective than it really is
(because the first half is redundant).
The other half of this is that bad selectivity estimates only matter
if they're bad enough to change the plan, and I'm not sure whether
cases like this are actually a problem in practice.
...Robert
On Thu, Oct 16, 2008 at 01:34:59PM -0400, Robert Haas wrote:
I suspect that a lot of the correlations people care about are
extreme. For example, it's fairly common for me to have a table where
column B is only used at all for certain values of column A. Like,
atm_machine_id is usually or always NULL unless transaction_type is
ATM, or something. So a clause of the form transaction_type = 'ATM'
and atm_machine_id < 10000 looks more selective than it really is
(because the first half is redundant).
That case is easily done by simply considering the indexed values of a
column with a partial index. This should be fairly easy to do I think.
It might be worthwhile for someone to trawl through the archives looking
for examples where we estimate the correlation wrong.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
Robert Haas wrote:
I think the real question is: what other kinds of correlation might
people be interested in representing?
Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?
The one that affects our largest tables are ones where we
have an address (or other geo-data) clustered by zip, but
with other columns (city, county, state, school-zone, police
beat, etc) used in queries.
Postgres considers those unclustered (correlation 0 in the stats),
despite all rows for a given value residing on the same few pages.
I could imagine that this could be handled by either some cross-column
correlation (each zip has only 1-2 cities); or by an enhanced
single-column statistic (even though cities aren't sorted alphabetically,
all rows on a page tend to refer to the same city).
Tom,
(I'm not certain of how to do that efficiently, even if we had the
right stats :-()
I was actually talking to someone about this at pgWest. Apparently there's
a fair amount of academic algorithms devoted to this topic. Josh, do you
remember who was talking about this?
--
--Josh
Josh Berkus
PostgreSQL
San Francisco
On Thu, Oct 16, 2008 at 2:54 PM, Josh Berkus <josh@agliodbs.com> wrote:
Tom,
(I'm not certain of how to do that efficiently, even if we had the
right stats :-()
I was actually talking to someone about this at pgWest. Apparently there's
a fair amount of academic algorithms devoted to this topic. Josh, do you
remember who was talking about this?
Actually, it was me :) My biggest concern at the time was finding ways
to compress the correlation data, on the apparently fairly tenuous
assumption that we'd somehow be able to make use of it. As it turns
out, the particular set of algorithms I had in mind don't compress
anything, but there are other methods that do.
Most of the comments on this thread have centered around the questions
of "what we'd store" and "how we'd use it", which might be better
phrased as, "The database assumes columns are independent, but we know
that's not always true. Does this cause enough problems to make it
worth fixing? How might we fix it?" I have to admit an inability to
show that it causes problems, though Neil Conway pointed me to some
literature [1] (http://www.cs.umd.edu/~amol/papers/paper-dep.pdf) on the
subject I've not yet had time to go through. My
basic idea is that if we have a cross-column frequency count, it will
help us get better plans, but I don't know the internals enough to
have details further than that.
- Josh / eggyknap
Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?
Well, we have two different correlation problems. One is the problem of
dependent correlation, such as the 1.0 correlation of ZIP and CITY fields
as a common problem. This could in fact be fixed, I believe, via a linear
math calculation based on the sampled level of correlation, assuming we
have enough samples. And it's really only an issue if the correlation is
> 0.5.
The second type of correlation issue we have is correlating values in a
parent table with *rows* in child table (i.e. FK joins). Currently, the
planner assumes that all rows in the child table are evenly distributed
against keys in the parent table. But many real-world databases have this
kind of problem:
A        B
1        10000 rows
2        10000 rows
3        1000 rows
4..1000  0 to 1 rows
For queries which cover values between 4..1000 on A, the misestimate won't
be much of a real execution problem. But for values 1,2,3, the query will
bomb.
The other half of this is that bad selectivity estimates only matter
if they're bad enough to change the plan, and I'm not sure whether
cases like this are actually a problem in practice.
My experience is that any estimate which is more than 5x wrong (i.e. < .2
or > 5.0) usually causes problems, and 3x sometimes causes problems.
--
--Josh
Josh Berkus
PostgreSQL
San Francisco
Correlation is the wrong tool. In fact zip codes and city have nearly
zero correlation. Zip codes near 00000 are no more likely to be in
cities starting with A than Z.
Even if you use an appropriate tool I'm not clear what to do with the
information. Consider the case of WHERE city='boston' and zip='02139'
and another query with WHERE city='boston' and zip='90210'. One will
produce many more records than the separate histograms would predict
and the other would produce zero. How do you determine which category
a given pair of constants falls into?
Separately you mention cross-table stats - but that's a whole other
kettle of worms. I'm not sure which is easier but let's do one at a
time?
greg
On 17 Oct 2008, at 12:12 AM, Josh Berkus <josh@agliodbs.com> wrote:
Josh Berkus wrote:
Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?
Well, we have two different correlation problems. One is the problem of
dependent correlation, such as the 1.0 correlation of ZIP and CITY fields
as a common problem. This could in fact be fixed, I believe, via a linear
math calculation based on the sampled level of correlation, assuming we
have enough samples. And it's really only an issue if the correlation is > 0.5.
I'd note that this can be an issue even without 2 columns involved.
I've seen a number of tables where the data is loaded in batches
so similar-values from a batch tend to be packed into relatively few pages.
Think of a database for a retailer that nightly aggregates data from
each of many stores. Each incoming batch inserts the store's data
into tightly packed disk pages where most all rows on the page are for
that store. But those pages are interspersed with pages from other
stores.
I think I like the ideas Greg Stark had a couple years ago:
http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php
"...sort the sampled values by value
and count up the average number of distinct blocks per value.... Or
perhaps we need a second histogram where the quantities are of
distinct pages rather than total records.... We might also need a
separate "average number of n-block spans per value"
since those seem to me to lead more directly to values like "blocks
that need to be read".
This is yet another issue entirely. This is about estimating how much
I/O will be random I/O if we do an index-order scan. Correlation is a
passable tool for this but we might be able to do better.
But it has nothing to do with the cross-column stats problem.
greg
On 17 Oct 2008, at 01:29 AM, Ron Mayer <rm_pg@cheapcomplexdevices.com>
wrote:
"Joshua Tolley" <eggyknap@gmail.com> writes:
Most of the comments on this thread have centered around the questions
of "what we'd store" and "how we'd use it", which might be better
phrased as, "The database assumes columns are independent, but we know
that's not always true. Does this cause enough problems to make it
worth fixing? How might we fix it?" I have to admit an inability to
show that it causes problems,
Any small amount of trolling in our archives will turn up plenty of
examples.
It appears to me that a lot of people in this thread are confusing
correlation in the sense of statistical correlation between two
variables with correlation in the sense of how well physically-ordered
a column is. (The latter is actually the same kind of animal, but
always taking one of the two variables to be physical position.)
A bad estimate for physical-position correlation has only limited
impact, as Josh B said upthread; but the other case leads to very
bad rowcount estimates which have *huge* impact on plan choices.
regards, tom lane
On Thu, Oct 16, 2008 at 6:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It appears to me that a lot of people in this thread are confusing
correlation in the sense of statistical correlation between two
variables with correlation in the sense of how well physically-ordered
a column is.
For what it's worth, neither version of correlation was what I had in
mind. Statistical correlation between two variables is a single
number, is fairly easy to calculate, and probably wouldn't help query
plans much at all. I'm more interested in a more complex data
gathering. The data model I have in mind (which I note I have *not*
proven to actually help a large number of query plans -- that's
obviously an important part of what I'd need to do in all this)
involves instead a matrix of frequency counts. Right now our
"histogram" values are really quantiles; the statistics_target T for a
column determines a number of quantiles we'll keep track of, and we
grab values from it into an ordered list L so that approximately 1/T of
the entries in that column fall between values L[n] and L[n+1]. I'm
thinking that multicolumn statistics would instead divide the range of
each column up into T equally sized segments, to form in the
two-column case a matrix, where the values of the matrix are frequency
counts -- the number of rows whose values for each column fall within
the particular segments of their respective ranges represented by the
boundaries of that cell in the matrix. I just realized while writing
this that this might not extend to situations where the two columns
are from different tables and don't necessarily have the same row
count, but I'll have to think about that.
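A toy Python sketch of the frequency-count matrix just described (illustrative; `frequency_matrix` is not proposed code):

```python
def frequency_matrix(pairs, t):
    """Divide each column's range into t equally sized segments and
    count the rows whose values fall in each (segment_a, segment_b)
    cell, as described above."""
    a_vals = [a for a, _ in pairs]
    b_vals = [b for _, b in pairs]
    a_lo, a_hi = min(a_vals), max(a_vals)
    b_lo, b_hi = min(b_vals), max(b_vals)

    def seg(v, lo, hi):
        if hi == lo:
            return 0
        return min(int((v - lo) * t / (hi - lo)), t - 1)

    m = [[0] * t for _ in range(t)]
    for a, b in pairs:
        m[seg(a, a_lo, a_hi)][seg(b, b_lo, b_hi)] += 1
    return m

# Perfectly correlated columns put all the mass on the diagonal,
# which is exactly the structure independent stats can't see.
m = frequency_matrix([(i, i) for i in range(100)], 4)
print(m)
```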
Anyway, the size of such a matrix is T^n for n columns -- exponential in
the number of columns involved -- so cross-column statistics involving
just a few columns could easily involve millions of values for fairly
normal statistics_targets.
That's where the compression ideas come in to play. This would
obviously need a fair bit of testing, but it's certainly conceivable
that modern regression techniques could reduce that frequency matrix
to a set of functions with a small number of parameters. Whether that
would result in values the planner can look up for a given set of
columns without spending more time than it's worth is another question
that will need exploring.
I started this thread knowing that past discussions have posed the
following questions:
1. What sorts of cross-column data can we really use?
2. Can we collect that information?
3. How do we know what columns to track?
For what it's worth, my original question was whether anyone had
concerns beyond these, and I think that has been fairly well answered
in this thread.
- Josh / eggyknap
"Joshua Tolley" <eggyknap@gmail.com> writes:
For what it's worth, neither version of correlation was what I had in
mind. Statistical correlation between two variables is a single
number, is fairly easy to calculate, and probably wouldn't help query
plans much at all. I'm more interested in a more complex data
gathering. The data model I have in mind (which I note I have *not*
proven to actually help a large number of query plans -- that's
obviously an important part of what I'd need to do in all this)
involves instead a matrix of frequency counts.
Oh, I see the distinction you're trying to draw. Agreed on both points:
a simple correlation number is pretty useless to us, and we don't have
hard evidence that a histogram-matrix will solve the problem. However,
we do know that one-dimensional histograms of the sort currently
collected by ANALYZE work pretty well (if they're detailed enough).
It seems reasonable to me to extrapolate that same concept to two or
more dimensions. The issue then becomes that a "detailed enough"
matrix might be impracticably bulky, so you need some kind of lossy
compression, and we don't have hard evidence about how well that will
work. Nonetheless the road map seems clear enough to me.
Right now our
"histogram" values are really quantiles; the statistics_target T for a
column determines a number of quantiles we'll keep track of, and we
grab values from it into an ordered list L so that approximately 1/T of
the entries in that column fall between values L[n] and L[n+1]. I'm
thinking that multicolumn statistics would instead divide the range of
each column up into T equally sized segments,
Why would you not use the same histogram bin bounds derived for the
scalar stats (along each axis of the matrix, of course)? This seems to
me to be arbitrarily replacing something proven to work with something
not proven. Also, the above forces you to invent a concept of "equally
sized" ranges, which is going to be pretty bogus for a lot of datatypes.
regards, tom lane
On Thu, Oct 16, 2008 at 8:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Joshua Tolley" <eggyknap@gmail.com> writes:
For what it's worth, neither version of correlation was what I had in
mind. Statistical correlation between two variables is a single
number, is fairly easy to calculate, and probably wouldn't help query
plans much at all. I'm more interested in a more complex data
gathering. The data model I have in mind (which I note I have *not*
proven to actually help a large number of query plans -- that's
obviously an important part of what I'd need to do in all this)
involves instead a matrix of frequency counts.
Oh, I see the distinction you're trying to draw. Agreed on both points:
a simple correlation number is pretty useless to us, and we don't have
hard evidence that a histogram-matrix will solve the problem. However,
we do know that one-dimensional histograms of the sort currently
collected by ANALYZE work pretty well (if they're detailed enough).
It seems reasonable to me to extrapolate that same concept to two or
more dimensions. The issue then becomes that a "detailed enough"
matrix might be impracticably bulky, so you need some kind of lossy
compression, and we don't have hard evidence about how well that will
work. Nonetheless the road map seems clear enough to me.
Oh goodie -- I feel so validated :)
Right now our
"histogram" values are really quantiles; the statistics_target T for a
column determines a number of quantiles we'll keep track of, and we
grab values from it into an ordered list L so that approximately 1/T of
the entries in that column fall between values L[n] and L[n+1]. I'm
thinking that multicolumn statistics would instead divide the range of
each column up into T equally sized segments,
Why would you not use the same histogram bin bounds derived for the
scalar stats (along each axis of the matrix, of course)? This seems to
me to be arbitrarily replacing something proven to work with something
not proven. Also, the above forces you to invent a concept of "equally
sized" ranges, which is going to be pretty bogus for a lot of datatypes.
Because I'm trying to picture geometrically how this might work for
the two-column case, and hoping to extend that to more dimensions, and
am finding that picturing a quantile-based system like the one we have
now in multiple dimensions is difficult. I believe those are the same
difficulties Gregory Stark mentioned having in his first post in this
thread. But of course that's an excellent point, that what we do now
is proven. I'm not sure which problem will be harder to solve -- the
weird geometry or the "equally sized ranges" for data types where that
makes no sense.
One option that came to mind recently would be to have statistics for
a subset of the rows in a single column A, where that subset is
defined by conditions on some other column B with a defined
relationship to A (for instance, that they're in the same table).
There are all kinds of complexities possible with that idea, but one
bright spot is that the values in the catalog and the way the planner
makes use of them would stay essentially the same as they are now.
That one's interesting to think about, but probably too complex for
real use. Anyway, I'll keep pondering.
- Josh / eggyknap
On Thu, Oct 16, 2008 at 09:17:03PM -0600, Joshua Tolley wrote:
Because I'm trying to picture geometrically how this might work for
the two-column case, and hoping to extend that to more dimensions, and
am finding that picturing a quantile-based system like the one we have
now in multiple dimensions is difficult.
Just a note: using multidimensional histograms will work well for
cases like (startdate,enddate) where the histogram will show a
clustering of values along the diagonal. But it will fail for the case
(zipcode,state) where one implies the other. Histogram-wise you're not
going to see any correlation at all but what you want to know is:
count(distinct zipcode,state) = count(distinct zipcode)
So you might need to think about storing/searching for different kinds
of correlation.
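A toy Python sketch of that functional-dependency check (illustrative):

```python
def implies(pairs):
    """True if the first column functionally determines the second,
    i.e. count(distinct (a, b)) == count(distinct a)."""
    return len(set(pairs)) == len({a for a, _ in pairs})

# Each zip code maps to exactly one state: zip implies state,
# even though a histogram would show no diagonal clustering at all.
zip_state = [("02139", "MA"), ("02139", "MA"), ("90210", "CA")]
print(implies(zip_state))              # True
print(implies([("x", 1), ("x", 2)]))   # False: "x" maps to two values
```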
Secondly, my feeling about multidimensional histograms is that you're
not going to need the matrix to have 100 bins along each axis, but that
it'll be enough to have 1000 bins total. The cases where we get it
wrong enough for people to notice will probably be the same cases where
the histogram will have noticeable variation even for a small number of
bins.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.