proposal : cross-column stats

Started by Tomas Vondra · over 15 years ago · 61 messages · pgsql-hackers
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and
then in suboptimal plans).

I had been thinking about this issue before, but that session was the
final impulse that pushed me to try to hack up a proof of concept. So
here it is ;-)

Let's talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, and what issues there are. Anyway, this is by
no means a perfect or complete solution - it's just a starting point.

------------------------ Short introduction ------------------------

Say we have a table with two INT columns - col_a and col_b - and we want
to estimate the number of rows matching a condition involving both columns:

WHERE (col_a BETWEEN m AND n) AND (col_b BETWEEN p AND q)

When the columns are independent, doing the estimate is just a matter of
multiplication. When the columns are dependent, the estimate may be way off.

Let's assume there are histograms with 5 bins for each column. What I
propose is basically building a 2D histogram. It somewhat resembles a
contingency table.

So we do have a table like this

 col_b \ col_a ||  20% |  20% |  20% |  20% |  20% |
===============||===================================
           20% ||   4% |   4% |   4% |   4% |   4% |
           20% ||   4% |   4% |   4% |   4% |   4% |
           20% ||   4% |   4% |   4% |   4% |   4% |
           20% ||   4% |   4% |   4% |   4% |   4% |
           20% ||   4% |   4% |   4% |   4% |   4% |
===============||===================================

where each column / row represents a bin of the original histograms, and
each cell contains the expected fraction of rows falling into it (for
truly independent columns, that's 4%).

For dependent columns the actual values may of course be very
different - e.g. for strongly dependent columns it might be like this:

 col_b \ col_a ||  20% |  20% |  20% |  20% |  20% |
===============||===================================
           20% ||  20% |   0% |   0% |   0% |   0% |
           20% ||   0% |  20% |   0% |   0% |   0% |
           20% ||   0% |   0% |  20% |   0% |   0% |
           20% ||   0% |   0% |   0% |  20% |   0% |
           20% ||   0% |   0% |   0% |   0% |  20% |
===============||===================================

To estimate the number of rows matching the condition, you'd sum
estimates for cells matching the condition. I.e. when the condition on
col_a matches the lowest 20% (the first histogram bin) and the condition
on col_b matches the lowest 20% of values, this corresponds to the first
cell (20% of rows).

The current optimizer estimates this at 4%, as it believes the columns
are independent.
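To make the cell-summing concrete, here is a toy sketch in Python (not the PL/pgSQL PoC - the bin counts, data distribution, and partial-cell scaling are my own illustrative choices) that builds the 2D histogram and sums cell counts over a range condition:

```python
import random

def build_hist2d(rows, bins_a, bins_b, lo=0.0, hi=1000.0):
    """Build a 2D histogram (contingency table) over (col_a, col_b)."""
    width_a = (hi - lo) / bins_a
    width_b = (hi - lo) / bins_b
    grid = [[0] * bins_b for _ in range(bins_a)]
    for a, b in rows:
        i = min(int((a - lo) / width_a), bins_a - 1)
        j = min(int((b - lo) / width_b), bins_b - 1)
        grid[i][j] += 1
    return grid, width_a, width_b, lo

def estimate(grid, width_a, width_b, lo, a_from, a_to, b_from, b_to):
    """Sum cell counts over the query box, scaling partially covered cells."""
    total = 0.0
    for i, row in enumerate(grid):
        a_lo, a_hi = lo + i * width_a, lo + (i + 1) * width_a
        frac_a = max(0.0, min(a_hi, a_to) - max(a_lo, a_from)) / width_a
        if frac_a == 0.0:
            continue
        for j, count in enumerate(row):
            b_lo, b_hi = lo + j * width_b, lo + (j + 1) * width_b
            frac_b = max(0.0, min(b_hi, b_to) - max(b_lo, b_from)) / width_b
            total += count * frac_a * frac_b
    return total

random.seed(42)
# strongly dependent columns (col_a = col_b), the hard case for the planner
rows = [(x, x) for x in (random.uniform(0, 1000) for _ in range(100000))]
grid, wa, wb, lo = build_hist2d(rows, 20, 20)
est = estimate(grid, wa, wb, lo, 50, 250, 50, 250)
actual = sum(1 for a, b in rows if 50 <= a <= 250 and 50 <= b <= 250)
indep = (200 / 1000) ** 2 * len(rows)  # what the independence assumption gives
print(round(est), actual, round(indep))
```

With 20x20 bins the summed-cell estimate lands close to the actual count, while the independence estimate is off by roughly a factor of five - the same pattern as in test B below.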

I'm not sure whether I've explained it well enough, but the essence of
the proposal is to build N-dimensional histograms (where N is the number
of columns covered by the statistics) just as we are building histograms
today.

----------------------- Proof of concept ---------------------------

I've hacked a nasty proof of concept in PL/pgSQL (see the attachment).
It creates two tables - test_table and cross_stats. The former contains
the data (in two columns), the latter is used to store cross-column
statistics of test_table.

Then there are several PL/pgSQL functions - two of them are really
important:

collect_stats(p_bins_a INT, p_bins_b INT)
- this one is used to collect the stats, the parameters represent
number of bins for the columns (sides of the contingency table)
- collect_stats(10,10) will build contingency table with 100 cells

get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, p_to_b INT)
- computes estimate for the condition listed above (ranges on both
columns)

So to run the PoC, just do something like this:

1) fill the table with some data

INSERT INTO test_table SELECT round(random()*1000),
round(random()*1000) FROM generate_series(1,100000);

2) collect the cross-column statistics

SELECT collect_stats(10,10);

3) see the current estimate and the actual number of rows

EXPLAIN ANALYZE SELECT * FROM test_table
WHERE (col_a BETWEEN 30 AND 129)
AND (col_b BETWEEN 234 AND 484);

4) see the estimate based on contingency table

SELECT get_estimate(30, 129, 234, 484);

Just two very simple tests for now - the col_a/col_b columns contain the
ranges from the query, followed by the actual number of matching rows,
the current estimate, and finally the new estimates based on contingency
tables with various numbers of bins.

A) independent columns (proof that it produces just as good estimates
as the current code)

  col_a    |   col_b   | actual | expected | 10x10 | 20x20 |
 [50,70]   | [50,70]   |     41 |       40 |    41 |    47 |
 [50,250]  | [50,250]  |   4023 |     4024 |  4436 |  3944 |
 [50,250]  | [750,950] |   4023 |     3955 |  4509 |  3933 |

B) strongly positively dependent columns (actually col_a = col_b)

  col_a    |   col_b   | actual | expected | 10x10 | 20x20 | 100x100 |
 [50,70]   | [50,70]   |   2143 |       57 |   391 |   729 |    1866 |
 [50,250]  | [50,250]  |  20181 |     4111 | 15401 | 19983 |   19991 |
 [50,250]  | [750,950] |      0 |     3977 |     0 |     0 |       0 |

Which shows that it actually improves the estimates significantly.

Again, I hacked this up in about an hour, so it's really slow, and I
think some of the estimates could be improved further. And the precision
obviously depends on the number of cells.

----------------------- Additional notes ---------------------------

1) The whole thing may be easily extended to more than two columns,
just build N-dimensional cubes.

2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).

The only exception might be columns from a multi-column index. It
might be quite efficient I guess?

3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.

http://mathworld.wolfram.com/Chi-SquaredTest.html
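The test in (3) is cheap to compute directly from the contingency table. A minimal pure-Python sketch (the tables and the 5% critical value for df = 4 are illustrative; a real implementation would pick the threshold per table shape):

```python
def chi_squared_stat(observed):
    """Pearson's chi-squared statistic for an r x c contingency table."""
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

# df = (rows-1)*(cols-1); for a 3x3 table df = 4, 5% critical value ~ 9.488
independent = [[100, 101, 99], [98, 102, 100], [101, 99, 100]]
dependent   = [[300,   0,  0], [  0, 300,  0], [  0,   0, 300]]
print(chi_squared_stat(independent))  # small  -> drop the cross-column stats
print(chi_squared_stat(dependent))    # large  -> keep them
```

If the statistic stays below the critical value, the columns look independent and the stats can be thrown away, exactly as (3) suggests.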

4) Or we could store just those cells where expected and observed values
differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).

5) It does not make sense to collect stats for two lists of columns that
share several columns - just collect cross stats for the union of the
column lists and then 'drill up' as needed. An extreme case might be
collecting stats for all columns in a table. But there are practical
issues with this approach (huge tables, small values within cells).

6) The precision increases with the number of cells, but the number of
cells grows exponentially. So it's necessary to choose a
reasonable number of bins for the columns (this might be part of the
ALTER command; it should not be firmly bound to the statistics target).

At the cell level, the simple 'multiplication' estimates are still
used (but only when a cell is partially covered by the range).

7) All this is done under the assumption that all the columns are in a
single table - while doing research in the archives I noticed
suggestions to use this to optimize joins. Maybe it can be done, but
I really don't know how to do that.

8) I'm not sure where to store this and in what form - generally the
table does not need to be significantly more complex than cross_stats
table from the PoC.

9) I've noticed demands to collect data about columns used frequently
together in a single WHERE condition. I'm not sure how to do that or
how to analyze the collected data. I think collecting data about
expected and observed numbers of rows should be just fine - but that's
rather independent of this proposal.

regards
Tomas

Attachments:

cross-column-poc.sql (text/plain)
#2 Martijn van Oosterhout
kleptog@svana.org
In reply to: Tomas Vondra (#1)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:

Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and
then in suboptimal plans).

Very cool that you're working on this.

Lets talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems to rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.
Not that I suggest you fix this, but it's food for thought. Though
strictly speaking this is a different kind of correlation than what
you're looking at.

2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).

The only exception might be columns from a multi-column index. It
might be quite efficient I guess?

In the past it has been suggested to only do it for multi-column
indexes, but I find these days I find in some situations I prefer to
make individual indexes and let the bitmap scan code combine them. So
perhaps it would be best to let it be configured by the DBA.

3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.

http://mathworld.wolfram.com/Chi-SquaredTest.html

I think this would be good to include, if possible.

Actually, I wonder if the existing stats collection code could be
altered to attempt to calculate the correlation between columns as part
of its other work.

4) Or we could store just those cells where expected and observed values
differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).

Compressing that grid would be useful, given that for many dimensions
most of the grid will not be interesting. In fact, storing the 20
largest values may be enough. Worth an experiment.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Patriotism is when love of your own people comes first; nationalism,
when hate for people other than your own comes first.
- Charles de Gaulle

#3 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Martijn van Oosterhout (#2)
Re: proposal : cross-column stats

On 12.12.2010 15:17, Martijn van Oosterhout wrote:

On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
Very cool that you're working on this.

+1

Lets talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems to rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my
presentation.

The way I think of that problem is that once you know the postcode,
knowing the city name doesn't add any information. The postcode implies
the city name. So the selectivity for "postcode = ? AND city = ?" should
be the selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B. Perhaps that could be measured by counting the number of
distinct values of column B for each value of column A, or something
like that. I don't know what the statisticians call that property, or if
there's some existing theory on how to measure that from a sample.

That's assuming the combination has any matches. It's possible that the
user chooses a postcode and city combination that doesn't exist, but
that's no different from a user doing "city = 'fsdfsdfsd'" on a single
column, returning no matches. We should assume that the combination
makes sense.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Martijn van Oosterhout (#2)
Re: proposal : cross-column stats

Hi!

On 12.12.2010 15:17, Martijn van Oosterhout wrote:

Lets talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems to rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

Yes, I know - I just forgot to address this in my previous e-mail. The
contingency tables have a really nice feature - they are based on
splitting the sets into groups (~ bins of the histograms for each
column). And this can be done whenever you can sort the values; you
really don't need any calculations. So it should work with strings.
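A sketch of that idea in Python - equal-frequency bins need nothing beyond a sort order, so strings bin just as easily as integers (the city names and bin count below are arbitrary examples, not anything from the PoC):

```python
import bisect

def equal_frequency_bounds(sample, nbins):
    """Upper bounds of equal-frequency bins; needs only a sort order."""
    ordered = sorted(sample)
    step = len(ordered) / nbins
    return [ordered[min(int(step * (k + 1)) - 1, len(ordered) - 1)]
            for k in range(nbins)]

def bin_of(bounds, value):
    """Index of the histogram bin a value falls into."""
    return min(bisect.bisect_left(bounds, value), len(bounds) - 1)

cities = ["Athens", "Berlin", "Brno", "Lyon", "Oslo", "Paris", "Praha", "Wien"]
bounds = equal_frequency_bounds(cities, 4)  # 4 bins of 2 cities each
print(bounds)                # ['Berlin', 'Lyon', 'Paris', 'Wien']
print(bin_of(bounds, "Brno"))
```

The same `bin_of` lookup works for any comparable type, which is all the contingency-table cells require.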

And another thing I somehow forgot is handling the case when there is no
histogram, just an MCV list. That's mostly the same - each of the values
might be a separate group, or the values might be grouped to form fewer
groups, etc.

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.
Not that I suggest you fix this, but it's food for thought. Though
strictly speaking this is a different kind of correlation than what
you're looking at.

Hmmm, I see. I think the proposal does not fix this particular case,
although it might improve the situation a little bit (limit the error
between expected and observed number of rows).

The problem is that once we get to a cell-level of the contingency
table, there is no additional (more detailed) information. So we're
stuck with the multiplication estimate, or something like that.

I was thinking about it actually, and I think we could collect some more
info - a correlation coefficient for each bin, or something like that.
But that was not part of my proposal, and I'm not sure how to do that.

2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).

The only exception might be columns from a multi-column index. It
might be quite efficient I guess?

In the past it has been suggested to only do it for multi-column
indexes, but I find these days I find in some situations I prefer to
make individual indexes and let the bitmap scan code combine them. So
perhaps it would be best to let it be configured by the DBA.

Yes, I prefer individual indexes too.

The idea behind collecting cross-column stats for multi-column indexes
was that maybe we could 'append' this to the current functionality
(building the index or something like that) so that it does not
introduce significant performance problems.

3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.

http://mathworld.wolfram.com/Chi-SquaredTest.html

I think this would be good to include, if possible.

Actually, I wonder if the existing stats collection code could be
altered to attempt to calculate the correlation between columns as part
of its other work.

I guess that would be rather expensive - to compute a correlation you
need two passes, and you need to do that for each pair of columns. So
I'd be surprised if it were feasible (and effective).

Another thing is that you can compute correlation only for numeric
columns, so it's not possible to do that for city/ZIP code mentioned
above. More precisely - it's possible to do that (if you map strings to
numbers somehow), but I doubt you'll get useful results as the
assignment is rather random.

Well, you could ask the governments to assign ZIP codes to cities in
strictly alphabetical order, but I guess they'd say no.

4) Or we could store just those cells where expected and observed values
differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).

Compressing that grid would be useful, given that for many dimensions
most of the grid will not be interesting. In fact, storing the 20
largest values may be enough. Worth an experiment.

Not exactly just the largest values - rather values that are
significantly different from the expected values. Generally there are
two interesting cases

expected << observed - The optimizer may choose index scan, although
the seq scan would be better.

expected >> observed - The optimizer may choose seq scan, although
the index scan would be better.

regards
Tomas

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#3)
Re: proposal : cross-column stats

On 12.12.2010 15:43, Heikki Linnakangas wrote:

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my
presentation.

The way I think of that problem is that once you know the postcode,
knowing the city name doesn't add any information. The postcode implies
the city name. So the selectivity for "postcode = ? AND city = ?" should
be the selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B. Perhaps that could be measured by counting the number of
distinct values of column B for each value of column A, or something
like that. I don't know what the statisticians call that property, or if
there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires
knowledge about dependencies between individual values (working with
cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (number of cities for a given
ZIP code and number of ZIP codes for a given city). Collecting a good
stats about this is a bit tricky, but possible. What about collecting
this for the MCVs from both columns?

Tomas

#6 Florian Pflug
fgp@phlo.org
In reply to: Heikki Linnakangas (#3)
Re: proposal : cross-column stats

On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:

The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample.

The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the assumption or knowledge of B. The basic tool for working with conditional probabilities is Bayes' theorem, which states that

P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes

P(A) = P(A and B) / P(B) <=>
P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".
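To see why this matters for the postcode/city case, here is a toy numeric check in Python (the ZIP codes, city names, and row counts are invented purely for illustration):

```python
from collections import Counter

# toy table of (zip, city) rows; each zip implies exactly one city
rows = ([("10115", "Berlin")] * 50 + [("10243", "Berlin")] * 30 +
        [("80331", "Munich")] * 20)
n = len(rows)

p_zip  = Counter(z for z, _ in rows)["10115"] / n   # P(zip = 10115)
p_city = Counter(c for _, c in rows)["Berlin"] / n  # P(city = Berlin)
p_both = sum(1 for z, c in rows if z == "10115" and c == "Berlin") / n

print(p_both)          # 0.5 = P(zip): the zip fully implies the city
print(p_zip * p_city)  # 0.4: the independence assumption underestimates
```

Here P(city|zip) = 1, so P(zip and city) collapses to P(zip) alone, exactly as argued above, while the independence product is too small.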

I believe that measuring this by counting the number of distinct values of column B for each A is basically the right idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and compare that to the overall number of distinct values of "b"...

A (very) quick search on scholar.google.com for "estimate conditional probability" didn't turn up anything useful, but it's hard to believe that there isn't at least some literature on the subject.

best regards,
Florian Pflug

#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#6)
Re: proposal : cross-column stats

On 12.12.2010 17:33, Florian Pflug wrote:

On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:

The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample.

The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the assumption or knowledge of B. The basic tool for working with conditional probabilities is bayes' theorem which states that

P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes

P(A) = P(A and B) / P(B) <=>
P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".

I believe that measuring this by counting the number of distinct values of column B for each A is basically the right idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and compare that to the overall number of distinct values of "b"...

Good point!

Well, I was thinking about this too - generally this means creating a
contingency table with the MCVs as bins. Then you can compute these
interesting probabilities P(A and B). (OK, now I definitely look like
some contingency-table weirdo who tries to solve everything with a
contingency table. OMG!)

The question is - what are we going to do when the values in the query
are not in the MCV list? Is there some heuristic to estimate the
probability from the MCVs, or something like that? Could we use some
"average" probability or what?

Tomas

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

The way I think of that problem is that once you know the postcode, knowing
the city name doesn't add any information. The postcode implies the city
name. So the selectivity for "postcode = ? AND city = ?" should be the
selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B. Perhaps that could be measured by counting the number of distinct
values of column B for each value of column A, or something like that. I
don't know what the statisticians call that property, or if there's some
existing theory on how to measure that from a sample.

This is a good idea, but I guess the question is what you do next. If
you know that the "applicability" is 100%, you can disregard the
restriction clause on the implied column. And if it has no
implicatory power, then you just do what we do now. But what if it
has some intermediate degree of implicability?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#8)
Re: proposal : cross-column stats

On 13.12.2010 01:05, Robert Haas wrote:

This is a good idea, but I guess the question is what you do next. If
you know that the "applicability" is 100%, you can disregard the
restriction clause on the implied column. And if it has no
implicatory power, then you just do what we do now. But what if it
has some intermediate degree of implicability?

Well, I think you've missed the e-mail from Florian Pflug - he actually
pointed out that the 'implicativeness' Heikki mentioned is called
conditional probability. And conditional probability can be used to
express the "AND" probability we are looking for (the selectivity).

For two columns, this is actually pretty straightforward - as Florian
wrote, the equation is

P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

where P(B) may be estimated from the current histogram, and P(A|B) may
be estimated from the contingency (see the previous mails). And "P(A and
B)" is actually the value we're looking for.

Anyway, there really is no "intermediate" degree of applicability - it
just gives you the right estimate.

And AFAIR this is easily extensible to more than two columns, as

P(A and B and C) = P(A and (B and C)) = P(A|(B and C)) * P(B and C)

so it's basically a recursion.
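That recursion is just the chain rule of probability. A quick sanity check on toy boolean columns in Python (the data is made up; the identity holds exactly for observed frequencies, which is what the contingency table stores):

```python
import random

random.seed(1)
# three correlated boolean "columns": b tends to follow a, c is independent
rows = [(a, a or random.random() < 0.3, random.random() < 0.5)
        for a in (random.random() < 0.4 for _ in range(50000))]

def freq(pred):
    """Empirical probability of a predicate over the sampled rows."""
    return sum(1 for r in rows if pred(r)) / len(rows)

p_abc = freq(lambda r: r[0] and r[1] and r[2])
p_bc = freq(lambda r: r[1] and r[2])
n_bc = sum(1 for r in rows if r[1] and r[2])
n_abc = sum(1 for r in rows if r[0] and r[1] and r[2])
p_a_given_bc = n_abc / n_bc  # P(A | B and C) from the sample

# P(A and B and C) = P(A | B and C) * P(B and C)
print(p_abc, p_a_given_bc * p_bc)
```

The two printed values agree (up to rounding), confirming that the recursion only needs a conditional estimate at each level.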

Well, I hope my statements are really correct - it's been a few years
since I gained my degree in statistics ;-)

regards
Tomas

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#9)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 13.12.2010 01:05, Robert Haas wrote:

This is a good idea, but I guess the question is what you do next.  If
you know that the "applicability" is 100%, you can disregard the
restriction clause on the implied column.  And if it has no
implicatory power, then you just do what we do now.  But what if it
has some intermediate degree of implicability?

Well, I think you've missed the e-mail from Florian Pflug - he actually
pointed out that the 'implicativeness' Heikki mentioned is called
conditional probability. And conditional probability can be used to
express the "AND" probability we are looking for (selectiveness).

For two columns, this is actually pretty straightforward - as Florian
wrote, the equation is

  P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

Well, the question is what data you are actually storing. It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly. But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#6)
Re: proposal : cross-column stats

P(A|B) = P(A and B) / P(B).

Well, up to this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).

Sometimes that may return nearly 2x the actual selectivity, but in
general it's a reasonable estimate. Are there any severe failure cases
that produce much worse estimates?
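A quick numeric check of the s1+s2-s1*s2 formula in Python (toy data; the duplicated-column case is a deliberately extreme failure, not something from the optimizer):

```python
import random

random.seed(7)
rows = [(random.random(), random.random()) for _ in range(200000)]

s1 = sum(1 for a, _ in rows if a < 0.3) / len(rows)  # selectivity of A
s2 = sum(1 for _, b in rows if b < 0.3) / len(rows)  # selectivity of B
actual_or = sum(1 for a, b in rows if a < 0.3 or b < 0.3) / len(rows)
print(s1 + s2 - s1 * s2, actual_or)   # close for independent columns

# failure case: col_b duplicates col_a, so A and B are the same predicate
dup = [(a, a) for a, _ in rows]
actual_dup = sum(1 for a, b in dup if a < 0.3 or b < 0.3) / len(dup)
print(2 * s1 - s1 * s1, actual_dup)   # formula overestimates: ~0.51 vs ~0.30
```

So for independent columns the inclusion-exclusion formula is essentially exact, and even in the worst (fully dependent) case it stays below 2x the actual selectivity, matching the observation above.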

regards
Tomas

#12 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#10)
Re: proposal : cross-column stats

On 13.12.2010 03:00, Robert Haas wrote:

Well, the question is what data you are actually storing. It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly. But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

regards
Tomas

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#12)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 13.12.2010 03:00, Robert Haas wrote:

Well, the question is what data you are actually storing.  It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly.  But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)? Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Nathan Boley
npboley@gmail.com
In reply to: Robert Haas (#13)
Re: proposal : cross-column stats

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases.  For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)?

Not to mention that the model parameters will be impossible to estimate well.

( I've only scanned the thread, so sorry if Im repeating something
that's already been said )

My intuition is that storing the covariance structure will be
unworkable both technically and statistically for all but the simplest
of cases. That being said, I dont think the problem is a lack of ways
to parameterize the covariance structure ( there are several papers on
multi-dim histogram estimators, at least one of which talks about
databases explicitly, not to mention approaches like CART [1] ), but a
complete lack of infrastructure to do anything with the estimates. So
keep working on it ;-) ( but try to make the framework general enough
to allow better estimators ).

I wonder if a good first step might be to focus on the AND and OR
operators since they seem like the most common cases and union and
intersection commute ( although it's important to remember that the
estimate variances do NOT ) That is, what about estimating the
selectivity of the condition

WHERE X=A and Y=B

by

f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) +
x_2*selectivity(X = A)*selectivity( Y = B ) + x_3

where x_{1,2,3} are parameters to be estimated from the data.
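To make the shape of that estimator concrete, here is a minimal Python sketch (the function name, the clamping, and the sample values are invented purely for illustration):

```python
def combined_selectivity(sel_x, sel_y, x1, x2, x3):
    """Nathan's proposed form:
    f(A,B) = x1*(sel(X=A) + sel(Y=B)) + x2*sel(X=A)*sel(Y=B) + x3,
    clamped to the valid selectivity range [0, 1]."""
    est = x1 * (sel_x + sel_y) + x2 * (sel_x * sel_y) + x3
    return max(0.0, min(1.0, est))

# With x1 = 0, x2 = 1, x3 = 0 the formula degenerates to the
# independence assumption the planner uses today: sel(X=A) * sel(Y=B).
print(combined_selectivity(0.1, 0.2, 0.0, 1.0, 0.0))
```

Fitting x_{1,2,3} would presumably happen at ANALYZE time, e.g. by least squares over sampled value pairs.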

Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple years back ).

Best,
Nathan

[1]: http://en.wikipedia.org/wiki/Classification_and_regression_tree
[2]: http://en.wikipedia.org/wiki/Copula_%28statistics%29

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#13)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Dne 13.12.2010 03:00, Robert Haas napsal(a):

Well, the question is what data you are actually storing.  It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly.  But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)? Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

Yes, storing a complete contingency table is not feasible in such cases.
My original proposal actually did not address this particular issue
(cities and ZIP codes) as it was based on simplified contingency tables
(with bins corresponding to current histograms, not to individual values).
So the number of values to store would grow much more slowly.

On the other hand, this generalization really makes it unusable in some
cases, and the issue we're discussing here (cities and ZIP codes) is one
of them. I think in such cases we could build a contingency table for the
MCVs and then use it to estimate those conditional probabilities we need,
but I expect it to be very tricky.
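A sketch of what that MCV-restricted contingency table might look like (all names and frequencies below are invented; a real implementation would of course live in ANALYZE and the planner, not in Python):

```python
# Joint frequencies measured over the MCV lists of both columns.
mcv_joint = {
    ("Springfield", "12345"): 0.04,
    ("Springfield", "12346"): 0.01,
    ("Shelbyville", "54321"): 0.03,
}
# Per-column MCV frequencies, as already kept in single-column stats.
mcv_city = {"Springfield": 0.05, "Shelbyville": 0.03}
mcv_zip = {"12345": 0.04, "12346": 0.01, "54321": 0.03}

OTHER_FREQ = 0.001  # assumed frequency of a non-MCV value

def sel_city_and_zip(city, zip_code):
    """Use the joint MCV table when the pair is in it; otherwise fall
    back to the independence assumption the planner uses today."""
    if (city, zip_code) in mcv_joint:
        return mcv_joint[(city, zip_code)]
    return mcv_city.get(city, OTHER_FREQ) * mcv_zip.get(zip_code, OTHER_FREQ)
```

Note how the correlated pair ("Springfield", "12345") is estimated at 0.04 rather than the 0.05 * 0.04 = 0.002 the independence assumption would give.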

Thanks for the comments.

Tomas

#16Yeb Havinga
yebhavinga@gmail.com
In reply to: Robert Haas (#13)
Re: proposal : cross-column stats

On 2010-12-13 03:28, Robert Haas wrote:

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)? Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

A sparse matrix that holds only 'implicative' (P(A|B) <> P(A*B)?)
combinations? Also, some information might be deduced from others. For
Heikki's city/region example, for each city it would be known that it is
100% in one region. In that case it suffices to store only that
information, since 0% in all other regions can be deduced. I wouldn't be
surprised if storing implications like this would reduce the size to O(n).
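The O(n) idea could be sketched as a stored functional dependency from city to region (all data below invented for illustration):

```python
# Store only the implication "city C lies 100% in region R"; every
# other (city, region) pair is implicitly 0%, so n entries suffice
# instead of an m*n contingency table.
city_region = {"Helsinki": "Uusimaa", "Tampere": "Pirkanmaa"}

def sel_city_and_region(sel_city, city, region):
    """If the region clause is implied by the city clause, it adds no
    further restriction; if it contradicts the stored implication,
    nothing matches."""
    return sel_city if city_region.get(city) == region else 0.0
```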

regards,
Yeb Havinga

#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Yeb Havinga (#16)
Re: proposal : cross-column stats

On 2010-12-13 03:28, Robert Haas wrote:

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)? Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

A sparse matrix that holds only 'implicative' (P(A|B) <> P(A*B)?)
combinations? Also, some information might be deduced from others. For
Heikki's city/region example, for each city it would be known that it is
100% in one region. In that case it suffices to store only that
information, since 0% in all other regions can be deduced. I wouldn't be
surprised if storing implications like this would reduce the size to O(n).

OK, but I'll leave this for the future. My plan is to build a small PoC,
just to see whether the contingency-table + probability-estimates approach
works for the failure case mentioned by Heikki. I'd like to do this by
the end of this week, if possible.

I'll read the articles mentioned by Nathan Boley (thanks for those links;
if you have more of them, just let me know).

Once we have a solution that solves (or significantly improves) these
failure cases, we can make further plans (how to do that actually in the
code etc.).

BTW thanks for all the comments!

regards
Tomas

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#11)
Re: proposal : cross-column stats

Tomas Vondra <tv@fuzzy.cz> writes:

Well, until this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).

If you can solve the AND case, the OR case falls out of that. Just
replace s1*s2 with a more accurate AND calculation.
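In code form, this is just inclusion-exclusion with the AND estimate plugged in (function names invented for illustration):

```python
def sel_or(s1, s2, sel_and):
    # P(A or B) = s1 + s2 - P(A and B); the current planner
    # effectively plugs in s1*s2 for the last term.
    return s1 + s2 - sel_and(s1, s2)

independence = lambda s1, s2: s1 * s2
print(sel_or(0.3, 0.5, independence))  # 0.65
```

Any improved AND estimator can be passed in unchanged, which is why solving the AND case is the only hard part.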

regards, tom lane

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: proposal : cross-column stats

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases.

The reason that this wasn't done years ago is precisely that nobody's
figured out how to do it with a tolerable amount of stats data and a
tolerable amount of processing time (both at ANALYZE time and during
query planning). It's not hard to see what we'd ideally like to do;
it's getting from there to something useful in production that's hard.

regards, tom lane

#20Joshua Tolley
eggyknap@gmail.com
In reply to: Nathan Boley (#14)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:

Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple years back ).

Josh Tolley still looks at it occasionally, though time hasn't permitted any
sort of significant work for quite some time. The multicolstat branch on my
git.postgresql.org repository will create an empirical copula for each
multi-column index, and stick it in pg_statistic. It doesn't yet do anything
useful with that information, nor am I convinced it's remotely bug-free. In a
brief PGCon discussion with Tom a while back, it was suggested a good place
for the planner to use these stats would be clausesel.c, which is responsible
for handling clauses such as "...WHERE foo > 4 AND foo > 5".
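That example shows why clause-level handling matters (numbers below are invented): for redundant range clauses like "foo > 4 AND foo > 5", multiplying the individual selectivities underestimates badly, because the tighter clause alone determines the result.

```python
def independent_and(s1, s2):
    return s1 * s2      # the naive independence assumption

def redundant_and(s1, s2):
    return min(s1, s2)  # correct when one clause implies the other

s_gt4, s_gt5 = 0.6, 0.5  # assumed selectivities of foo > 4 and foo > 5
print(independent_and(s_gt4, s_gt5))  # 0.3 -- too low
print(redundant_and(s_gt4, s_gt5))    # 0.5
```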

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#18)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#19)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Joshua Tolley (#20)
#24Josh Berkus
josh@agliodbs.com
In reply to: Tomas Vondra (#22)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Josh Berkus (#24)
#26Pavel Stehule
pavel.stehule@gmail.com
In reply to: Josh Berkus (#24)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#25)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#3)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#28)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#23)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#29)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#31)
#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tomas Vondra (#31)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#32)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#33)
#36Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#35)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#36)
#38Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#37)
#39Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#38)
#40Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#19)
#41Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Simon Riggs (#40)
#42Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#39)
#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#42)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Florian Pflug (#42)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#44)
#46Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#43)
#47Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#46)
#48Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#47)
#49Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#45)
#50Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#48)
#51Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#49)
#52Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#49)
#53Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#51)
#54Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#52)
#55Nicolas Barbier
nicolas.barbier@gmail.com
In reply to: Florian Pflug (#54)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Nicolas Barbier (#55)
#57Florian Pflug
fgp@phlo.org
In reply to: Nicolas Barbier (#55)
#58Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#56)
#59Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#57)
#60Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#54)
#61Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#57)