proposal : cross-column stats

Started by Tomas Vondra · over 15 years ago · 61 messages · pgsql-hackers
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and
then in suboptimal plans).

I had been thinking about this issue before, but that session was the
final impulse that pushed me to try to hack up a proof of concept. So
here it is ;-)

Let's talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, and what issues there are. Anyway, this is by
no means a perfect or complete solution - it's just a starting point.

------------------------ Short introduction ------------------------

Say we have a table with two INT columns - col_a and col_b - and we want
to estimate the number of rows matching a condition involving both columns:

WHERE (col_a BETWEEN m AND n) AND (col_b BETWEEN p AND q)

When the columns are independent, doing the estimate is just a matter of
multiplication. When the columns are dependent, the estimate may be way off.

Let's assume there are histograms with 5 bins for each column. What I
propose is basically building a 2D histogram. It somewhat resembles a
contingency table.

So we do have a table like this

 col_b \ col_a ||  20% |  20% |  20% |  20% |  20% |
===============||===================================
           20% ||   4% |   4% |   4% |   4% |   4% |
           20% ||   4% |   4% |   4% |   4% |   4% |
           20% ||   4% |   4% |   4% |   4% |   4% |
           20% ||   4% |   4% |   4% |   4% |   4% |
           20% ||   4% |   4% |   4% |   4% |   4% |
===============||===================================

where each column / row represents a bin of the original histograms, and
each cell contains the expected fraction of rows falling into it (for
truly independent columns, that's 4%).

For dependent columns the actual values may of course be very
different - e.g. for strongly dependent columns it might be like this:

 col_b \ col_a ||  20% |  20% |  20% |  20% |  20% |
===============||===================================
           20% ||  20% |   0% |   0% |   0% |   0% |
           20% ||   0% |  20% |   0% |   0% |   0% |
           20% ||   0% |   0% |  20% |   0% |   0% |
           20% ||   0% |   0% |   0% |  20% |   0% |
           20% ||   0% |   0% |   0% |   0% |  20% |
===============||===================================

To estimate the number of rows matching the condition, you'd sum
estimates for cells matching the condition. I.e. when the condition on
col_a matches the lowest 20% (the first histogram bin) and the condition
on col_b matches the lowest 20% of values, this corresponds to the first
cell (20% of rows).

The current optimizer estimates this at 4%, as it believes the columns
are independent.
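To make the cell-summing concrete, here is a toy sketch in Python (not the PL/pgSQL PoC - the bin counts, data distribution, and partial-cell scaling are my own illustrative choices) that builds the 2D histogram and sums cell counts over a range condition:

```python
import random

def build_hist2d(rows, bins_a, bins_b, lo=0.0, hi=1000.0):
    """Build a 2D histogram (contingency table) over (col_a, col_b)."""
    width_a = (hi - lo) / bins_a
    width_b = (hi - lo) / bins_b
    grid = [[0] * bins_b for _ in range(bins_a)]
    for a, b in rows:
        i = min(int((a - lo) / width_a), bins_a - 1)
        j = min(int((b - lo) / width_b), bins_b - 1)
        grid[i][j] += 1
    return grid, width_a, width_b, lo

def estimate(grid, width_a, width_b, lo, a_from, a_to, b_from, b_to):
    """Sum cell counts over the query box, scaling partially covered cells."""
    total = 0.0
    for i, row in enumerate(grid):
        a_lo, a_hi = lo + i * width_a, lo + (i + 1) * width_a
        frac_a = max(0.0, min(a_hi, a_to) - max(a_lo, a_from)) / width_a
        if frac_a == 0.0:
            continue
        for j, count in enumerate(row):
            b_lo, b_hi = lo + j * width_b, lo + (j + 1) * width_b
            frac_b = max(0.0, min(b_hi, b_to) - max(b_lo, b_from)) / width_b
            total += count * frac_a * frac_b
    return total

random.seed(42)
# strongly dependent columns (col_a = col_b), the hard case for the planner
rows = [(x, x) for x in (random.uniform(0, 1000) for _ in range(100000))]
grid, wa, wb, lo = build_hist2d(rows, 20, 20)
est = estimate(grid, wa, wb, lo, 50, 250, 50, 250)
actual = sum(1 for a, b in rows if 50 <= a <= 250 and 50 <= b <= 250)
indep = (200 / 1000) ** 2 * len(rows)  # what the independence assumption gives
print(round(est), actual, round(indep))
```

With 20x20 bins the summed-cell estimate lands close to the actual count, while the independence estimate is off by roughly a factor of five - the same pattern as in test B below.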

I'm not sure whether I've explained it well enough, but the essence of
the proposal is to build N-dimensional histograms (where N is the number
of columns covered by the statistics) just as we are building histograms
today.

----------------------- Proof of concept ---------------------------

I've hacked a nasty proof of concept in PL/pgSQL (see the attachment).
It creates two tables - test_table and cross_stats. The former contains
the data (in two columns), the latter is used to store cross-column
statistics of test_table.

Then there are several PL/pgSQL functions - two of them are really
important:

collect_stats(p_bins_a INT, p_bins_b INT)
- this one is used to collect the stats, the parameters represent
number of bins for the columns (sides of the contingency table)
- collect_stats(10,10) will build contingency table with 100 cells

get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, p_to_b INT)
- computes estimate for the condition listed above (ranges on both
columns)

So to run the PoC, just do something like this:

1) fill the table with some data

INSERT INTO test_table SELECT round(random()*1000),
round(random()*1000) FROM generate_series(1,100000);

2) collect the cross-column statistics

SELECT collect_stats(10,10);

3) see the current estimate and the actual number of rows

EXPLAIN ANALYZE SELECT * FROM test_table
WHERE (col_a BETWEEN 30 AND 129)
AND (col_b BETWEEN 234 AND 484);

4) see the estimate based on contingency table

SELECT get_estimate(30, 129, 234, 484);

Just two very simple tests for now - the col_a/col_b columns contain the
ranges from the query, followed by the actual number of matching rows,
the current estimate, and finally the new estimates based on contingency
tables with various numbers of bins.

A) independent columns (proof that it produces just as good estimates
as the current code)

  col_a    |   col_b   | actual | expected | 10x10 | 20x20 |
 [50,70]   | [50,70]   |     41 |       40 |    41 |    47 |
 [50,250]  | [50,250]  |   4023 |     4024 |  4436 |  3944 |
 [50,250]  | [750,950] |   4023 |     3955 |  4509 |  3933 |

B) strongly positively dependent columns (actually col_a = col_b)

  col_a    |   col_b   | actual | expected | 10x10 | 20x20 | 100x100 |
 [50,70]   | [50,70]   |   2143 |       57 |   391 |   729 |    1866 |
 [50,250]  | [50,250]  |  20181 |     4111 | 15401 | 19983 |   19991 |
 [50,250]  | [750,950] |      0 |     3977 |     0 |     0 |       0 |

Which shows that it actually improves the estimates significantly.

Again, I hacked this up in about an hour, so it's really slow, and I
think some of the estimates could be improved further. And the precision
obviously depends on the number of cells.

----------------------- Additional notes ---------------------------

1) The whole thing may be easily extended to more than two columns,
just build N-dimensional cubes.

2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).

The only exception might be columns from a multi-column index. It
might be quite efficient I guess?

3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.

http://mathworld.wolfram.com/Chi-SquaredTest.html
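The test in (3) is cheap to compute directly from the contingency table. A minimal pure-Python sketch (the tables and the 5% critical value for df = 4 are illustrative; a real implementation would pick the threshold per table shape):

```python
def chi_squared_stat(observed):
    """Pearson's chi-squared statistic for an r x c contingency table."""
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

# df = (rows-1)*(cols-1); for a 3x3 table df = 4, 5% critical value ~ 9.488
independent = [[100, 101, 99], [98, 102, 100], [101, 99, 100]]
dependent   = [[300,   0,  0], [  0, 300,  0], [  0,   0, 300]]
print(chi_squared_stat(independent))  # small  -> drop the cross-column stats
print(chi_squared_stat(dependent))    # large  -> keep them
```

If the statistic stays below the critical value, the columns look independent and the stats can be thrown away, exactly as (3) suggests.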

4) Or we could store just those cells where expected and observed values
differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).

5) It does not make sense to collect stats for two lists of columns that
share several columns - just collect cross stats for the union of the
column lists and then 'drill up' as needed. An extreme case might be
collecting stats for all columns in a table. But there are practical
issues with this approach (huge tables, small values within cells).

6) The precision increases with the number of cells, but the number of
cells grows exponentially. So it's necessary to choose a
reasonable number of bins for the columns (this might be part of the
ALTER command; it should not be firmly bound to the statistics target).

At the cell level, the simple 'multiplication' estimates are still
used (but only when a cell is partially covered by the range).

7) All this is done under the assumption that all the columns are in a
single table - while doing research in the archives I noticed
suggestions to use this to optimize joins. Maybe it can be done, but
I really don't know how to do that.

8) I'm not sure where to store this and in what form - generally the
table does not need to be significantly more complex than cross_stats
table from the PoC.

9) I've noticed demands to collect data about columns used frequently
together in a single WHERE condition. I'm not sure how to do that or
how to analyze the collected data. I think collecting data about
expected and observed numbers of rows should be just fine - but that's
rather independent of this proposal.

regards
Tomas

Attachments:

cross-column-poc.sql (text/plain)
#2 Martijn van Oosterhout
kleptog@svana.org
In reply to: Tomas Vondra (#1)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:

Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and
then in suboptimal plans).

Very cool that you're working on this.

Lets talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems to rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.
Not that I suggest you fix this, but it's food for thought. Though
strictly speaking this is a different kind of correlation than what
you're looking at.

2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).

The only exception might be columns from a multi-column index. It
might be quite efficient I guess?

In the past it has been suggested to only do it for multi-column
indexes, but I find these days I find in some situations I prefer to
make individual indexes and let the bitmap scan code combine them. So
perhaps it would be best to let it be configured by the DBA.

3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.

http://mathworld.wolfram.com/Chi-SquaredTest.html

I think this would be good to include, if possible.

Actually, I wonder if the existing stats collection code could be
altered to attempt to calculate the correlation between columns as part
of its other work.

4) Or we could store just those cells where expected and observed values
differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).

Compressing that grid would be useful, given that for many dimensions
most of the grid will not be interesting. In fact, storing the 20
largest values may be enough. Worth an experiment.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Patriotism is when love of your own people comes first; nationalism,
when hate for people other than your own comes first.
- Charles de Gaulle

#3 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Martijn van Oosterhout (#2)
Re: proposal : cross-column stats

On 12.12.2010 15:17, Martijn van Oosterhout wrote:

On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
Very cool that you're working on this.

+1

Lets talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems to rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my
presentation.

The way I think of that problem is that once you know the postcode,
knowing the city name doesn't add any information. The postcode implies
the city name. So the selectivity for "postcode = ? AND city = ?" should
be the selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B. Perhaps that could be measured by counting the number of
distinct values of column B for each value of column A, or something
like that. I don't know what the statisticians call that property, or if
there's some existing theory on how to measure that from a sample.

That's assuming the combination has any matches. It's possible that the
user chooses a postcode and city combination that doesn't exist, but
that's no different from a user doing "city = 'fsdfsdfsd'" on a single
column, returning no matches. We should assume that the combination
makes sense.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Martijn van Oosterhout (#2)
Re: proposal : cross-column stats

Hi!

On 12.12.2010 15:17, Martijn van Oosterhout wrote:

Lets talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems to rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

Yes, I know - I just forgot to address this in my previous e-mail. The
contingency tables have a really nice feature - they are based on
splitting the sets into groups (~ bins of the histograms for each
column). And this can be done whenever you can sort the values; you
really don't need any calculations. So it should work with strings.
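A sketch of that idea in Python - equal-frequency bins need nothing beyond a sort order, so strings bin just as easily as integers (the city names and bin count below are arbitrary examples, not anything from the PoC):

```python
import bisect

def equal_frequency_bounds(sample, nbins):
    """Upper bounds of equal-frequency bins; needs only a sort order."""
    ordered = sorted(sample)
    step = len(ordered) / nbins
    return [ordered[min(int(step * (k + 1)) - 1, len(ordered) - 1)]
            for k in range(nbins)]

def bin_of(bounds, value):
    """Index of the histogram bin a value falls into."""
    return min(bisect.bisect_left(bounds, value), len(bounds) - 1)

cities = ["Athens", "Berlin", "Brno", "Lyon", "Oslo", "Paris", "Praha", "Wien"]
bounds = equal_frequency_bounds(cities, 4)  # 4 bins of 2 cities each
print(bounds)                # ['Berlin', 'Lyon', 'Paris', 'Wien']
print(bin_of(bounds, "Brno"))
```

The same `bin_of` lookup works for any comparable type, which is all the contingency-table cells require.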

And another thing I somehow forgot is handling the case when there is no
histogram, just an MCV list. That's mostly the same - each of the values
might be a separate group, or the values might be grouped to form fewer
groups, etc.

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.
Not that I suggest you fix this, but it's food for thought. Though
strictly speaking this is a different kind of correlation than what
you're looking at.

Hmmm, I see. I think the proposal does not fix this particular case,
although it might improve the situation a little bit (limit the error
between expected and observed number of rows).

The problem is that once we get to a cell-level of the contingency
table, there is no additional (more detailed) information. So we're
stuck with the multiplication estimate, or something like that.

I was thinking about it actually, and I think we could collect some more
info - a correlation coefficient for each bin, or something like that.
But that was not part of my proposal, and I'm not sure how to do that.

2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).

The only exception might be columns from a multi-column index. It
might be quite efficient I guess?

In the past it has been suggested to only do it for multi-column
indexes, but I find these days I find in some situations I prefer to
make individual indexes and let the bitmap scan code combine them. So
perhaps it would be best to let it be configured by the DBA.

Yes, I prefer individual indexes too.

The idea behind collecting cross-column stats for multi-column indexes
was that maybe we could 'append' this to the current functionality
(building the index or something like that) so that it does not
introduce significant performance problems.

3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.

http://mathworld.wolfram.com/Chi-SquaredTest.html

I think this would be good to include, if possible.

Actually, I wonder if the existing stats collection code could be
altered to attempt to calculate the correlation between columns as part
of its other work.

I guess that would be rather expensive - to compute a correlation you
need two passes, and you need to do that for each pair of columns. So
I'd be surprised if it were feasible (and effective).

Another thing is that you can compute correlation only for numeric
columns, so it's not possible to do that for city/ZIP code mentioned
above. More precisely - it's possible to do that (if you map strings to
numbers somehow), but I doubt you'll get useful results as the
assignment is rather random.

Well, you could ask the governments to assign ZIP codes to cities in
strictly alphabetical order, but I guess they'd say no.

4) Or we could store just those cells where expected and observed values
differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).

Compressing that grid would be useful, given that for many dimensions
most of the grid will not be interesting. In fact, storing the 20
largest values may be enough. Worth an experiment.

Not exactly just the largest values - rather values that are
significantly different from the expected values. Generally there are
two interesting cases

expected << observed - The optimizer may choose index scan, although
the seq scan would be better.

expected >> observed - The optimizer may choose seq scan, although
the index scan would be better.

regards
Tomas

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#3)
Re: proposal : cross-column stats

On 12.12.2010 15:43, Heikki Linnakangas wrote:

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my
presentation.

The way I think of that problem is that once you know the postcode,
knowing the city name doesn't add any information. The postcode implies
the city name. So the selectivity for "postcode = ? AND city = ?" should
be the selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B. Perhaps that could be measured by counting the number of
distinct values of column B for each value of column A, or something
like that. I don't know what the statisticians call that property, or if
there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires
knowledge about dependencies between individual values (working with
cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (number of cities for a given
ZIP code and number of ZIP codes for a given city). Collecting a good
stats about this is a bit tricky, but possible. What about collecting
this for the MCVs from both columns?

Tomas

#6 Florian Pflug
fgp@phlo.org
In reply to: Heikki Linnakangas (#3)
Re: proposal : cross-column stats

On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:

The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample.

The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the assumption or knowledge of B. The basic tool for working with conditional probabilities is Bayes' theorem, which states that

P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes

P(A) = P(A and B) / P(B) <=>
P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".
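To see why this matters for the postcode/city case, here is a toy numeric check in Python (the ZIP codes, city names, and row counts are invented purely for illustration):

```python
from collections import Counter

# toy table of (zip, city) rows; each zip implies exactly one city
rows = ([("10115", "Berlin")] * 50 + [("10243", "Berlin")] * 30 +
        [("80331", "Munich")] * 20)
n = len(rows)

p_zip  = Counter(z for z, _ in rows)["10115"] / n   # P(zip = 10115)
p_city = Counter(c for _, c in rows)["Berlin"] / n  # P(city = Berlin)
p_both = sum(1 for z, c in rows if z == "10115" and c == "Berlin") / n

print(p_both)          # 0.5 = P(zip): the zip fully implies the city
print(p_zip * p_city)  # 0.4: the independence assumption underestimates
```

Here P(city|zip) = 1, so P(zip and city) collapses to P(zip) alone, exactly as argued above, while the independence product is too small.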

I believe that measuring this by counting the number of distinct values of column B for each A is basically the right idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and compare that to the overall number of distinct values of "b"...

A (very) quick search on scholar.google.com for "estimate conditional probability" didn't turn up anything useful, but it's hard to believe that there isn't at least some literature on the subject.

best regards,
Florian Pflug

#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#6)
Re: proposal : cross-column stats

On 12.12.2010 17:33, Florian Pflug wrote:

On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:

The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample.

The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the assumption or knowledge of B. The basic tool for working with conditional probabilities is bayes' theorem which states that

P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes

P(A) = P(A and B) / P(B) <=>
P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".

I believe that measuring this by counting the number of distinct values of column B for each A is basically the right idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and compare that to the overall number of distinct values of "b"...

Good point!

Well, I was thinking about this too - generally this means creating a
contingency table with the MCVs as bins. Then you can compute these
interesting probabilities P(A and B). (OK, now I definitely look like
some contingency-table weirdo who tries to solve everything with a
contingency table. OMG!)

The question is - what are we going to do when the values in the query
are not in the MCV list? Is there some heuristic to estimate the
probability from the MCVs, or something like that? Could we use some
"average" probability or what?

Tomas

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

The way I think of that problem is that once you know the postcode, knowing
the city name doesn't add any information. The postcode implies the city
name. So the selectivity for "postcode = ? AND city = ?" should be the
selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B. Perhaps that could be measured by counting the number of distinct
values of column B for each value of column A, or something like that. I
don't know what the statisticians call that property, or if there's some
existing theory on how to measure that from a sample.

This is a good idea, but I guess the question is what you do next. If
you know that the "applicability" is 100%, you can disregard the
restriction clause on the implied column. And if it has no
implicatory power, then you just do what we do now. But what if it
has some intermediate degree of implicability?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#8)
Re: proposal : cross-column stats

On 13.12.2010 01:05, Robert Haas wrote:

This is a good idea, but I guess the question is what you do next. If
you know that the "applicability" is 100%, you can disregard the
restriction clause on the implied column. And if it has no
implicatory power, then you just do what we do now. But what if it
has some intermediate degree of implicability?

Well, I think you've missed the e-mail from Florian Pflug - he actually
pointed out that the 'implicativeness' Heikki mentioned is called
conditional probability. And conditional probability can be used to
express the "AND" probability we are looking for (the selectivity).

For two columns, this is actually pretty straightforward - as Florian
wrote, the equation is

P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

where P(B) may be estimated from the current histogram, and P(A|B) may
be estimated from the contingency (see the previous mails). And "P(A and
B)" is actually the value we're looking for.

Anyway, there really is no "intermediate" degree of applicability - it
just gives you the right estimate.

And AFAIR this is easily extensible to more than two columns, as

P(A and B and C) = P(A and (B and C)) = P(A|(B and C)) * P(B and C)

so it's basically a recursion.
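That recursion is just the chain rule of probability. A quick sanity check on toy boolean columns in Python (the data is made up; the identity holds exactly for observed frequencies, which is what the contingency table stores):

```python
import random

random.seed(1)
# three correlated boolean "columns": b tends to follow a, c is independent
rows = [(a, a or random.random() < 0.3, random.random() < 0.5)
        for a in (random.random() < 0.4 for _ in range(50000))]

def freq(pred):
    """Empirical probability of a predicate over the sampled rows."""
    return sum(1 for r in rows if pred(r)) / len(rows)

p_abc = freq(lambda r: r[0] and r[1] and r[2])
p_bc = freq(lambda r: r[1] and r[2])
n_bc = sum(1 for r in rows if r[1] and r[2])
n_abc = sum(1 for r in rows if r[0] and r[1] and r[2])
p_a_given_bc = n_abc / n_bc  # P(A | B and C) from the sample

# P(A and B and C) = P(A | B and C) * P(B and C)
print(p_abc, p_a_given_bc * p_bc)
```

The two printed values agree (up to rounding), confirming that the recursion only needs a conditional estimate at each level.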

Well, I hope my statements are really correct - it's been a few years
since I gained my degree in statistics ;-)

regards
Tomas

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#9)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 13.12.2010 01:05, Robert Haas wrote:

This is a good idea, but I guess the question is what you do next.  If
you know that the "applicability" is 100%, you can disregard the
restriction clause on the implied column.  And if it has no
implicatory power, then you just do what we do now.  But what if it
has some intermediate degree of implicability?

Well, I think you've missed the e-mail from Florian Pflug - he actually
pointed out that the 'implicativeness' Heikki mentioned is called
conditional probability. And conditional probability can be used to
express the "AND" probability we are looking for (selectiveness).

For two columns, this is actually pretty straightforward - as Florian
wrote, the equation is

  P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

Well, the question is what data you are actually storing. It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly. But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#6)
Re: proposal : cross-column stats

P(A|B) = P(A and B) / P(B).

Well, up to this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).

Sometimes that may return nearly 2x the actual selectivity, but in
general it's a reasonable estimate. Are there any severe failure cases
that produce much worse estimates?
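A quick numeric check of the s1+s2-s1*s2 formula in Python (toy data; the duplicated-column case is a deliberately extreme failure, not something from the optimizer):

```python
import random

random.seed(7)
rows = [(random.random(), random.random()) for _ in range(200000)]

s1 = sum(1 for a, _ in rows if a < 0.3) / len(rows)  # selectivity of A
s2 = sum(1 for _, b in rows if b < 0.3) / len(rows)  # selectivity of B
actual_or = sum(1 for a, b in rows if a < 0.3 or b < 0.3) / len(rows)
print(s1 + s2 - s1 * s2, actual_or)   # close for independent columns

# failure case: col_b duplicates col_a, so A and B are the same predicate
dup = [(a, a) for a, _ in rows]
actual_dup = sum(1 for a, b in dup if a < 0.3 or b < 0.3) / len(dup)
print(2 * s1 - s1 * s1, actual_dup)   # formula overestimates: ~0.51 vs ~0.30
```

So for independent columns the inclusion-exclusion formula is essentially exact, and even in the worst (fully dependent) case it stays below 2x the actual selectivity, matching the observation above.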

regards
Tomas

#12 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#10)
Re: proposal : cross-column stats

On 13.12.2010 03:00, Robert Haas wrote:

Well, the question is what data you are actually storing. It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly. But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

regards
Tomas

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#12)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 13.12.2010 03:00, Robert Haas wrote:

Well, the question is what data you are actually storing.  It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly.  But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)? Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Nathan Boley
npboley@gmail.com
In reply to: Robert Haas (#13)
Re: proposal : cross-column stats

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases.  For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)?

Not to mention that the model parameters will be impossible to estimate well.

( I've only scanned the thread, so sorry if Im repeating something
that's already been said )

My intuition is that storing the covariance structure will be
unworkable both technically and statistically for all but the simplest
of cases. That being said, I dont think the problem is a lack of ways
to parameterize the covariance structure ( there are several papers on
multi-dim histogram estimators, at least one of which talks about
databases explicitly, not to mention approaches like CART [1] ), but a
complete lack of infrastructure to do anything with the estimates. So
keep working on it ;-) ( but try to make the framework general enough
to allow better estimators ).

I wonder if a good first step might be to focus on the AND and OR
operators since they seem like the most common cases and union and
intersection commute ( although it's important to remember that the
estimate variances do NOT ) That is, what about estimating the
selectivity of the condition

WHERE X=A and Y=B

by

f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) +
x_2*selectivity(X = A)*selectivity( Y = B ) + x_3

where x_{1,2,3} are parameters to be estimated from the data.
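To make the shape of that estimator concrete, here is a minimal Python sketch (the function name, the clamping, and the sample values are invented purely for illustration):

```python
def combined_selectivity(sel_x, sel_y, x1, x2, x3):
    """Nathan's proposed form:
    f(A,B) = x1*(sel(X=A) + sel(Y=B)) + x2*sel(X=A)*sel(Y=B) + x3,
    clamped to the valid selectivity range [0, 1]."""
    est = x1 * (sel_x + sel_y) + x2 * (sel_x * sel_y) + x3
    return max(0.0, min(1.0, est))

# With x1 = 0, x2 = 1, x3 = 0 the formula degenerates to the
# independence assumption the planner uses today: sel(X=A) * sel(Y=B).
print(combined_selectivity(0.1, 0.2, 0.0, 1.0, 0.0))
```

Fitting x_{1,2,3} would presumably happen at ANALYZE time, e.g. by least squares over sampled value pairs.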

Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple years back ).

Best,
Nathan

[1]: http://en.wikipedia.org/wiki/Classification_and_regression_tree
[2]: http://en.wikipedia.org/wiki/Copula_%28statistics%29

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#13)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Dne 13.12.2010 03:00, Robert Haas napsal(a):

Well, the question is what data you are actually storing.  It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly.  But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)? Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

Yes, storing a complete contingency table is not feasible in such cases.
My original proposal actually did not address this particular issue
(cities and ZIP codes) as it was based on simplified contingency tables
(with bins corresponding to current histograms, not to individual values).
So the number of values to store would grow much more slowly.

On the other hand, this generalization really makes it unusable in some
cases, and the issue we're discussing here (cities and ZIP codes) is one
of them. I think in such cases we could build a contingency table for the
MCVs and then use it to estimate those conditional probabilities we need,
but I expect it to be very tricky.
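A sketch of what that MCV-restricted contingency table might look like (all names and frequencies below are invented; a real implementation would of course live in ANALYZE and the planner, not in Python):

```python
# Joint frequencies measured over the MCV lists of both columns.
mcv_joint = {
    ("Springfield", "12345"): 0.04,
    ("Springfield", "12346"): 0.01,
    ("Shelbyville", "54321"): 0.03,
}
# Per-column MCV frequencies, as already kept in single-column stats.
mcv_city = {"Springfield": 0.05, "Shelbyville": 0.03}
mcv_zip = {"12345": 0.04, "12346": 0.01, "54321": 0.03}

OTHER_FREQ = 0.001  # assumed frequency of a non-MCV value

def sel_city_and_zip(city, zip_code):
    """Use the joint MCV table when the pair is in it; otherwise fall
    back to the independence assumption the planner uses today."""
    if (city, zip_code) in mcv_joint:
        return mcv_joint[(city, zip_code)]
    return mcv_city.get(city, OTHER_FREQ) * mcv_zip.get(zip_code, OTHER_FREQ)
```

Note how the correlated pair ("Springfield", "12345") is estimated at 0.04 rather than the 0.05 * 0.04 = 0.002 the independence assumption would give.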

Thanks for the comments.

Tomas

#16Yeb Havinga
yebhavinga@gmail.com
In reply to: Robert Haas (#13)
Re: proposal : cross-column stats

On 2010-12-13 03:28, Robert Haas wrote:

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)? Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

A sparse matrix that holds only 'implicative' (P(A|B) <> P(A*B)?)
combinations? Also, some information might be deduced from others. For
Heikki's city/region example, for each city it would be known that it is
100% in one region. In that case it suffices to store only that
information, since 0% in all other regions can be deduced. I wouldn't be
surprised if storing implications like this would reduce the size to O(n).
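The O(n) idea could be sketched as a stored functional dependency from city to region (all data below invented for illustration):

```python
# Store only the implication "city C lies 100% in region R"; every
# other (city, region) pair is implicitly 0%, so n entries suffice
# instead of an m*n contingency table.
city_region = {"Helsinki": "Uusimaa", "Tampere": "Pirkanmaa"}

def sel_city_and_region(sel_city, city, region):
    """If the region clause is implied by the city clause, it adds no
    further restriction; if it contradicts the stored implication,
    nothing matches."""
    return sel_city if city_region.get(city) == region else 0.0
```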

regards,
Yeb Havinga

#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Yeb Havinga (#16)
Re: proposal : cross-column stats

On 2010-12-13 03:28, Robert Haas wrote:

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)? Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

A sparse matrix that holds only 'implicative' (P(A|B) <> P(A*B)?)
combinations? Also, some information might be deduced from others. For
Heikki's city/region example, for each city it would be known that it is
100% in one region. In that case it suffices to store only that
information, since 0% in all other regions can be deduced. I wouldn't be
surprised if storing implications like this would reduce the size to O(n).

OK, but I'll leave this for the future. My plan is to build a small PoC,
just to see whether the contingency-table + probability-estimates approach
works for the failure case mentioned by Heikki. I'd like to do this by
the end of this week, if possible.

I'll read the articles mentioned by Nathan Boley (thanks for those links;
if you have more of them, just let me know).

Once we have a solution that solves (or significantly improves) these
failure cases, we can make further plans (how to do that actually in the
code etc.).

BTW thanks for all the comments!

regards
Tomas

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#11)
Re: proposal : cross-column stats

Tomas Vondra <tv@fuzzy.cz> writes:

Well, until this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).

If you can solve the AND case, the OR case falls out of that. Just
replace s1*s2 with a more accurate AND calculation.
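In code form, this is just inclusion-exclusion with the AND estimate plugged in (function names invented for illustration):

```python
def sel_or(s1, s2, sel_and):
    # P(A or B) = s1 + s2 - P(A and B); the current planner
    # effectively plugs in s1*s2 for the last term.
    return s1 + s2 - sel_and(s1, s2)

independence = lambda s1, s2: s1 * s2
print(sel_or(0.3, 0.5, independence))  # 0.65
```

Any improved AND estimator can be passed in unchanged, which is why solving the AND case is the only hard part.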

regards, tom lane

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: proposal : cross-column stats

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases.

The reason that this wasn't done years ago is precisely that nobody's
figured out how to do it with a tolerable amount of stats data and a
tolerable amount of processing time (both at ANALYZE time and during
query planning). It's not hard to see what we'd ideally like to do;
it's getting from there to something useful in production that's hard.

regards, tom lane

#20Joshua Tolley
eggyknap@gmail.com
In reply to: Nathan Boley (#14)
Re: proposal : cross-column stats

On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:

Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple years back ).

Josh Tolley still looks at it occasionally, though time hasn't permitted any
sort of significant work for quite some time. The multicolstat branch on my
git.postgresql.org repository will create an empirical copula for each
multi-column index, and stick it in pg_statistic. It doesn't yet do anything
useful with that information, nor am I convinced it's remotely bug-free. In a
brief PGCon discussion with Tom a while back, it was suggested a good place
for the planner to use these stats would be clausesel.c, which is responsible
for handling clauses such as "...WHERE foo > 4 AND foo > 5".
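That example shows why clause-level handling matters (numbers below are invented): for redundant range clauses like "foo > 4 AND foo > 5", multiplying the individual selectivities underestimates badly, because the tighter clause alone determines the result.

```python
def independent_and(s1, s2):
    return s1 * s2      # the naive independence assumption

def redundant_and(s1, s2):
    return min(s1, s2)  # correct when one clause implies the other

s_gt4, s_gt5 = 0.6, 0.5  # assumed selectivities of foo > 4 and foo > 5
print(independent_and(s_gt4, s_gt5))  # 0.3 -- too low
print(redundant_and(s_gt4, s_gt5))    # 0.5
```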

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#18)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#19)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Joshua Tolley (#20)
#24Josh Berkus
josh@agliodbs.com
In reply to: Tomas Vondra (#22)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Josh Berkus (#24)
#26Pavel Stehule
pavel.stehule@gmail.com
In reply to: Josh Berkus (#24)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#25)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#3)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#28)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#23)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#29)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#31)
#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tomas Vondra (#31)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#32)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#33)
#36Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#35)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#36)
#38Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#37)
#39Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#38)
#40Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#19)
#41Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Simon Riggs (#40)
#42Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#39)
#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#42)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Florian Pflug (#42)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#44)
#46Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#43)
#47Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#46)
#48Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#47)
#49Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#45)
#50Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#48)
#51Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#49)
#52Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#49)
#53Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#51)
#54Florian Pflug
fgp@phlo.org
In reply to: Tomas Vondra (#52)
#55Nicolas Barbier
nicolas.barbier@gmail.com
In reply to: Florian Pflug (#54)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Nicolas Barbier (#55)
#57Florian Pflug
fgp@phlo.org
In reply to: Nicolas Barbier (#55)
#58Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#56)
#59Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#57)
#60Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#54)
#61Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#57)