two dimensional statistics in Postgres

Started by Katharina Büchse · about 11 years ago · 9 messages
#1Katharina Büchse
katharina.buechse@uni-jena.de

Hi,

I'm a PhD student at the University of Jena, Thüringen, Germany, in the
field of databases, more precisely query optimization.
I want to implement a system in PostgreSQL that detects column
correlations and creates statistical data about correlated columns for
the optimizer. For this I need to store two-dimensional statistics
(especially two-dimensional histograms) in PostgreSQL.
I had a look at the description of "WIP: multivariate statistics / proof
of concept", which looks really promising; I guess these statistics are
based on scans of the data? For my system I need both -- statistical
data based on table scans (actually, samples are enough) and statistics
based on query feedback. The query feedback (tuple counts and, speaking
a little inaccurately, the WHERE part of the query itself) needs to be
extracted, and the optimizer needs a way to decide when to use
multivariate statistics and when to use the one-dimensional ones. Oracle
in this case simply disables one-dimensional histograms if there is
already a multidimensional histogram, but this is not always useful,
especially in the case of a feedback-based histogram (which might not
cover the whole data space). I want to use both kinds of histograms
because correlations might occur only in parts of the data. In this case
a histogram based on a sample of the whole table might miss the
correlation and wouldn't help for the part of the data the user seems to
be interested in.
There are special data structures for storing multidimensional
histograms based on feedback, and I have already tried to implement one
of them in C. In the case of two dimensions they are of course not "for
free" (one-dimensional would be much cheaper), but based on the
principle of maximum entropy they deliver really good results. I decided
on only two dimensions because in this case we have the best ratio of
cost to benefit when searching for correlations (here I'm relying on
tests that were made in DB2 within a project called CORDS, which detects
correlations even between different tables).
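
For the curious, the kind of structure I'm experimenting with boils down
to something like the following C sketch (all names are mine, purely
illustrative, not taken from any existing patch):

    /* One bucket of a two-dimensional feedback histogram: a rectangle
     * in the (columnA, columnB) space plus the fraction of tuples that
     * fall into it. Illustrative sketch only. */
    typedef struct FeedbackBucket2D
    {
        double  lowA, highA;    /* boundaries in the first dimension */
        double  lowB, highB;    /* boundaries in the second dimension */
        double  frequency;      /* estimated fraction of tuples inside */
    } FeedbackBucket2D;

    /* Query feedback used to refine the buckets: the predicate
     * rectangle a query touched and the number of tuples it actually
     * returned. Bucket frequencies are then fitted to all such
     * observations by choosing the maximum-entropy distribution that
     * is consistent with them. */
    typedef struct QueryFeedback2D
    {
        double  lowA, highA, lowB, highB;
        double  observed_rows;
    } QueryFeedback2D;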

I'd be grateful for any advice and discussion.
Regards,

Katharina


#2Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Katharina Büchse (#1)
Re: two dimensional statistics in Postgres

On 06/11/14 23:15, Katharina Büchse wrote:

[full quote of message #1 snipped]

Could you store a 2-dimensional histogram in a one-dimensional array:
A[z] = value, where z = col * rowSize + row (with zero-based indexing)?
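
Roughly like this in C, say (a minimal sketch with made-up names, where
rowSize is the number of rows, so each column occupies a contiguous
stride):

    #include <stdio.h>

    #define NROWS 4                        /* buckets in dimension 1 */
    #define NCOLS 3                        /* buckets in dimension 2 */

    static double buckets[NROWS * NCOLS];  /* the flattened histogram */

    /* z = col * rowSize + row, zero-based */
    static double *bucket_at(int row, int col)
    {
        return &buckets[col * NROWS + row];
    }

    int main(void)
    {
        *bucket_at(2, 1) = 0.25;           /* store one cell... */
        printf("%g\n", *bucket_at(2, 1));  /* ...and read it back */
        return 0;
    }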

Cheers,
Gavin


#3Tomas Vondra
tv@fuzzy.cz
In reply to: Katharina Büchse (#1)
Re: two dimensional statistics in Postgres

Hi,

On 6 November 2014 11:15, Katharina Büchse wrote:

Hi,

I'm a PhD student at the University of Jena, Thüringen, Germany, in the
field of databases, more precisely query optimization.
I want to implement a system in PostgreSQL that detects column
correlations and creates statistical data about correlated columns for
the optimizer. For this I need to store two-dimensional statistics
(especially two-dimensional histograms) in PostgreSQL.

Cool!

I had a look at the description of "WIP: multivariate statistics / proof
of concept", which looks really promising; I guess these statistics are
based on scans of the data? For my system I need both -- statistical

Yes, it's based on a sample of the data.

data based on table scans (actually, samples are enough) and statistics
based on query feedback. The query feedback (tuple counts and, speaking
a little inaccurately, the WHERE part of the query itself) needs to be
extracted, and the optimizer needs a way to decide when to use
multivariate statistics and when to use the one-dimensional ones. Oracle
in this case simply disables one-dimensional histograms if there is
already a multidimensional histogram, but this is not always useful,
especially in the case of a feedback-based histogram (which might not
cover the whole data space). I want to use both kinds of histograms

What do you mean by not covering the whole data space? I assume that when
building a feedback-based histogram, parts of the data will be filtered
out because of WHERE clauses etc. Is that what you mean? I don't see how
this could happen for regular histograms, though.

because correlations might occur only in parts of the data. In this case
a histogram based on a sample of the whole table might miss the
correlation and wouldn't help for the part of the data the user seems to
be interested in.

Yeah, there may be dependencies that are difficult to spot in the whole
dataset, but emerge once you filter to a specific subset.

Now, how would that work in practice? Initially the query needs to be
planned using regular stats (because there's no feedback yet), and then -
when we decide the estimates are way off - may be re-planned using the
feedback. The feedback is inherently query-specific, so I'm not sure if
it's possible to reuse it for multiple queries (might be possible for
"sufficiently similar" ones).

Would this be done automatically for all queries / all conditions, or only
when specifically enabled (on a table, columns, ...)?

There are special data structures for storing multidimensional
histograms based on feedback, and I have already tried to implement one
of them in C. In the case of two dimensions they are of course not "for
free" (one-dimensional would be much cheaper), but based on the
principle of maximum entropy they deliver really good results. I decided
on only two dimensions because in this case we have the best ratio of
cost to benefit when searching for correlations (here I'm relying on

I think hardcoding the two-dimensions limit is wrong. I understand a
higher number of dimensions means a more expensive operation, but if the
user can influence it, I believe it's OK.

Also, is there any particular reason why not to support other kinds of
stats (say, MCV lists)? In the end it's just a different way to
approximate the distribution, and it may be way cheaper than histograms.

tests that were made in DB2 within a project called CORDS, which detects
correlations even between different tables).

Is this somehow related to LEO? I'm not familiar with the details, but
from the description it might be related.

regards
Tomas


#4Tomas Vondra
tv@fuzzy.cz
In reply to: Gavin Flower (#2)
Re: two dimensional statistics in Postgres

On 6 November 2014 11:50, Gavin Flower wrote:

Could you store a 2-dimensional histogram in a one-dimensional array:
A[z] = value, where z = col * rowSize + row (with zero-based indexing)?

How would that work for columns with different data types?

Tomas


#5Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Tomas Vondra (#4)
Re: two dimensional statistics in Postgres

On 06/11/14 23:57, Tomas Vondra wrote:

On 6 November 2014 11:50, Gavin Flower wrote:

Could you store a 2-dimensional histogram in a one-dimensional array:
A[z] = value, where z = col * rowSize + row (with zero-based indexing)?

How would that work for columns with different data types?

Tomas

I implicitly assumed that all cells were the same size and type. However,
I could devise a scheme to cope with columns of different types, given
the relevant definitions - but this would obviously be more complicated.

Also, this method can be extended to higher dimensions.
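
A hypothetical C sketch of the generalization (not tied to any existing
Postgres code):

    /* Flatten an N-dimensional bucket index into a single array offset,
     * generalizing z = col * rowSize + row to any number of dimensions.
     * dims[d] is the bucket count in dimension d, idx[d] the position. */
    static long flat_index(int ndims, const int dims[], const int idx[])
    {
        long z = 0;
        int  d;

        for (d = 0; d < ndims; d++)
            z = z * dims[d] + idx[d];   /* Horner-style accumulation */
        return z;
    }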

Cheers,
Gavin


#6Tomas Vondra
tv@fuzzy.cz
In reply to: Gavin Flower (#5)
Re: two dimensional statistics in Postgres

On 6 November 2014 12:05, Gavin Flower wrote:

On 06/11/14 23:57, Tomas Vondra wrote:

On 6 November 2014 11:50, Gavin Flower wrote:

Could you store a 2-dimensional histogram in a one-dimensional array:
A[z] = value, where z = col * rowSize + row (with zero-based indexing)?

How would that work for columns with different data types?

Tomas

I implicitly assumed that all cells were the same size and type. However,
I could devise a scheme to cope with columns of different types, given
the relevant definitions - but this would obviously be more complicated.

Also, this method can be extended to higher dimensions.

Which is what I did in the "WIP: multivariate stats" ;-)

Tomas


#7Katharina Büchse
katharina.buechse@uni-jena.de
In reply to: Tomas Vondra (#3)
Re: two dimensional statistics in Postgres

Ahoj ;-)

On 06.11.2014 11:56, Tomas Vondra wrote:

Hi,

On 6 November 2014 11:15, Katharina Büchse wrote:

Hi,

I'm a PhD student at the University of Jena, Thüringen, Germany, in the
field of databases, more precisely query optimization.
I want to implement a system in PostgreSQL that detects column
correlations and creates statistical data about correlated columns for
the optimizer. For this I need to store two-dimensional statistics
(especially two-dimensional histograms) in PostgreSQL.

Cool!

I had a look at the description of "WIP: multivariate statistics / proof
of concept", which looks really promising; I guess these statistics are
based on scans of the data? For my system I need both -- statistical

Yes, it's based on a sample of the data.

data based on table scans (actually, samples are enough) and statistics
based on query feedback. The query feedback (tuple counts and, speaking
a little inaccurately, the WHERE part of the query itself) needs to be
extracted, and the optimizer needs a way to decide when to use
multivariate statistics and when to use the one-dimensional ones. Oracle
in this case simply disables one-dimensional histograms if there is
already a multidimensional histogram, but this is not always useful,
especially in the case of a feedback-based histogram (which might not
cover the whole data space). I want to use both kinds of histograms

What do you mean by not covering the whole data space? I assume that when
building a feedback-based histogram, parts of the data will be filtered
out because of WHERE clauses etc. Is that what you mean? I don't see how
this could happen for regular histograms, though.

Yes, you're right. Because of the WHERE clauses, some parts of the data
might be filtered out in feedback-based histograms. This cannot happen
with "regular" histograms, but as I mentioned -- I would like to use both
kinds of histograms.

because correlations might occur only in parts of the data. In this case
a histogram based on a sample of the whole table might miss the
correlation and wouldn't help for the part of the data the user seems to
be interested in.

Yeah, there may be dependencies that are difficult to spot in the whole
dataset, but emerge once you filter to a specific subset.

Now, how would that work in practice? Initially the query needs to be
planned using regular stats (because there's no feedback yet), and then -
when we decide the estimates are way off - may be re-planned using the
feedback. The feedback is inherently query-specific, so I'm not sure if
it's possible to reuse it for multiple queries (might be possible for
"sufficiently similar" ones).

Would this be done automatically for all queries / all conditions, or only
when specifically enabled (on a table, columns, ...)?

The idea is the following: I want to detect correlations with two
different algorithms, one scanning samples of the data, the other
analyzing query feedback. If both decide for a column combination that
it is correlated, then a "regular" histogram should be built for this
combination. If only the "scanning" algorithm says "correlated", then
there is some correlation, but it is not interesting for the user right
now. So I would only "leave a note" for the optimizer that there is
correlation, and if the user's interest changes and query results differ
a lot from the estimates in the plan, then again -- "regular histogram".
If only the "feedback" algorithm decides that the columns are
correlated, a histogram based on query feedback is the most useful
choice to support the work of the optimizer.
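
As a rough sketch in C (hypothetical names, just to make the decision
explicit):

    #include <stdbool.h>

    /* What to do for one column pair, given the "votes" of the two
     * detection algorithms. Purely illustrative, not from any patch. */
    typedef enum
    {
        DO_NOTHING,      /* neither algorithm found correlation */
        LEAVE_NOTE,      /* note it; build later if estimates degrade */
        BUILD_REGULAR,   /* build a regular 2-D histogram now */
        BUILD_FEEDBACK   /* build a feedback-based histogram */
    } Action;

    static Action decide(bool scan_correlated, bool feedback_correlated)
    {
        if (scan_correlated && feedback_correlated)
            return BUILD_REGULAR;
        if (scan_correlated)
            return LEAVE_NOTE;
        if (feedback_correlated)
            return BUILD_FEEDBACK;
        return DO_NOTHING;
    }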

There are special data structures for storing multidimensional
histograms based on feedback, and I have already tried to implement one
of them in C. In the case of two dimensions they are of course not "for
free" (one-dimensional would be much cheaper), but based on the
principle of maximum entropy they deliver really good results. I decided
on only two dimensions because in this case we have the best ratio of
cost to benefit when searching for correlations (here I'm relying on

I think hardcoding the two-dimensions limit is wrong. I understand a
higher number of dimensions means a more expensive operation, but if the
user can influence it, I believe it's OK.

I don't know whether the user should have to decide whether the
statistical data is based on feedback or on data scans. I guess it's
enough if they get their higher-dimensional histograms based on data
scans.

Also, is there any particular reason why not to support other kinds of
stats (say, MCV lists)? In the end it's just a different way to
approximate the distribution, and it may be way cheaper than histograms.

The reason actually is just that 1) I have only limited time and cannot
cover every possibility to support the optimizer when there is
correlation, and 2) there are good papers about feedback-based
histograms :-D

tests that were made in DB2 within a project called CORDS, which detects
correlations even between different tables).

Is this somehow related to LEO? I'm not familiar with the details, but
from the description it might be related.

Actually, LEO is purely feedback-based, while CORDS uses data scans.
But some authors were involved in both projects, I guess. LEO itself
never made it to be fully integrated into DB2, but some parts of it did.
What's interesting for me is the fact that in DB2 there's no possibility
for multidimensional histograms.

regards
Tomas

Katharina


#8Tomas Vondra
tv@fuzzy.cz
In reply to: Katharina Büchse (#7)
Re: two dimensional statistics in Postgres

On 7.11.2014 13:19, Katharina Büchse wrote:

On 06.11.2014 11:56, Tomas Vondra wrote:

On 6 November 2014 11:15, Katharina Büchse wrote:

because correlations might occur only in parts of the data. In this case
a histogram based on a sample of the whole table might miss the
correlation and wouldn't help for the part of the data the user seems to
be interested in.

Yeah, there may be dependencies that are difficult to spot in the whole
dataset, but emerge once you filter to a specific subset.

Now, how would that work in practice? Initially the query needs to be
planned using regular stats (because there's no feedback yet), and then -
when we decide the estimates are way off - may be re-planned using the
feedback. The feedback is inherently query-specific, so I'm not sure if
it's possible to reuse it for multiple queries (might be possible for
"sufficiently similar" ones).

Would this be done automatically for all queries / all conditions, or
only when specifically enabled (on a table, columns, ...)?

The idea is the following: I want to detect correlations with two
different algorithms, one scanning samples of the data, the other
analyzing query feedback.

So you're starting with the default (either single or multivariate)
statistics, computed by ANALYZE from a sample (covering the whole
table). And then compute matching statistics while running the query
(i.e. a sample filtered by the WHERE clauses)?

If both decide for a column combination that it is correlated, then a
"regular" histogram should be built for this combination. If only the
"scanning" algorithm says "correlated", then there is some correlation,
but it is not interesting for the user right now.

Isn't it sufficient to simply compare the estimated and observed number
of rows? Either they're sufficiently close, in which case the existing
stats are good enough (either the columns are independent, or there are
appropriate multivariate stats) - so additional stats are not necessary.
Or the estimates are way off, meaning either there are no multivariate
stats, or they are not suitable for this query.

Or do you plan to compute some other stats, possibly more complex,
allowing for more complicated correlation detection?
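
By "comparing" I mean something as simple as this C sketch (the
factor-10 threshold is arbitrary, just for illustration):

    #include <stdbool.h>

    /* Does the planner estimate differ from the observed row count by
     * more than a factor of 10 (in either direction)? */
    static bool estimate_is_way_off(double estimated, double observed)
    {
        double ratio;

        if (estimated < 1.0)
            estimated = 1.0;    /* clamp to avoid division by ~zero */
        if (observed < 1.0)
            observed = 1.0;

        ratio = (estimated > observed) ? estimated / observed
                                       : observed / estimated;
        return ratio > 10.0;
    }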

So I would only "leave a note" for the optimizer that there is
correlation, and if the user's interest changes and query results differ
a lot from the estimates in the plan, then again -- "regular histogram".
If only the "feedback" algorithm decides that the columns are
correlated, a histogram based on query feedback is the most useful
choice to support the work of the optimizer.

So the check is performed only when the query completes, and if there's
a mismatch then you make a note somewhere that those columns are
correlated?

I think what exactly is stored in the "note" is crucial. If it's only a
list of columns and an "iscorrelated" flag, then I'm not sure how the
optimizer is going to use that directly, i.e. without an actual
histogram or at least corrected estimates.

Actually, I'm starting to wonder whether I understand the idea. I'm
aware of the following two kinds of "feedback histograms":

a) building the full histogram from query feedback (STGrid/STHoles)

The main goal is to build and refine a histogram, without
examining the data set directly (not even sampling it).

b) optimizing the set of histograms wrt. to workload (SASH)

Decides what histograms to build, with respect to the observed
workload (i.e. executed queries).

Is the presented approach similar to one of these, or rather something
different?

Actually, the "Introduction" in the CORDS paper says this:

CORDS is a data-driven tool that automatically discovers
correlations and soft functional dependencies (fds) between pairs
of columns and, based on these relationships, recommends a set of
statistics for the query optimizer to maintain.

which seems like option (b). I haven't read the rest of the paper,
though.

There are special data structures for storing multidimensional
histograms based on feedback, and I have already tried to implement
one of them in C. In the case of two dimensions they are of course
not "for free" (one-dimensional would be much cheaper), but based
on the principle of maximum entropy they deliver really good
results. I decided on only two dimensions because in this case we
have the best ratio of cost to benefit when searching for
correlations (here I'm relying on

I think hardcoding the two-dimensions limit is wrong. I understand
a higher number of dimensions means a more expensive operation, but
if the user can influence it, I believe it's OK.

I don't know whether the user should have to decide whether the
statistical data is based on feedback or on data scans. I guess it's
enough if they get their higher-dimensional histograms based on data
scans.

I wasn't talking about deciding whether to use regular or feedback stats
(although I believe features like this should be opt-in), but about
tweaking the number of dimensions.

For example imagine there are three columns [a,b,c] and you know the
data is somehow correlated. Is it better to build three 2-dimensional
feedback histograms (a,b), (a,c) and (b,c), or a single 3-dimensional
histogram (a,b,c) or maybe something else?

Also, is there any particular reason why not to support other kinds
of stats (say, MCV lists)? In the end it's just a different way to
approximate the distribution, and it may be way cheaper than
histograms.

The reason actually is just that 1) I have only limited time and
cannot cover every possibility to support the optimizer when there
is correlation, and 2) there are good papers about feedback-based
histograms :-D

OK, understood. Those are completely valid reasons. I was just curious
whether there's some reason that makes the extension to more dimensions
impossible.

tests that were made in DB2 within a project called CORDS, which
detects correlations even between different tables).

Is this somehow related to LEO? I'm not familiar with the details,
but from the description it might be related.

Actually, LEO is purely feedback-based, while CORDS uses data
scans. But some authors were involved in both projects, I guess. LEO
itself never made it to be fully integrated into DB2, but some parts
of it did. What's interesting for me is the fact that in DB2 there's
no possibility for multidimensional histograms.

OK, so the point is to optimize the set of histograms, similar to what
CORDS does, but based on feedback. Correct?

Tomas


#9Tomas Vondra
tv@fuzzy.cz
In reply to: Katharina Büchse (#1)
Re: two dimensional statistics in Postgres

On 8.11.2014 15:40, Katharina Büchse wrote:

I'm sorry if I repeated myself too often; I somehow started answering
at the end of your mail and then went up... I promise to do this
better next time.

Nah, no problem. Better say something twice than not at all ;-)

However, I see you've responded to me directly (not through the
pgsql-hackers list). I assume that's not on purpose, so I'm adding the
list back into the loop ...

On 07.11.2014 20:37, Tomas Vondra wrote:

On 7.11.2014 13:19, Katharina Büchse wrote:

On 06.11.2014 11:56, Tomas Vondra wrote:

On 6 November 2014 11:15, Katharina Büchse wrote:

because correlations might occur only in parts of the data.
In this case a histogram based on a sample of the whole table
might miss the correlation and wouldn't help for the part of the
data the user seems to be interested in.

Yeah, there may be dependencies that are difficult to spot in the whole
dataset, but emerge once you filter to a specific subset.

Now, how would that work in practice? Initially the query needs
to be planned using regular stats (because there's no feedback
yet), and then - when we decide the estimates are way off - may
be re-planned using the feedback. The feedback is inherently
query-specific, so I'm not sure if it's possible to reuse it
for multiple queries (might be possible for "sufficiently
similar" ones).

Would this be done automatically for all queries / all conditions, or
only when specifically enabled (on a table, columns, ...)?

The idea is the following: I want to detect correlations with
two different algorithms, one scanning samples of the data,
the other analyzing query feedback.

So you're starting with the default (either single or
multivariate) statistics, computed by ANALYZE from a sample
(covering the whole table).

yes

And then compute matching statistics while running the query (i.e.
a sample filtered by the WHERE clauses)?

Well, not really... I would like to emphasize that my system consists
of two parts:

1) finding correlations automatically
2) creating histograms for correlated columns.

In (1) we use two different approaches, one of them feedback-based, but
even this approach has to collect a fair amount of feedback data before
it can decide whether there is correlation or not. So "while running the
query" is not 100% correct. While running the query I would extract the
feedback data the algorithm needs, which is the tuple count and the
constraint that the query placed on the data. Constraints would look
like "columnA = 'a' AND columnB = 'b'", and the more different queries
with different constraints we have, the better. And yes, when this
algorithm starts analyzing, a check needs to be done to choose
consistent feedback data. So if there were several queries on two
columns of a table, but the data changed while these queries took place,
we cannot use all of the feedback we have.
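
In code, the feedback record extracted per query would look roughly
like this (a C sketch with made-up names, nothing more):

    /* Feedback from one executed query: the conjunctive equality
     * constraints (e.g. columnA = 'a' AND columnB = 'b') plus the
     * observed tuple count. Illustrative only. */
    typedef struct ColumnConstraint
    {
        const char *column;     /* e.g. "columnA" */
        const char *value;      /* e.g. "a" */
    } ColumnConstraint;

    typedef struct QueryFeedback
    {
        ColumnConstraint constraints[2]; /* the two-column WHERE part */
        long             tuple_count;    /* rows actually returned */
        long             data_version;   /* lets us discard feedback
                                          * made stale by data changes */
    } QueryFeedback;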

OK, thanks for the explanation!

If both decide for a column combination that it is correlated,
then a "regular" histogram should be built for this
combination. If only the "scanning" algorithm says "correlated",
then there is some correlation, but it is not
interesting for the user right now.

Isn't it sufficient to simply compare the estimated and observed
number of rows?

This sounds like the easiest way, but it has many disadvantages. Just
imagine that the statistical information is based on data that has
already changed. The estimate could be totally wrong (even if we
automatically update statistics after a change of, let's say, 10%,
this might happen, because the user could ask for exactly the part
which changed; if there was "only" a change of maybe 8%, the
histogram would still be the old one), but this has nothing to do
with correlation. If we then always decided to build a
multidimensional histogram, it would mean doing unnecessary work,
because creating and maintaining multidimensional histograms is more
expensive than one-dimensional ones.

Good point. IMHO stale stats are a problem in general, and here it may
clearly cause "false positives" if the algorithm is not careful enough.

But if an algorithm (which checked the data for correlations) says
that there really is correlation, then the fact that estimates and
query results differ a lot from one another could be a good occasion
to really create a histogram. Of course this decision could be based
on the same "mistake" that I just described, but we have already
limited this wrong decision to the case where one algorithm "voted"
for correlation.

Understood. I believe I understand the general goal, although I don't
have a clear idea how to implement that, or how complex it could get.
But I guess that's not a problem ... clearly you have a plan ;-)

So I would only "leave a note" for the optimizer that there
is correlation, and if the user's interest changes and query results
differ a lot from the estimates in the plan, then again --
"regular histogram". If only the "feedback" algorithm decides
that the columns are correlated, a histogram based on query
feedback is the most useful choice to support the work of the
optimizer.

So the check is performed only when the query completes, and if
there's a mismatch then you make a note somewhere that those columns
are correlated?

No, I would want the algorithm (which is now the one based on
samples) to store its note as soon as it finds correlation (which
it does by analyzing -- doing some magic mathematics on -- samples of
the data). The thing is -- if the other algorithm, which checks for
correlation by analyzing (also magic mathematics...) feedback data,
doesn't vote for correlation, we don't need the histogram right now,
because the user is not interested in the correlated part of the
table. But there must be a note that histograms could become
necessary in the future.

OK. I quickly skimmed through the ISOMER paper (from ICDE 2006), and
this seems to match the "Figure 1", with steps like

* add new feedback
* detect inconsistent feedback
* eliminate unimportant feedback
* compute final histogram

Seems quite close to what's discussed in this thread, correct?
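
In pseudo-C, I read those steps as a loop along these lines (purely
schematic -- every type and function here is a placeholder, not a real
API):

    #include <stddef.h>

    typedef struct Histogram Histogram;
    typedef struct Feedback Feedback;
    typedef struct FeedbackStore FeedbackStore;

    Feedback *next_feedback(FeedbackStore *store);
    void add_feedback(Histogram *h, Feedback *fb);
    void drop_inconsistent_feedback(Histogram *h, FeedbackStore *store);
    void drop_unimportant_feedback(Histogram *h, FeedbackStore *store);
    void solve_maximum_entropy(Histogram *h);

    static void refine_histogram(Histogram *h, FeedbackStore *store)
    {
        Feedback *fb;

        while ((fb = next_feedback(store)) != NULL)
            add_feedback(h, fb);              /* add new feedback */

        drop_inconsistent_feedback(h, store); /* detect inconsistent */
        drop_unimportant_feedback(h, store);  /* eliminate unimportant */
        solve_maximum_entropy(h);             /* compute final histogram */
    }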

I think what exactly is stored in the "note" is crucial. If it's
only a list of columns and an "iscorrelated" flag, then I'm not sure
how the optimizer is going to use that directly, i.e. without an
actual histogram or at least corrected estimates.

As I mentioned -- the optimizer won't use this note directly,
because right now it doesn't need it. Of course we suppose that users
are lazy and don't change their interests too often. But if they do
and estimates start to get worse, then the "iscorrelated" flag
tells us that we should create a histogram as soon as possible.

OK, understood.

Actually, I'm starting to wonder whether I understand the idea. I'm
aware of the following two kinds of "feedback histograms":

a) building the full histogram from query feedback (STGrid/STHoles)

The main goal is to build and refine a histogram, without
examining the data set directly (not even sampling it).

b) optimizing the set of histograms wrt. to workload (SASH)

Decides what histograms to build, with respect to the observed
workload (i.e. executed queries).

Is the presented approach similar to one of these, or rather something
different?

Actually, the "Introduction" in the CORDS paper says this:

CORDS is a data-driven tool that automatically discovers
correlations and soft functional dependencies (fds) between pairs
of columns and, based on these relationships, recommends a set of
statistics for the query optimizer to maintain.

which seems like option (b). I haven't read the rest of the paper,
though.

It's something in between, I would say. I would like to use ISOMER
(which is a further development of STHoles) to create feedback-based
histograms, but the decision which histograms to build would rely
on algorithms for finding correlations. One of them is
feedback-based...

OK, thanks for the clarifications.

I don't know whether the user should have to decide whether the
statistical data is based on feedback or on data scans. I guess
it's enough if they get their higher-dimensional histograms based
on data scans.

I wasn't talking about deciding whether to use regular or feedback
stats (although I believe features like this should be opt-in), but
about tweaking the number of dimensions.

For example imagine there are three columns [a,b,c] and you know
the data is somehow correlated. Is it better to build three
2-dimensional feedback histograms (a,b), (a,c) and (b,c), or a
single 3-dimensional histogram (a,b,c) or maybe something else?

That's a good question. When the correlation really is across all
three columns (for example, in an employee table with information
about the car an employee is using, which might depend on their
income and their family status), it would of course be better to
have a three-dimensional histogram. And if the user were able to
point out these dependencies, it would be a pity if there was no
way in the database system to store this information and create a
histogram for all these columns. But if we talk about finding these
dependencies automatically (and that's my aim), there will be very
(or even too) many possibilities to combine 3 columns...

I think that 'too many possibilities' really depends on the context. For
example, we're working with hundreds of gigabytes of data, and
misestimates are a big issue for us, occasionally causing queries to run
for hours instead of seconds. We're perfectly fine with spending a few
more minutes on analyzing the stats / planning, because the gains
outweigh the expenses. But clearly that's not a universal truth, which
is why I was asking about making it possible to tweak this.

That being said, I'm perfectly fine with limiting the scope of the
effort by explicitly supporting just 2 dimensions.

tests that were made in DB2 within a project called CORDS,
which detects correlations even between different tables).

Is this somehow related to LEO? I'm not familiar with the
details, but from the description it might be related.

Actually, LEO is purely feedback-based, while CORDS uses
data scans. But some authors were involved in both projects, I
guess. LEO itself never made it to be fully integrated into DB2,
but some parts of it did. What's interesting for me is the fact
that in DB2 there's no possibility for multidimensional
histograms.

OK, so the point is to optimize the set of histograms, similar to
what CORDS does, but based on feedback. Correct?

Tomas

Actually, the point is to help the optimizer decide on better
plans, and therefore first find correlations (in two different ways -
with feedback and with CORDS) and then create new histograms.

I think that's what I meant by "optimizing the set of histograms" (i.e.
creating new histograms, etc.).

regards
Tomas
