Cross-table statistics idea

Started by Jim C. Nasby · 4 messages
#1 Jim C. Nasby
jim@nasby.net

Since I don't recall any ideas ever having been thrown out on how to do
this...

ISTM that we could gain additional insight on how many rows would likely
result from a join by comparing the "shape" of the histogram for the
joining columns. For example, if the histogram arrays were exactly
identical, we're essentially stuck looking at the ratio of reltuples
between the two tables. (AFAIK that's the only estimate we make today.)
If one histogram ended at a value smaller than the start of the other
histogram, we would estimate that no rows would result from an equal
join.
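[To make the idea concrete, here is a toy sketch of what comparing histogram "shapes" could look like. The function name and the numeric-bounds representation are hypothetical; this is not planner code, just an illustration of the two cases described above.]

```python
def join_rows_from_histograms(hist_a, hist_b, rows_a, rows_b):
    """Rough equi-join row estimate from two sorted histogram bound
    arrays; a toy illustration of the idea above, not planner code."""
    # Disjoint ranges: an equality join can produce no rows.
    if hist_a[-1] < hist_b[0] or hist_b[-1] < hist_a[0]:
        return 0.0
    n_buckets = max(len(hist_a) - 1, 1)
    # Identical histograms: fall back to a ratio-of-reltuples style
    # estimate, spreading matches evenly over the buckets.
    if hist_a == hist_b:
        return rows_a * rows_b / n_buckets
    # Otherwise scale each side by the fraction of its range that
    # overlaps the other (a crude proxy for shared values).
    lo, hi = max(hist_a[0], hist_b[0]), min(hist_a[-1], hist_b[-1])
    span_a = hist_a[-1] - hist_a[0]
    span_b = hist_b[-1] - hist_b[0]
    frac_a = (hi - lo) / span_a if span_a else 1.0
    frac_b = (hi - lo) / span_b if span_b else 1.0
    return rows_a * frac_a * rows_b * frac_b / n_buckets
```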

Am I right about how our estimates work right now? Where can I look in
the code? Has anyone looked down this path in the past?
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#1)
Re: Cross-table statistics idea

"Jim C. Nasby" <jim@nasby.net> writes:

ISTM that we could gain additional insight on how many rows would likely
result from a join by comparing the "shape" of the histogram for the
joining columns.

eqjoinsel already does this for the case of comparing the MCV lists.
If you're serious about using the histograms: well, maybe, but it seems
to require far stronger assumptions about the behavior of the datatype
than we use now.
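[For readers unfamiliar with eqjoinsel: at a high level, the MCV-matching step estimates the fraction of join pairs contributed by values appearing in both columns' most-common-value lists. A much-simplified sketch, with hypothetical names and ignoring the non-MCV remainder that the real code also handles:]

```python
def mcv_join_selectivity(mcv_a, mcv_b):
    """Simplified sketch of MCV-list matching: each value present in
    both lists contributes the product of its two frequencies to the
    join selectivity. mcv_* are lists of (value, frequency) pairs."""
    freq_b = dict(mcv_b)
    return sum(f_a * freq_b[v] for v, f_a in mcv_a if v in freq_b)
```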

Am I right about how our estimates work right now? Where can I look in
the code? Has anyone looked down this path in the past?

src/backend/utils/adt/selfuncs.c

regards, tom lane

#3 Simon Riggs
simon@2ndquadrant.com
In reply to: Jim C. Nasby (#1)
Re: Cross-table statistics idea

On Tue, 2006-09-26 at 21:27 -0500, Jim C. Nasby wrote:

Since I don't recall any ideas ever having been thrown out on how to do
this...

ISTM that we could gain additional insight on how many rows would likely
result from a join

One thing we can do is to use cross-column relationships to improve the
estimation of Ndistinct.

If we have a table
Order_line (PK orderId, lineNum)

If we look at lineNum and see it has on average 10 values per orderId,
we can then use this information to compute that Ndistinct for orderId
should be -0.1, i.e. the number of distinct values is proportional to
the number of rows, at one-tenth of the row count.

Right now if there are more than 10 lineNums per orderId on average then
we never decide that orderId is a scalable statistic.

I propose adding a final step to ANALYZE that applies a cross-column
rule after all columns have been analysed. If all except one column of a
PK have very low Ndistinct we can use that to calculate a minimum number
of Ndistinct for the column with a high number of values. If that
minimum number is less than the Ndistinct estimate in isolation, then we
overlay the new value.

This is desirable because the estimation of Ndistinct is very sensitive
to the number of matching rows in the sample, so Ndistinct estimates are
usually very poor for large Ndistinct. The estimates for low Ndistinct
are much better, so we can use them with a lower standard error to
correct the in-isolation estimate of other columns.
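[A sketch of the proposed overlay rule, with hypothetical names. The key fact is that a PK is unique across all rows, so the product of its columns' distinct counts can't be smaller than reltuples; that gives a floor for the high-Ndistinct column.]

```python
def overlay_pk_ndistinct(reltuples, pk_ndistincts, target_col):
    """Sketch of the proposed final ANALYZE step: derive a minimum
    Ndistinct for target_col from the other (low-Ndistinct) PK columns
    and overlay it if the in-isolation estimate fell below that floor.
    pk_ndistincts maps each PK column name to its estimated Ndistinct."""
    # Product of Ndistinct over the other PK columns.
    other = 1.0
    for col, nd in pk_ndistincts.items():
        if col != target_col:
            other *= nd
    # Uniqueness of the PK over reltuples rows implies this floor.
    minimum = reltuples / other
    return max(pk_ndistincts[target_col], minimum)
```

For example, with a million-row Order_line table, lineNum at Ndistinct 10, and a badly underestimated orderId at 20,000, the rule would raise orderId's Ndistinct to at least 100,000.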

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#4 Jim C. Nasby
jim@nasby.net
In reply to: Simon Riggs (#3)
Re: Cross-table statistics idea

On Wed, Sep 27, 2006 at 12:30:43PM +0100, Simon Riggs wrote:

On Tue, 2006-09-26 at 21:27 -0500, Jim C. Nasby wrote:

Since I don't recall any ideas ever having been thrown out on how to do
this...

ISTM that we could gain additional insight on how many rows would likely
result from a join

One thing we can do is to use cross-column relationships to improve the
estimation of Ndistinct.

If we have a table
Order_line (PK orderId, lineNum)

If we look at lineNum and see it has on average 10 values per orderId,
we can then use this information to compute that Ndistinct for orderId
should be -0.1, i.e. the number of distinct values is proportional to
the number of rows, at one-tenth of the row count.

Right now if there are more than 10 lineNums per orderId on average then
we never decide that orderId is a scalable statistic.

I propose adding a final step to ANALYZE that applies a cross-column
rule after all columns have been analysed. If all except one column of a
PK have very low Ndistinct we can use that to calculate a minimum number
of Ndistinct for the column with a high number of values. If that
minimum number is less than the Ndistinct estimate in isolation, then we
overlay the new value.

This is desirable because the estimation of Ndistinct is very sensitive
to the number of matching rows in the sample, so Ndistinct estimates are
usually very poor for large Ndistinct. The estimates for low Ndistinct
are much better, so we can use them with a lower standard error to
correct the in-isolation estimate of other columns.

But wouldn't overlaying the value screw us if we wanted to look up
something based on the unique field? (ie: if there was a line_id in
order_line and we wanted to look something up based on line_id).
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)