More thoughts about planner's cost estimates

Started by Tom Lane · 42 messages · pgsql-hackers
#1Tom Lane
tgl@sss.pgh.pa.us

I've been thinking about the planner's costing problems again,
particularly in connection with Philippe Lang's complaint here:
http://archives.postgresql.org/pgsql-general/2006-05/msg01526.php
Investigation showed that the planner is much too optimistic about the
usefulness of a multi-index BitmapAnd plan, and since that comparison is
just a cost-estimate comparison, it implies we are underestimating the
cost of an index scan. A typical example from his results is

->  BitmapAnd  (cost=12.30..12.30 rows=1 width=0) (actual time=0.306..0.306 rows=0 loops=13628)
      ->  Bitmap Index Scan on lw_id_workflow  (cost=0.00..2.02 rows=7 width=0) (actual time=0.009..0.009 rows=7 loops=13628)
            Index Cond: (lw.id_workflow = "outer".id)
      ->  Bitmap Index Scan on lw_ordre  (cost=0.00..10.03 rows=1437 width=0) (actual time=0.293..0.293 rows=1714 loops=13628)
            Index Cond: (ordre = $4)

There are two variables involved here: the cost of touching an index page
and the cost of processing an index tuple. Given the two comparable data
points above, we can solve for those numbers, and it turns out that the
real cost ratio on Philippe's machine is about 50 to 1. Which says that
for him, cpu_index_tuple_cost plus cpu_operator_cost should be around
0.02, nearly an order of magnitude higher than their current default
values (0.001 and 0.0025 respectively).
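
(If you want to reproduce that arithmetic: assume the two scans touched
about 1 and 4 leaf pages respectively, which is what the 2.02 and 10.03
cost estimates imply once the metapage is excluded, and solve the two
resulting linear equations. The little program below does exactly that;
the page counts and the program are an illustrative reconstruction, not
something taken from Philippe's report.)

#include <stdio.h>

int
main(void)
{
    /* pages * P + tuples * T = actual time in ms, per the EXPLAIN above */
    double  pages1 = 1, tuples1 = 7,    time1 = 0.009;
    double  pages2 = 4, tuples2 = 1714, time2 = 0.293;

    /* eliminate P to solve the 2x2 system */
    double  T = (time2 * pages1 - time1 * pages2) /
                (tuples2 * pages1 - tuples1 * pages2);
    double  P = (time1 - tuples1 * T) / pages1;

    printf("per page %.5f ms, per tuple %.5f ms, ratio %.0f\n",
           P, T, P / T);        /* ratio comes out close to 50 */
    return 0;
}

With a sequential page fetch pegged at cost 1.0, a roughly 50-to-1 ratio
puts the per-tuple CPU charge at about 1/50 = 0.02.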

In general it seems to me that for CPU-bound databases, the default values
of the cpu_xxx_cost variables are too low. I am tempted to raise the
default value of cpu_index_tuple_cost to 0.005, which would be half of
cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
my feel for the relative costs of things. For a CPU-bound database all
three would need to go up more than that. But rather than telling people
to manipulate all three of these variables individually, I think it might
also be a good idea to provide a new GUC variable named something like
"cpu_speed_scale" that would just be a multiplier for the other variables.
It would default to 1.0 and our standard advice for CPU-bound databases
would be "decrease random_page_cost to 1.0 and raise cpu_speed_scale to
10.0 or so". Seems cleaner than telling people to muck with three or so
individual values.
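
Just to make that concrete, the knob could be as simple as the sketch below
(the GUC name and the helper are purely illustrative; nothing like this
exists today):

/* hypothetical GUC, default 1.0; raise it on CPU-bound machines */
static double cpu_speed_scale = 1.0;

/* every place the planner charges a cpu_* cost would scale it, e.g. */
static double
scaled_cpu_cost(double base_cpu_cost)
{
    return base_cpu_cost * cpu_speed_scale;
}

/* usage: run_cost += scaled_cpu_cost(cpu_tuple_cost) * tuples_fetched; */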

Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:

/*
 * Estimate the number of index pages that will be retrieved.
 *
 * For all currently-supported index types, the first page of the index is
 * a metadata page, and we should figure on fetching that plus a pro-rated
 * fraction of the remaining pages.
 */
if (index->pages > 1 && index->tuples > 0)
{
    numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
    numIndexPages += 1;         /* count the metapage too */
    numIndexPages = ceil(numIndexPages);
}
else
    numIndexPages = 1.0;

/*
 * Compute the index access cost.
 *
 * Disk cost: our generic assumption is that the index pages will be read
 * sequentially, so they have cost 1.0 each, not random_page_cost.
 */
*indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to
btrees. It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages). On the other hand it's surely
overcharging for metapage touches. As of CVS tip we cache the metapage in
the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one
that's suffered any amount of page splitting or recycling.)

I've desisted from touching this mainly because the obvious sanity
adjustments, such as adding something for tree descent and charging
random_page_cost not 1.0 for leaf page touches, would increase the
estimated cost of index usage, which seemed like not the direction we
wanted to go in. So I figured that the errors were more or less
cancelling each other. But the addition of bitmap index scans is now
exposing the weaknesses, so we need to face up to the problem.

I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent. I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often. random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out. With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0. This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change. And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.
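
In code form the proposal amounts to something like the sketch below
(illustrative only, not a worked-out patch):

/*
 * Sketch of the proposed model: pro-rate leaf pages against the total
 * number of index entries, ignore the metapage and the upper tree levels,
 * and charge random_page_cost for each leaf page touched.
 */
if (index->pages > 1 && index->tuples > 0)
    numIndexPages = ceil((numIndexTuples / index->tuples) *
                         (index->pages - 1));       /* leaf pages only */
else
    numIndexPages = 1.0;

/* Disk cost: treat each leaf page fetch as a random access */
*indexTotalCost = numIndexPages * random_page_cost;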

Now we have seen a lot of cases in which indexscans were being drastically
overestimated, so increasing the cost estimate even more may seem like a
horrid idea. But I believe that most or all of these cases were ones
where the same index was being visited many times, and so the real
estimation failure is not accounting for cache effects across multiple
indexscans. Rather than being afraid to raise the cost estimate for an
indexscan in isolation, I think it's time to bite the bullet and do
something about that.

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned. If we did
have that number, I think it'd be reasonable to use the Mackert and Lohman
approximation already used in cost_index to estimate the total number of
index leaf pages fetched over N scans, and then divide by N to get our
per-scan estimated cost. But because the planner builds cost estimates
bottom-up, in most scenarios we don't have any idea whether an indexscan
plan node will be iterated once or many times in the finished plan.
However there is one case where we do have some information: when
computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
to use the estimated size of the outer relation as the loop count.
And this is exactly the case where we are falling down in practice:
the complaints we've seen are mostly about overestimating the cost of a
nestloop-with-inner-index-scan. So I think handling just this case would
be a big step forward, even if it's not a theoretically pure general
solution.
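
Roughly, for the nestloop case the computation would look like the sketch
below; the variable and helper names (outer_rel_rows,
mackert_lohman_pages_fetched) are stand-ins for the Mackert-Lohman
machinery, not existing code:

/* rows expected from the outer relation; known in best_inner_indexscan() */
double  loop_count = outer_rel_rows;

/*
 * Estimate the total number of distinct index leaf pages fetched over all
 * iterations, using the Mackert-Lohman approximation over the index
 * (pages, tuples, fraction of entries fetched, cache size), then amortize
 * the random-access charge back to a single scan.
 */
double  pages_fetched =
    mackert_lohman_pages_fetched(index->pages,
                                 index->tuples,
                                 (numIndexTuples * loop_count) / index->tuples,
                                 effective_cache_size);

*indexTotalCost = (pages_fetched * random_page_cost) / loop_count;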

Thoughts, gripes, better ideas?

regards, tom lane

#2Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#1)
Re: More thoughts about planner's cost estimates

Tom,

As you know, this is something I think about a bit too, though not
nearly as deeply as you.

In general it seems to me that for CPU-bound databases, the default values
of the cpu_xxx_cost variables are too low. I am tempted to raise the
default value of cpu_index_tuple_cost to 0.005, which would be half of
cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
my feel for the relative costs of things. For a CPU-bound database all
three would need to go up more than that. But rather than telling people
to manipulate all three of these variables individually, I think it might
also be a good idea to provide a new GUC variable named something like
"cpu_speed_scale" that would just be a multiplier for the other variables.

Yes, please! I've done a bit of playing around with the cpu_*
variables, and frankly have always manipulated them by the same multiple.

Also, my testing has shown that on *large* databases (20x RAM +) with
fast processors (Opteron) the cpu_* values are too high. But a lot of
that is offsetting estimates which are too high elsewhere ... as is
random_page_cost.

I've desisted from touching this mainly because the obvious sanity
adjustments, such as adding something for tree descent and charging
random_page_cost not 1.0 for leaf page touches, would increase the
estimated cost of index usage, which seemed like not the direction we
wanted to go in. So I figured that the errors were more or less
cancelling each other. But the addition of bitmap index scans is now
exposing the weaknesses, so we need to face up to the problem.

Yeah. I've refrained from proposing changes because it's a game of
pick-up-sticks. If we start modifying the model, we need to fix
*everything*, not just one item. And then educate our users that they
need to use the GUC variables in a different way. Here are the issues I see:

1. n-distinct estimation is bad, as previously discussed;

2. our current heuristic sampling methods prevent us from sampling more
than 0.5% of any reasonably large table, causing all statistics on those
tables to be off for any table with irregular distribution of values;

3. We don't have any method to analyze inter-column correlation within a
table;

4. We don't keep statistics on foreign key correlation;

5. random_page_cost (as previously discussed) is actually a function of
relatively immutable hardware statistics, and as such should not need to
exist as a GUC once the cost model is fixed.

6. We haven't added any way to estimate rows returned from SRFs.

I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent. I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often. random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out. With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0. This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change. And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.

Per above, I think anything involving random_page_cost is the wrong way
to go. RPC is a GUC assigned by the DBA on pure guesswork. We should
be decreasing its use or eliminating it entirely, not increasing the
amount we use it.

For btrees, we should be able to accurately estimate the cost of the
index access given:
a) the depth of the tree;
b) the likelihood that requested values are adjacent;
c) whether the index and tables are already in shared_mem, or else (d)
and (e) below:
d) the probability that the index pages are cached in memory, which is
composed of:
(1) the frequency with which that index is accessed, modified by
(2) whether the index is a fraction of available ram, or larger than ram
e) the probability that the relevant table pages are cached in memory,
determined by the same two factors.

*None* of this should involve a userset parameter (random_page_cost)
since as you pointed out months ago, the ratio of seq/seek access has
remained relatively constant over the past 6 years of HDD and I/O
engineering.

Sadly, I don't have the math for all of the above but I'm hoping someone
(either here or on ACM) does.

Also, this would require us to have different *formulas* and not merely
different cost factors for different kinds of indexes. I believe that
this is something that Jie is already struggling with. Jie?

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned. If we did
have that number, I think it'd be reasonable to use the Mackert and Lohman
approximation already used in cost_index to estimate the total number of
index leaf pages fetched over N scans, and then divide by N to get our
per-scan estimated cost. But because the planner builds cost estimates
bottom-up, in most scenarios we don't have any idea whether an indexscan
plan node will be iterated once or many times in the finished plan.

I think we can work around this by figuring out whether the index has
already been scanned in the plan, something we *can* know. So the first
scan will be full cost and remaining scans will be fractional cost.
Possibly not as good as Mackert/Lohman, but it allows us reasonably
accurate *overall* estimates without needing to move away from bottom-up
plan building.

Thoughts, gripes, better ideas?

yes. I think if we're going to fix the cost model, we should overhaul
it and target 8.3. Realistically, once we start tinkering with the
cost model most queries are going to get worse before they get better.

--Josh

#3Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#2)
Re: More thoughts about planner's cost estimates

Josh Berkus <josh@agliodbs.com> writes:

1. n-distinct estimation is bad, as previously discussed;

2. our current heuristic sampling methods prevent us from sampling more than
0.5% of any reasonably large table, causing all statistics on those tables to
be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For estimating
distributions of values over a range, using small samples is on a solid
statistical basis. It's the same reason poll results based on only a few
hundred people can accurately predict elections in a country with hundreds
of millions of voters.

However for things like estimating discrete values there's absolutely no solid
foundation for these small samples. Your accuracy is going to be simply
proportional to the sample size, meaning you can't get anything trustworthy
without sampling large enough portions of the table that a full sequential
scan would be just as fast.

I might be interested in implementing that algorithm that was posted a while
back that involved generating good unbiased samples of discrete values. The
algorithm was quite clever and well described and paper claimed impressively
good results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.

3. We don't have any method to analyze inter-column correlation within a table;

4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

5. random_page_cost (as previously discussed) is actually a function of
relatively immutable hardware statistics, and as such should not need to exist
as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the
cache effects are modelled properly and the only excuses left are legitimate
hardware differences.

--
greg

#4Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#3)
Re: More thoughts about planner's cost estimates

Greg,

I'm convinced these two are more connected than you believe.

Actually, I think they are inseparable.

I might be interested in implementing that algorithm that was posted a
while back that involved generating good unbiased samples of discrete
values. The algorithm was quite clever and well described and paper
claimed impressively good results.

However it will only make sense if people are willing to accept that
analyze will need a full table scan -- at least for tables where the DBA
knows that good n_distinct estimates are necessary.

What about block-based sampling? Sampling 1 in 20 disk pages, rather than
1 in 20 rows, should require significantly less scanning, and yet give us
a large enough sample for reasonable accuracy.

3. We don't have any method to analyze inter-column correlation within
a table;

4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody
even made any headway in brainstorming how to tackle them?

There's no time like the present!

Actually, these both seem like fairly straightforward problems
storage-wise. The issue is deriving the statistics, for tables with many
columns or FKs.

5. random_page_cost (as previously discussed) is actually a function
of relatively immutable hardware statistics, and as such should not
need to exist as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if
the cache effects are modelled properly and the only excuses left are
legitimate hardware differences.

OK ... but still, it should become a "little knob" rather than the "big
knob" it is currently.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#5Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#4)
Re: More thoughts about planner's cost estimates

Josh Berkus <josh@agliodbs.com> writes:

However it will only make sense if people are willing to accept that
analyze will need a full table scan -- at least for tables where the DBA
knows that good n_distinct estimates are necessary.

What about block-based sampling? Sampling 1 in 20 disk pages, rather than
1 in 20 rows, should require significantly less scanning, and yet give us
a large enough sample for reasonable accuracy.

a) We already use block based sampling to reduce overhead. If you're talking
about using the entire block and not just randomly sampled tuples from
within those blocks then your sample will be biased.

b) It *still* wouldn't give you reasonable accuracy for n_distinct estimates.
Your probability of being accurate for discrete processes like n_distinct
is going to be more or less proportional to your sample. So sampling 5% of
the table gives you only about a 5% chance of an accurate prediction. More
or less.

Actually, these both seem like fairly straightforward problems
storage-wise. The issue is deriving the statistics, for tables with many
columns or FKs.

Well there are problems on many levels:

1) You have n^2 possible two-column combinations. That's a lot of processing
and storage.

2) It isn't even clear what data you're exactly looking for. Certainly
"correlation" is just shorthand here and isn't what you're actually looking
for. You need something like a complete histogram for the dependent column
for every possible value of the forcing column. That would be an enormous
amount of storage.

OK ... but still, it should become a "little knob" rather than the "big
knob" it is currently.

Sure, the more accurately you model the cache the less need there would be to
adjust random_page_cost. I'm not sure how accurate Postgres can really get
unless it switches philosophies towards using O_DIRECT and doing its own
caching and scheduling. But it could certainly be much better than today.
Tom's comment makes it look like there's hope for pretty accurate cache
modelling for nested loop plans.

--
greg

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#2)
Re: More thoughts about planner's cost estimates

Josh Berkus <josh@agliodbs.com> writes:

Yeah. I've refrained from proposing changes because it's a game of
pick-up-sticks. If we start modifying the model, we need to fix
*everything*, not just one item. And then educate our users that they
need to use the GUC variables in a different way. Here are the issues I see:

[ various deficiencies in our statistics gathering ]

Those are all valid points, but pretty much orthogonal to what I'm on
about today. The problems I'm thinking about are seen even when the
planner's rowcount estimates are dead on (as indeed they mostly were
in Philippe's example).

5. random_page_cost (as previously discussed) is actually a function of
relatively immutable hardware statistics, and as such should not need to
exist as a GUC once the cost model is fixed.

Well, it'll still exist as a GUC for the same reasons the other ones are
GUCs. But I think the main reasons people have needed to twiddle it are
(1) their database is completely RAM-resident (where RPC
*should* logically be 1.0), or
(2) they're trying to compensate for the overestimation of
nestloop indexscans.
The changes I'm proposing should help with (2).

For btrees, we should be able to accurately estimate the cost of the
index access given:
a) the depth of the tree;

Although logically the tree descent *ought* to be a factor, I haven't
seen any evidence that we need to take it into account; in fact all the
evidence points to the conclusion that we're better off ignoring it.
My theory about why is that the upper levels of the tree stay in cache.
I have to admit that's just handwaving, but counting additional disk
hits to fetch the first index tuple is clearly not the direction the
cost model needs to go in. Look again at the examples in Philippe's
output: an indexscan fetching 1700 tuples (about 5 leaf pages probably)
took 32x longer than one fetching 7 tuples (presumably all on the same
leaf page). There's no way that a model that counts an additional
couple of page fetches for tree descent is going to arrive at those
numbers. And we see this over and over again: incredibly small times
to fetch a single index tuple, at least on the average when the index
is being hit many times in one query.

c) whether the index and tables are already in shared_mem, or else (d)
and (e) below:
d) the probability that the index pages are cached in memory, which is
composed of:
(1) the frequency with which that index is accessed, modified by
(2) whether the index is a fraction of available ram, or larger than ram
e) the probability that the relevant table pages are cached in memory,
determined by the same two factors.

These would all be nice things to know, but I'm afraid it's pie in the
sky. We have no reasonable way to get those numbers. (And if we could
get them, there would be another set of problems, namely plan instability:
the planner's choices would become very difficult to reproduce.)

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.

I think we can work around this by figuring out whether the index has
already been scanned in the plan, something we *can* know. So the first
scan will be full cost and remaining scans will be fractional cost.

Uh, no, because what we're normally thinking about is independent probes
into the index with different keys. For a small number of probes you
have to figure that those all hit different leaf pages and aren't going
to amortize well. As the number of probes goes up, you can start
charging fractional costs because it's more likely you're hitting a leaf
page you hit recently. The Mackert/Lohman equations do this reasonably
well --- it's possible we can find something better, but those seem like
a good place to start. The immediate problem is to get an
infrastructure in place that gives us some numbers to apply equations to.
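
For reference, the approximation (as documented in the cost_index comments)
can be transcribed like this, with T = pages in the relation, N = tuples,
s = selectivity, and b = buffer pages available; the function below is just
an illustration of the formula, not planner code:

#include <math.h>

static double
mackert_lohman_pages_fetched(double T, double N, double s, double b)
{
    double  Ns = N * s;         /* expected number of tuples fetched */

    if (T <= b)
        return fmin(2.0 * T * Ns / (2.0 * T + Ns), T);
    else if (Ns <= 2.0 * T * b / (2.0 * T - b))
        return 2.0 * T * Ns / (2.0 * T + Ns);
    else
        return b + (Ns - 2.0 * T * b / (2.0 * T - b)) * (T - b) / T;
}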

regards, tom lane

#7Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#5)
Re: More thoughts about planner's cost estimates

Greg, Tom,

a) We already use block based sampling to reduce overhead. If you're
talking about using the entire block and not just randomly sampled
tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating
both row correlation and n-distinct based on sampling complete blocks.
See prior discussions on this list on N-distinct.

1) You have n^2 possible two-column combinations. That's a lot of
processing and storage.

Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2.

2) It isn't even clear what data you're exactly looking for. Certainly
"correlation" is just shorthand here and isn't what you're actually
looking for.

Actually, I'd think that a correlation number estimate (0 = completely
uncorrelated, 1 = completely correlated) would be sufficient to improve
row count estimation significantly, without incurring the vast overhead of
histogram X histogram manipulation.
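
For instance, a single number c could be folded in with a simple
interpolation like the one below; this is purely illustrative, not an
existing planner formula:

/*
 * Hypothetical combination of two clause selectivities s1 and s2 given a
 * correlation-style number c in [0, 1]: c = 0 treats the clauses as
 * independent, c = 1 lets the more selective clause dominate.
 */
static double
combined_selectivity(double s1, double s2, double c)
{
    double  independent = s1 * s2;
    double  dependent = (s1 < s2) ? s1 : s2;

    return (1.0 - c) * independent + c * dependent;
}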

Those are all valid points, but pretty much orthogonal to what I'm on
about today. The problems I'm thinking about are seen even when the
planner's rowcount estimates are dead on (as indeed they mostly were
in Philippe's example).

OK, I was afraid they were interdependent. You would know better than me.

Well, it'll still exist as a GUC for the same reasons the other ones are
GUCs. But I think the main reasons people have needed to twiddle it are
(1) their database is completely RAM-resident (where RPC
*should* logically be 1.0), or
(2) they're trying to compensate for the overestimation of
nestloop indexscans.
The changes I'm proposing should help with (2).

Right. What I'm saying is that (1) should be derived from
estimated_cache_size and DBSIZE, not by setting an additional GUC.

(1) the frequency with which that index is accessed, modified by
(2) whether the index is a fraction of available ram, or larger than
ram e) the probability that the relevant table pages are cached in
memory, determined by the same two factors.

These would all be nice things to know, but I'm afraid it's pie in the
sky. We have no reasonable way to get those numbers. (And if we could
get them, there would be another set of problems, namely plan
instability: the planner's choices would become very difficult to
reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross
level. Whether the cost of calculating them outweighs the benefit of
having them is another question, resolvable only through some
experimentation.

a good place to start. The immediate problem is to get an
infrastructure in place that gives us some numbers to apply equations
to.

No arguments, of course.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#8Kenneth Marshall
ktm@it.is.rice.edu
In reply to: Josh Berkus (#7)
Re: More thoughts about planner's cost estimates

Josh, Greg, and Tom,

I do not know how sensitive the plans will be to the correlation,
but one thought might be to map the histogram X histogram correlation
to a square grid of values. Then you can map them to an integer which
would give you 8 x 8 with binary values, a 5 x 5 with 4 values per
point, or a 4 x 4 with 8 values per point. If close is good enough,
that would be a compact way to store some histogram cross correlation
information.
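
A minimal sketch of the 8 x 8 binary variant, assuming the target is a
64-bit integer (names and layout here are illustrative only):

#include <stdbool.h>
#include <stdint.h>

/* cell (i, j) pairs histogram bucket i of column A with bucket j of column B */
static inline void
corr_grid_set(uint64_t *grid, int i, int j)
{
    *grid |= UINT64_C(1) << (i * 8 + j);
}

static inline bool
corr_grid_test(uint64_t grid, int i, int j)
{
    return (grid >> (i * 8 + j)) & 1;
}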

Ken


On Thu, Jun 01, 2006 at 01:50:26PM -0700, Josh Berkus wrote:

Greg, Tom,

...

2) It isn't even clear what data you're exactly looking for. Certainly
"correlation" is just shorthand here and isn't what you're actually
looking for.

Actually, I'd think that a correlation number estimate (0 = completely
uncorrelated, 1 = completely correlated) would be sufficient to improve
row count estimation significantly, without incurring the vast overhead of
histogram X histogram manipulation.
...

#9Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#7)
Re: More thoughts about planner's cost estimates

Greg,

1) You have n^2 possible two-column combinations. That's a lot of
processing and storage.

Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

Ooops, bad math. Andrew pointed out it's actually n*(n-1)/2, not n!.

Also, we could omit columns unlikely to correlate, such as large text
columns, bytea and numerics with high precision. Also, we probably don't
need to correlate UNIQUE columns inside ... I think.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#10Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#3)
Re: More thoughts about planner's cost estimates

On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:

Josh Berkus <josh@agliodbs.com> writes:

1. n-distinct estimation is bad, as previously discussed;

2. our current heuristic sampling methods prevent us from sampling more than
0.5% of any reasonably large table, causing all statistics on those tables to
be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For distributions
of values in a range using small samples is on solid statistical basis. It's
the same reason poll results based on only a few hundred people can accurately
predict elections in a country with hundreds of millions of voters.

However for things like estimating discrete values there's absolutely no solid
foundation for these small samples. Your accuracy is going to be simply
proportional to the sample size, meaning you can't get anything trustworthy
without sampling large enough portions of the table that a full sequential
scan would be just as fast.

I might be interested in implementing that algorithm that was posted a while
back that involved generating good unbiased samples of discrete values. The
algorithm was quite clever and well described and paper claimed impressively
good results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.

There *might* be an alternative....

Suppose that the executor could keep track of what values it's seeing
from a table as it's actually running a query. These could be used to
build statistical information without actually running analyze.

Of course, the problem is that you could end up with some seriously
skewed statistics, depending on what your query patterns are. But in
some ways that's not bad; you're optimizing for the queries you most
often run.

One possible way to handle this would be to keep track of how many times
each value has been seen, as well as when it was last seen, so you have
some idea of the quality of the data. Another idea is to somehow blend
these stats with the traditional method.
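
To illustrate the bookkeeping, something like the struct below would do;
the names and fields are hypothetical:

/* hypothetical per-value record accumulated as queries execute */
typedef struct ExecutorValueStat
{
    Datum       value;          /* value seen while scanning */
    uint64      times_seen;     /* how many times it has been seen */
    TimestampTz last_seen;      /* when it was last seen, to judge staleness */
} ExecutorValueStat;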

3. We don't have any method to analyze inter-column correlation within a table;

4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

A *huge* improvement would be gathering statistics on all multi-column
indexes. Some of the stats might not make sense in this context, but
others (such as correlation) would.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#11Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#6)
Re: More thoughts about planner's cost estimates

On Thu, Jun 01, 2006 at 03:15:09PM -0400, Tom Lane wrote:

These would all be nice things to know, but I'm afraid it's pie in the
sky. We have no reasonable way to get those numbers. (And if we could
get them, there would be another set of problems, namely plan instability:
the planner's choices would become very difficult to reproduce.)

Speaking of plan instability, something that's badly needed is the
ability to steer away from query plans that *might* be the most optimal,
but also will fail horribly should the cost estimates be wrong. People
generally don't care about getting the absolutely most optimal plan;
they do care about NOT getting a plan that's horribly bad. One possible
way to do this would be to have the estimator calculate a worst-case
cost to go along with the best case cost, or maybe even 3 numbers:
ideal, what we think will happen, and worst-case. I know that index
scans already have that information, at least in the form of cost for a
plan with 0 correlation and one with perfect correlation.
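
Even something as crude as carrying three numbers per path might be enough
to experiment with (a hypothetical sketch, not a proposal for the actual
Path structure):

/* hypothetical cost triple instead of a single estimate */
typedef struct CostRange
{
    Cost    best_case;          /* everything cached, estimates accurate */
    Cost    expected;           /* what the current model produces */
    Cost    worst_case;         /* cold cache, estimates badly wrong */
} CostRange;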

Does anyone have any experience developing genetic algorithms? I think
coming up with cost estimators might be an ideal use for that
technology...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#11)
Re: More thoughts about planner's cost estimates

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Speaking of plan instability, something that's badly needed is the
ability to steer away from query plans that *might* be the most optimal,
but also will fail horribly should the cost estimates be wrong.

You sure that doesn't leave us with the empty set :-( ? Any plan
we pick will be horrible under some scenario. I do agree that the
current lowest-cost-and-nothing-else criterion can pick some pretty
brittle plans, but it's not that easy to see how to improve it.
I don't think either "best case" or "worst case" are particularly
helpful concepts here. You'd almost have to try to develop an
estimated probability distribution, and that's way more info than
we have.

People generally don't care about getting the absolutely most optimal
plan; they do care about NOT getting a plan that's horribly bad.

If 8.2 runs a query at half the speed of 8.1, people will be unhappy,
and they won't be mollified if you tell them that that plan is "better"
because it would have degraded less quickly if the planner's estimates
were wrong. The complaint that started this thread (Philippe Lang's
a couple days ago) was in fact of exactly that nature: his query was
running slower than it had been because the planner was picking bitmap
scans instead of plain index scans. Well, those bitmap scans would have
been a lot more robust in the face of bad rowcount estimates, but the
estimates weren't wrong.

regards, tom lane

#13Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#7)
Re: More thoughts about planner's cost estimates

Josh Berkus <josh@agliodbs.com> writes:

Greg, Tom,

a) We already use block based sampling to reduce overhead. If you're
talking about using the entire block and not just randomly sampled
tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating
both row correlation and n-distinct based on sampling complete blocks.
See prior discussions on this list on N-distinct.

In the prior discussions someone posted the paper with the algorithm I
mentioned. That paper mentions that previous work showed poor results at
estimating n_distinct even with sample sizes as large as 50% or more.

1) You have n^2 possible two-column combinations. That's a lot of
processing and storage.

Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2.

You have O(n^2) possible *two-column* combinations and O(n!) n-column
combinations. I was hoping two-column combinations would be enough information
to extrapolate from for larger combinations.

2) It isn't even clear what data you're exactly looking for. Certainly
"correlation" is just shorthand here and isn't what you're actually
looking for.

Actually, I'd think that a correlation number estimate (0 = completely
uncorrelated, 1 = completely correlated) would be sufficient to improve
row count estimation significantly, without incurring the vast overhead of
histogram X histogram manipulation.

No, correlation is actually quite uninteresting here. Consider for example two
columns "state" and "area code". The "area code" is pretty much 100% dependent
on state, but will show very little correlation. Similarly things like
"product name" and "catalog number" or even "author" and "genre" would be
problem cases but have little correlation. And you can easily imagine queries
that specify restrictive clauses on two such columns for which the existing
statistics will overestimate the selectivity because it assumes no
interdependency.

Hopefully you're right that you don't need complete histograms. Perhaps
there's some statistics concept they don't teach in stats 101 that would cover
this need. What we're looking for is a function f(a,b,x) that gives the net
selectivity given a and b, the selectivity of two clauses based on two
columns, and x some simple value that can be easily calculated by looking at
the data in advance.

--
greg

#14David Fetter
david@fetter.org
In reply to: Bruce Momjian (#13)
Re: More thoughts about planner's cost estimates

On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:

Josh Berkus <josh@agliodbs.com> writes:

Greg, Tom,

a) We already use block based sampling to reduce overhead. If
you're talking about using the entire block and not just
randomly sampled tuples from within those blocks then your
sample will be biased.

There are actually some really good equations to work with this,
estimating both row correlation and n-distinct based on sampling
complete blocks. See prior discussions on this list on
N-distinct.

In the prior discussions someone posted the paper with the algorithm
I mentioned. That paper mentions that previous work showed poor
results at estimating n_distinct even with sample sizes as large as
50% or more.

Which paper? People have referenced several different ones.

1) You have n^2 possible two-column combinations. That's a lot
of processing and storage.

Yes, that's the hard problem to solve. Actually, btw, it's n!,
not n^2.

You have O(n^2) possible *two-column* combinations and O(n!)
n-column combinations. I was hoping two-column combinations would
be enough information to extrapolate from for larger combinations.

The math nerd in me says that this is bad math, but it might work
"well enough" by ass-u-ming a lack of higher-order correlations.

2) It isn't even clear what data you're exactly looking for.
Certainly "correlation" is just shorthand here and isn't what
you're actually looking for.

Actually, I'd think that a correlation number estimate (0 =
completely uncorrelated, 1 = completely correlated) would be
sufficient to improve row count estimation significantly, without
incurring the vast overhead of histogram X histogram manipulation.

No, correlation is actually quite uninteresting here. Consider for
example two columns "state" and "area code". The "area code" is
pretty much 100% dependent on state, but will show very little
correlation.

There are quantitative tests of independence available. I'm not sure
whether any of these have been applied even theoretically in
DBMS-land.

Hopefully you're right that you don't need complete histograms.
Perhaps there's some statistics concept they don't teach in stats
101 that would cover this need. What we're looking for is a
function f(a,b,x) that gives the net selectivity given a and b, the
selectivity of two clauses based on two columns, and x some simple
value that can be easily calculated by looking at the data in
advance.

That would be neat :)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!

#15Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Tom Lane (#1)
Re: More thoughts about planner's cost estimates

Tom Lane wrote:

Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:

/*
 * Estimate the number of index pages that will be retrieved.
 *
 * For all currently-supported index types, the first page of the index is
 * a metadata page, and we should figure on fetching that plus a pro-rated
 * fraction of the remaining pages.
 */
if (index->pages > 1 && index->tuples > 0)
{
    numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
    numIndexPages += 1;         /* count the metapage too */
    numIndexPages = ceil(numIndexPages);
}
else
    numIndexPages = 1.0;

/*
 * Compute the index access cost.
 *
 * Disk cost: our generic assumption is that the index pages will be read
 * sequentially, so they have cost 1.0 each, not random_page_cost.
 */
*indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to
btrees. It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages). On the other hand it's surely
overcharging for metapage touches. As of CVS tip we cache the metapage in
the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one
that's suffered any amount of page splitting or recycling.)

I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent. I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often. random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out. With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0. This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change. And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.

This sounds good to me, although the 2.0 -> 4.0 cost jump may cause
(more) cases of people seeing their index scans in pre-8.2 versions
becoming some other type of access in 8.2. I guess a comment about
testing existing applications could be placed in the 8.2 release notes?

Now we have seen a lot of cases in which indexscans were being drastically
overestimated, so increasing the cost estimate even more may seem like a
horrid idea. But I believe that most or all of these cases were ones
where the same index was being visited many times, and so the real
estimation failure is not accounting for cache effects across multiple
indexscans. Rather than being afraid to raise the cost estimate for an
indexscan in isolation, I think it's time to bite the bullet and do
something about that.

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned. If we did
have that number, I think it'd be reasonable to use the Mackert and Lohman
approximation already used in cost_index to estimate the total number of
index leaf pages fetched over N scans, and then divide by N to get our
per-scan estimated cost. But because the planner builds cost estimates
bottom-up, in most scenarios we don't have any idea whether an indexscan
plan node will be iterated once or many times in the finished plan.
However there is one case where we do have some information: when
computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
to use the estimated size of the outer relation as the loop count.
And this is exactly the case where we are falling down in practice:
the complaints we've seen are mostly about overestimating the cost of a
nestloop-with-inner-index-scan. So I think handling just this case would
be a big step forward, even if it's not a theoretically pure general
solution.

If I understand correctly, this is the case where we normally see folks
needing to add several 'set enable_xxx=off' statements to get the nested
loop plan that *actually* works best :-). So, also looks good!

Cheers

Mark

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Kirkwood (#15)
Re: More thoughts about planner's cost estimates

Mark Kirkwood <markir@paradise.net.nz> writes:

Tom Lane wrote:

With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0. This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change. And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.

This sounds good to me, although the 2.0 -> 4.0 cost jump may cause
(more) cases of people seeing their index scans in pre-8.2 versions
becoming some other type of access in 8.2. I guess a comment about
testing existing applications could be placed in the 8.2 release notes?

Yeah, that comes with the territory. One point to note is that with
this model, setting random_page_cost below 2.0 will actually make small
indexscans look *cheaper* than they do now. So it'll certainly be
possible to make the thing jump in that direction if you need to.

regards, tom lane

#17Nicolai Petri
nicolai@catpipe.net
In reply to: Josh Berkus (#2)
Re: More thoughts about planner's cost estimates

Hi All,

Just a small comment from a mortal user.

On Thursday 01 June 2006 19:28, Josh Berkus wrote:

5. random_page_cost (as previously discussed) is actually a function of
relatively immutable hardware statistics, and as such should not need to
exist as a GUC once the cost model is fixed.

It's correct that the hardware statistics are quite immutable - per
tablespace. This suggests to me that the GUC should be a default value only,
overridable per tablespace. It's quite normal to have primary (current
data) and secondary (historical data) storage for larger databases.
This could also fit nicely with pg_tmp storage if we tie support for multiple
tmp folders together with tablespaces.

6. We haven't added any way to estimate rows returned from SRFs.

This would also be very cool - currently the planner can really get annoyed
when joining SRF functions with tables.

---
Nicolai

#18Bruce Momjian
bruce@momjian.us
In reply to: David Fetter (#14)
Re: More thoughts about planner's cost estimates

David Fetter <david@fetter.org> writes:

In the prior discussions someone posted the paper with the algorithm
I mentioned. That paper mentions that previous work showed poor
results at estimating n_distinct even with sample sizes as large as
50% or more.

Which paper? People have referenced several different ones.

Phillip B Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct
Values Queries and Event Reports. In Proceedings of the 27th VLDB Conference,
Roma, Italy, 2001.

Which says (emphasis in original as italics):

The most well-studied approach for distinct-values estimation is to
collect a uniform random sample S of the data, store S in a database, and
then use S at query time to provide fast, approximate answers to distinct
values queries [22, 23, 27, 21, 29, 5, 28, 18, 19, 9, 7]. However,
previous work [28, 18, 9, 7] has shown powerful negative results on the
quality of distinct-values estimates based on sampling (or other
techniques that examine only part of the input data), even for the simple
case of counting the number of distinct values in a column. The strongest
negative result is due to Charikar et al. [7], who proved that estimating
the number of distinct values in a column to within a small constant
factor (with probability > 1/2) requires that *nearly* *the* *entire*
*data* *set* *be* *sampled*. Moreover, all known sampling-based estimators
provide unsatisfactory results on data sets of interest [7], even for this
simple case.

And later:

Using a variety of synthetic and real-world data sets, we show that
distinct sampling gives estimates for distinct values queries that are
within 0%-10%, whereas previous methods were typically 50%-250% off,
across the spectrum of data sets and queries studied.

Here "distinct sampling" is the algorithm being proposed which requires
looking at every record and keeping a sample *of the distinct values*. The
"previous methods" are methods based on sampling the records

I haven't read the citation [7] that proves the strong negative result for any
estimator of distinct values based on sampling. It's:

M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation
error guarantees for distinct values. In Proc. 19th ACM Symp. on Principles
of Database Systems, pages 268-279, May 2000.

Hopefully you're right that you don't need complete histograms.
Perhaps there's some statistics concept they don't teach in stats
101 that would cover this need. What we're looking for is a
function f(a,b,x) that gives the net selectivity given a and b, the
selectivity of two clauses based on two columns, and x some simple
value that can be easily calculated by looking at the data in
advance.

That would be neat :)

Doing a bit of basic searching around I think the tool we're looking for here
is called a "chi-squared test for independence".

I haven't read up on how it works so I'm unclear if it can be calculated using
a simple O(n) scan or if it would require some non-linear post-processing after
the analyze pass, which would be unfortunate.

And I haven't found anything that describes how to make use of the resulting
number. Does it actually give a formula f(a,b,x) that gives a nice convenient
expected selectivity for the clause?

Oh, and incidentally, at first glance it seems like calculating it depends on
having good distinct value sampling.
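
For what it's worth, the statistic itself is cheap once you have a
contingency table of co-occurrence counts, and the table can be built in a
single pass over the sample. A rough sketch (everything here is
illustrative):

#include <stdlib.h>

/* Pearson chi-squared statistic over a c1 x c2 table of observed counts */
static double
chi_squared(const double *obs, int c1, int c2)
{
    double *rowsum = calloc(c1, sizeof(double));
    double *colsum = calloc(c2, sizeof(double));
    double  total = 0.0;
    double  chi2 = 0.0;

    for (int i = 0; i < c1; i++)
        for (int j = 0; j < c2; j++)
        {
            rowsum[i] += obs[i * c2 + j];
            colsum[j] += obs[i * c2 + j];
            total += obs[i * c2 + j];
        }

    for (int i = 0; i < c1; i++)
        for (int j = 0; j < c2; j++)
        {
            double  expected = rowsum[i] * colsum[j] / total;

            if (expected > 0.0)
            {
                double  d = obs[i * c2 + j] - expected;

                chi2 += d * d / expected;
            }
        }

    free(rowsum);
    free(colsum);
    return chi2;
}

Whether the resulting number can be turned directly into a selectivity
correction is exactly the open question above.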

--
greg

#19Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#18)
Re: More thoughts about planner's cost estimates

Greg,

Using a variety of synthetic and real-world data sets, we show that
distinct sampling gives estimates for distinct values queries that
are within 0%-10%, whereas previous methods were typically 50%-250% off,
across the spectrum of data sets and queries studied.

Aha. It's a question of the level of error permissible. For our
estimates, being 100% off is actually OK. That's why I was looking at 5%
block sampling; it stays within the range of +/- 50% n-distinct in 95% of
cases.

Doing a bit of basic searching around I think the tool we're looking for
here is called a "chi-squared test for independence".

Augh. I wrote a program (in Pascal) to do this back in 1988. Now I can't
remember the math. For a two-column test it's relatively
computation-light, though, as I recall ... but I don't remember whether
standard chi square works with a random sample.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#20Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#19)
Re: More thoughts about planner's cost estimates

Josh Berkus <josh@agliodbs.com> writes:

Using a variety of synthetic and real-world data sets, we show that
distinct sampling gives estimates for distinct values queries that
are within 0%-10%, whereas previous methods were typically 50%-250% off,
across the spectrum of data sets and queries studied.

Aha. It's a question of the level of error permissible. For our
estimates, being 100% off is actually OK. That's why I was looking at 5%
block sampling; it stays within the range of +/- 50% n-distinct in 95% of
cases.

Well, let's see. Say for example we're scanning 50,000 records out of 1M
records. Using the Theorem from Charikar et al quoted as Theorem 1 in section
2 we get that there is at least a 5% chance of getting a result with a ratio
error of at least sqrt((1M-50k)/100k * ln(20)), or about 5.33.

So no, even assuming you have a good unbiased sample, a 5% sample is only
going to get you to a result within 19%-533% of the real values 95% of the
time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
some distributions it may be outside that range even more than 95% of the
time.
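
(To check my arithmetic: assuming the bound has the form
ratio_error >= sqrt((n - r) / (2r) * ln(1/delta)) with probability at least
delta, which is the form that reproduces the 5.33 above, the snippet below
just re-evaluates it.)

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  n = 1e6;            /* rows in the table */
    double  r = 5e4;            /* rows sampled (5%) */
    double  delta = 0.05;       /* failure probability */

    printf("ratio error >= %.2f\n",
           sqrt((n - r) / (2.0 * r) * log(1.0 / delta)));   /* about 5.33 */
    return 0;
}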

This is entirely unlike the histograms where we have a solid foundation for a
positive result. We can guarantee that the result will be outside +/- x% *at
most* 5% of the time. For most distributions it'll be outside that range even
less.

And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
5% block sampling took just as long as reading all the blocks. Even if we
figure out what's causing that (IMHO surprising) result and improve matters I
would only expect it to be 3-4x faster than a full scan.

http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

--
greg

#21Michael Dean
mdean@sourceview.com
In reply to: Bruce Momjian (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#20)
#23David Fetter
david@fetter.org
In reply to: Michael Dean (#21)
#24Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#22)
#25Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#24)
#26Todd A. Cook
tcook@blackducksoftware.com
In reply to: Josh Berkus (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
#28Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#25)
#29Rod Taylor
rbt@rbt.ca
In reply to: Tom Lane (#27)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rod Taylor (#29)
#31Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Todd A. Cook (#26)
#32Hannu Krosing
hannu@tm.ee
In reply to: Jim Nasby (#31)
#33Mike Benoit
ipso@snappymail.ca
In reply to: Tom Lane (#30)
#34Hannu Krosing
hannu@tm.ee
In reply to: Bruce Momjian (#20)
#35Bruce Momjian
bruce@momjian.us
In reply to: Hannu Krosing (#34)
#36Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#35)
#37Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#36)
#38Bruce Momjian
bruce@momjian.us
In reply to: Hannu Krosing (#32)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#38)
#40Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#39)
#41Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#39)
#42Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Tom Lane (#27)