Group-count estimation statistics

Started by Tom Lane on 28 January 2005 · 17 messages
#1Tom Lane
tgl@sss.pgh.pa.us

I got a complaint from a fellow Red Hatter that PG 8.0 is way slower
than 7.4 on some statistical analysis tasks he was doing. Investigation
showed that the primary problem was selection of Sort/GroupAgg in place
of HashAgg to compute some grouped aggregates. Essentially he was doing

select sum(), avg() ... from temp_table
group by customer_id, item_id, month, year;

where the temp table had just been constructed and hadn't been analyzed
in any way. (Of course I told him "so analyze" ... but it didn't help.)

In 7.4, the utter lack of any information about the source table caused
the planner to assume only 1000 rows, so of course the estimated hash
table size was plenty small enough to fit in sort_mem. In 8.0, the
planner looks at the true physical size of the table, and even without
any stats comes out with a rowcount estimate tolerably close to the
actual value (about 10M rows). With or without stats it then estimates
that there will also be about 10M groups, under which conditions even
raising sort_mem to the max won't persuade it to hash-aggregate.
(However the actual number of groups is about 0.2M, and hash aggregation
is really a good bit faster than sorting.)

The reason this happens even with stats is that the algorithm for
estimating the number of groups in a multi-group-column situation
is basically "take the product of the number of distinct values of
each grouping column, but clamp to the number of rows estimated for
the whole table". It turns out that customer_id and item_id are
pretty highly correlated, enough that the actual number of distinct
groups is barely a hundredth of what you get from the product rule.

The only real solution, of course, is to acquire cross-column
statistics, but I don't see that happening in the near future.

As a short-term hack, I am thinking that the "clamp to size of table"
part of the rule is overly pessimistic, and that we should consider
something like "clamp to size of table / 10" instead. The justification
for this is the thought that you aren't going to bother grouping unless
it actually reduces the data volume. We have used similar rules in the
past --- for example, before the logic for trying to estimate actual
group counts was put in, the estimate for the number of output rows
from an Agg or Group node was just the number of input rows over 10.
Essentially I think that that rule wasn't too bad, and that we should
allow the statistics-based estimation code to reduce that estimate but
not increase it --- at least not when there's more than one variable
involved. I have some faith in the stats-based approach for a single
column but hardly any for multiple columns, because of the correlation
issue.
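
For concreteness, here is a minimal standalone sketch of the rule just
described (this is not planner source, and the per-column distinct counts
below are invented stand-ins; only the ~10M row count and ~0.2M actual group
count are from the report above):

    #include <stdio.h>

    /* Toy version of "product of per-column ndistinct, clamped". */
    static double
    product_rule(const double *ndistinct, int ncols, double clamp)
    {
        double  est = 1.0;
        int     i;

        for (i = 0; i < ncols; i++)
            est *= ndistinct[i];
        if (est > clamp)
            est = clamp;        /* clamp to size of table (or table / 10) */
        return est;
    }

    int
    main(void)
    {
        /* invented ndistinct for customer_id, item_id, month, year */
        double  nd[4] = {50000.0, 20000.0, 12.0, 3.0};
        double  tuples = 1e7;   /* ~10M rows */

        printf("clamp to table size:      %.0f\n", product_rule(nd, 4, tuples));
        printf("clamp to table size / 10: %.0f\n", product_rule(nd, 4, tuples / 10));
        printf("actual groups (reported): ~200000\n");
        return 0;
    }

Even with the /10 clamp the estimate here is still five times the true group
count, but it is an order of magnitude closer than clamping to the table size.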

Comments?

regards, tom lane

#2Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#1)
Re: Group-count estimation statistics

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

That'd be nice, but sounds like a lot of work.

> As a short-term hack, I am thinking that the "clamp to size of table"
> part of the rule is overly pessimistic, and that we should consider
> something like "clamp to size of table / 10" instead. The justification
> for this is the thought that you aren't going to bother grouping unless
> it actually reduces the data volume. We have used similar rules in the

I definitely agree with this.

Stephen

#3Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#1)
Re: Group-count estimation statistics

Tom Lane <tgl@sss.pgh.pa.us> writes:

> The reason this happens even with stats is that the algorithm for
> estimating the number of groups in a multi-group-column situation
> is basically "take the product of the number of distinct values of
> each grouping column, but clamp to the number of rows estimated for
> the whole table". It turns out that customer_id and item_id are
> pretty highly correlated, enough that the actual number of distinct
> groups is barely a hundredth of what you get from the product rule.

So why is it any more reasonable for Postgres to assume 0 correlation than any
other value? Perhaps Postgres should calculate these cases assuming some
arbitrary level of correlation.

For example, pulling an algorithm out of nowhere, it could take the most
selective value and then, instead of multiplying by the next most selective
value directly, multiply by just its square root, then the cube root of the
third value, and the fourth root of the fourth value, etc.

So for 10M records grouped over three columns, each of which has 1,000
distinct values, you would get 1,000 * 1,000 ^ 1/2 * 1,000 ^ 1/3 or 316,228
distinct values. Which seems like not a bad guess actually.
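
That back-of-the-envelope calculation, as a throwaway sketch (purely
illustrative, not proposed planner code; it assumes the columns are ordered
from most to fewest distinct values):

    #include <math.h>
    #include <stdio.h>

    /* "Successive roots": full ndistinct of the most selective column,
     * square root of the next, cube root of the third, and so on. */
    static double
    successive_roots(const double *ndistinct, int ncols)
    {
        double  est = 1.0;
        int     i;

        for (i = 0; i < ncols; i++)
            est *= pow(ndistinct[i], 1.0 / (i + 1));
        return est;
    }

    int
    main(void)
    {
        double  nd[3] = {1000.0, 1000.0, 1000.0};

        printf("%.0f\n", successive_roots(nd, 3));      /* about 316228 */
        return 0;
    }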

To be honest I took the "successive roots" thing out of nowhere. I suspect
there's a more rigorous statistics approach which would have some actual
motivation. For a given assumed correlation and distribution there ought to be
a way to calculate the expected number of distinct combinations. Then when we
have some mechanism for providing real correlation we can just plug that in,
in place of the arbitrarily assumed correlation.

Actually the formula would be quite complex. As the total number of records
goes up the expected number of distinct values should approach the total
number of records, even if the number of distinct values of each column
doesn't change.

> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

There's another possible solution: if Postgres kept statistics on the actual
results of the query, it could later use that feedback to come up with better
guesses even if it doesn't know *why* they're better.

--
greg

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#3)
Re: Group-count estimation statistics

Greg Stark <gsstark@mit.edu> writes:

> So why is it any more reasonable for Postgres to assume 0 correlation than any
> other value? Perhaps Postgres should calculate these cases assuming some
> arbitrary level of correlation.

[ shrug... ] Sure, if you want to do the legwork to develop something
credible. But I think I'd still throw in the number-of-rows-over-10
clamp, or something much like it.

> As the total number of records
> goes up the expected number of distinct values should approach the total
> number of records, even if the number of distinct values of each column
> doesn't change.

Well, that's what I thought when I wrote the existing code, but it's
wrong: you don't GROUP BY unique combinations of columns over huge
tables --- or at least, you shouldn't expect great performance if you do.
It'd probably be more reasonable to use a heuristic that expects a
*smaller* fraction of distinct combinations, instead of a larger one,
as the table size goes up.

> There's another possible solution: if Postgres kept statistics on the actual
> results of the query, it could later use that feedback to come up with better
> guesses even if it doesn't know *why* they're better.

That's been proposed before but I think it's a blind alley. In most
cases (certainly with anything as complex as a multiply grouped query)
you're not going to be able to derive any trustworthy corrections to
your original statistical estimates. There are too many variables and
their relationships to the end costs are not simple.

regards, tom lane

#5Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#4)
Re: [pgsql-hackers] Group-count estimation statistics

Tom,

> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

Y'know, that's been on the todo list for a while. Surely someone is inspired
for 8.1/8.2? At least for columns which are indexed together?

> As a short-term hack, I am thinking that the "clamp to size of table"
> part of the rule is overly pessimistic, and that we should consider
> something like "clamp to size of table / 10" instead.  The justification
> for this is the thought that you aren't going to bother grouping unless
> it actually reduces the data volume.  We have used similar rules in the
> past --- for example, before the logic for trying to estimate actual
> group counts was put in, the estimate for the number of output rows
> from an Agg or Group node was just the number of input rows over 10.

Why 10? I'd think we could come up with a slightly less arbitrary number,
based on "At what point does the median possible cost of estmating too low
equal the median possible cost of estimating too high?" This seems
calculable based on the other information available ...

... although perhaps not without a math PhD. Surely there's one in the house?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

#6Kris Jurka
books@ejurka.com
In reply to: Tom Lane (#4)
Re: Group-count estimation statistics

On Fri, 28 Jan 2005, Tom Lane wrote:

> you don't GROUP BY unique combinations of columns over huge
> tables --- or at least, you shouldn't expect great performance if you do.

The proposed change biases towards a hash plan which has no provision for
spilling to disk. Slow is one thing, but excessive memory usage and
possibly failing is another thing.

Kris Jurka

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kris Jurka (#6)
Re: Group-count estimation statistics

Kris Jurka <books@ejurka.com> writes:

> The proposed change biases towards a hash plan which has no provision for
> spilling to disk. Slow is one thing, but excessive memory usage and
> possibly failing is another thing.

Keep in mind that we are replacing 7.4 code that had a serious tendency
to select hash plans when it really shouldn't, because of underestimated
table sizes. Now that we have the physical-size-driven estimate of
table rowcounts, I think we've gone too far over in the other direction.

Which is not to say that spill-to-disk logic wouldn't be a nice thing to
add, but I'm not going to panic about its not being there today. 7.4
presented a much greater hazard than we have now, but we got through
that cycle without a large number of complaints. There's always the
enable_hashagg switch if you really find yourself backed into a corner.

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#5)
Re: [pgsql-hackers] Group-count estimation statistics

Josh Berkus <josh@agliodbs.com> writes:

> Why 10? I'd think we could come up with a slightly less arbitrary number,

Well, it's probably within an order of magnitude of the right thing ;-).
We know we don't want 1, but 100 seems awfully optimistic.

If someone can come up with a more defensible number then I'm all for
it. Greg Stark's thought about a power correction seemed interesting
too, though again far too optimistic to trust without some good math
to back it up.

regards, tom lane

#9Sailesh Krishnamurthy
sailesh@cs.berkeley.edu
In reply to: Tom Lane (#1)
Re: Group-count estimation statistics

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> The only real solution, of course, is to acquire cross-column
Tom> statistics, but I don't see that happening in the near
Tom> future.

Another approach is a hybrid hashing scheme where we use a hash table
until we run out of memory, at which time we start spilling to disk. In
other words, no longer use SortAgg at all ...

Under what circumstances will a SortAgg consume more IOs than a
hybrid hash strategy?

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#10Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#8)
Re: Group-count estimation statistics

Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark's thought about a power correction seemed interesting too, though
> again far too optimistic to trust without some good math to back it up.

Fwiw, I'm pretty sure good math is not going to back up my off-the-cuff
algorithm. But I did like the answer it gave in this instance.

I'm told an actual solution to the problem is hard and probably not even
solvable in closed form. I'm still looking around but I suspect we would need
some pretty severe simplifying assumptions to make it work.

--
greg

#11Mischa
mischa.Sandberg@telus.net
In reply to: Greg Stark (#10)
Re: Group-count estimation statistics

From: Sailesh Krishnamurthy <sailesh@cs.berkeley.edu>

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> The only real solution, of course, is to acquire cross-column
Tom> statistics, but I don't see that happening in the near
Tom> future.

Another approach is a hybrid hashing scheme where we use a hash table
until we run out of memory at which time we start spilling to disk. In
other words, no longer use SortAgg at all ..

Under what circumstances will a SortAgg consumer more IOs than a
hybrid hash strategy ?

Goetz Graefe did a heck of a lot of analysis of this, prior to his being snapped
up by Microsoft. He also worked out a lot of the nitty-gritty for hybrid hash
algorithms, extending the Grace hash for spill-to-disk, and adding a kind of
recursion for really huge sets. The figures say that hybrid hash beats
sort-aggregate, across the board.
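
For what it's worth, the control flow is roughly the following toy sketch
(not the PostgreSQL executor; keys are small integers, the aggregate is
count(*), a direct-address array stands in for the hash table, and in-memory
buffers stand in for the temp files a real implementation would spill to):

    #include <stdio.h>
    #include <stdlib.h>

    #define MAXKEY      200     /* keys are 0 .. MAXKEY-1 */
    #define MEM_BUDGET  64      /* max distinct keys held "in memory" */
    #define NPART       4       /* number of spill partitions */

    /* Count one row for 'key'; returns 0 if that would admit a new key
     * beyond the budget, i.e. the row has to be spilled instead. */
    static int
    agg_add(long *counts, int *ndistinct, int key)
    {
        if (counts[key] == 0)
        {
            if (*ndistinct >= MEM_BUDGET)
                return 0;
            (*ndistinct)++;
        }
        counts[key]++;
        return 1;
    }

    int
    main(void)
    {
        long    mem[MAXKEY] = {0};
        int     mem_distinct = 0;
        int    *spill[NPART];
        long    spill_len[NPART] = {0};
        long    nrows = 10000, r, groups;
        int     p;

        for (p = 0; p < NPART; p++)
            spill[p] = malloc(nrows * sizeof(int));

        /* Pass 1: aggregate in memory; spill rows whose key doesn't fit. */
        for (r = 0; r < nrows; r++)
        {
            int key = (int) (r % MAXKEY);

            if (!agg_add(mem, &mem_distinct, key))
            {
                p = key % NPART;                  /* hash-partition the key */
                spill[p][spill_len[p]++] = key;   /* "write to temp file"   */
            }
        }
        groups = mem_distinct;

        /* Pass 2: each partition holds ~1/NPART of the overflow keys, so it
         * now fits in the same budget (a real implementation would recurse
         * or re-partition if it still didn't). */
        for (p = 0; p < NPART; p++)
        {
            long    part[MAXKEY] = {0};
            int     part_distinct = 0;
            long    j;

            for (j = 0; j < spill_len[p]; j++)
                agg_add(part, &part_distinct, spill[p][j]);
            groups += part_distinct;
            free(spill[p]);
        }

        printf("groups aggregated: %ld of %d\n", groups, MAXKEY);
        return 0;
    }

The hybrid part is that rows whose keys already made it into the in-memory
table keep aggregating there even after the memory budget is exhausted; only
rows for not-yet-seen keys are spilled.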

#12Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#1)
Re: Group-count estimation statistics

On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> we should consider
> something like "clamp to size of table / 10" instead.

... unless a *single* grouping column is estimated to have more than
N/10 distinct values, which should be easy to check.

Servus
Manfred

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#12)
Re: Group-count estimation statistics

Manfred Koizar <mkoi-pg@aon.at> writes:

> On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> we should consider
>> something like "clamp to size of table / 10" instead.
>
> ... unless a *single* grouping column is estimated to have more than
> N/10 distinct values, which should be easy to check.

Already done that way.

    /*
     * Clamp to size of rel, or size of rel / 10 if multiple Vars.
     * The fudge factor is because the Vars are probably correlated
     * but we don't know by how much.
     */
    double clamp = rel->tuples;

    if (relvarcount > 1)
        clamp *= 0.1;
    if (reldistinct > clamp)
        reldistinct = clamp;

regards, tom lane

#14Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#13)
Re: Group-count estimation statistics

On Mon, 31 Jan 2005 11:20:31 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Already done that way.
> if (relvarcount > 1)
>     clamp *= 0.1;

That's not what I meant. I tried to say that if we have a GROUP BY
several columns and one of these columns alone has more than N/10
distinct values, there's no way to get less than that many groups.

Servus
Manfred

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#14)
Re: Group-count estimation statistics

Manfred Koizar <mkoi-pg@aon.at> writes:

> That's not what I meant. I tried to say that if we have a GROUP BY
> several columns and one of these columns alone has more than N/10
> distinct values, there's no way to get less than that many groups.

Oh, I see, you want a "max" calculation in there too. Seems reasonable.
Any objections?

regards, tom lane

#16Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#15)
Re: Group-count estimation statistics

On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Manfred Koizar <mkoi-pg@aon.at> writes:
>
>> That's not what I meant. I tried to say that if we have a GROUP BY
>> several columns and one of these columns alone has more than N/10
>> distinct values, there's no way to get less than that many groups.
>
> Oh, I see, you want a "max" calculation in there too. Seems reasonable.
> Any objections?

Yes. :-( What I said is only true in the absence of any WHERE clause
(or join). Otherwise the same cross-column correlation issues you tried
to work around with the N/10 clamping might come back through the
backdoor. I'm not sure whether coding for such a narrow use case is
worth the trouble. Forget my idea.

Servus
Manfred

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#16)
Re: Group-count estimation statistics

Manfred Koizar <mkoi-pg@aon.at> writes:

> On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Oh, I see, you want a "max" calculation in there too. Seems reasonable.
>> Any objections?
>
> Yes. :-( What I said is only true in the absence of any WHERE clause
> (or join). Otherwise the same cross-column correlation issues you tried
> to work around with the N/10 clamping might come back through the
> backdoor. I'm not sure whether coding for such a narrow use case is
> worth the trouble. Forget my idea.

No, I think it's still good. The WHERE clauses are factored in
separately (essentially by assuming their selectivity on the grouped
rows is the same as it would be on the raw rows, which is pretty bogus
but it's hard to do better). The important point is that the group
count before WHERE filtering certainly does behave as you suggest,
and so the clamp is going to be overoptimistic if it clamps to less than
the largest individual number-of-distinct-values.
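
Concretely, combining the snippet quoted above (#13) with that lower bound
would look something like this sketch (not actual source; the figures in
main() are made up):

    #include <stdio.h>

    static double
    clamp_groups(double reldistinct,        /* product-rule estimate */
                 double max_col_distinct,   /* largest single-column ndistinct */
                 double tuples, int relvarcount)
    {
        double  clamp = tuples;

        if (relvarcount > 1)
        {
            clamp *= 0.1;
            if (clamp < max_col_distinct)
                clamp = max_col_distinct;   /* can't have fewer groups than this */
        }
        if (reldistinct > clamp)
            reldistinct = clamp;
        return reldistinct;
    }

    int
    main(void)
    {
        /* 10M rows, product-rule estimate 3.6e10, and one grouping column
         * that alone has 2M distinct values (made-up figures) */
        printf("tuples/10 alone:      %.0f\n", clamp_groups(3.6e10, 0, 1e7, 4));
        printf("with the lower bound: %.0f\n", clamp_groups(3.6e10, 2e6, 1e7, 4));
        return 0;
    }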

regards, tom lane