estimating # of distinct values

Started by Tomas Vondra over 15 years ago | 58 messages | hackers
#1Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi everyone,

about two weeks ago I started a thread about cross-column stats. One
of the proposed solutions is based on the number of distinct values, and
the truth is the current distinct estimator has some problems.

I've done a bit of research - I have read about 20 papers on this, and
I'd like to present a short summary here, possible solutions etc.

The simple truth is

1) sampling-based estimators are a dead-end
2) there are very interesting stream-based estimators
3) the stream-based estimators have some disadvantages

I wrote a more thorough summary on a wiki
(http://wiki.postgresql.org/wiki/Estimating_Distinct) with a list of the
most interesting papers etc. so just a very short explanation.

1) sampling-based estimators are a dead-end

The paper from Charikar & Chaudhuri states (and proves) that for
each sampling-based estimator there's a data set where the estimator
produces an arbitrarily large error (with non-negligible probability).
So replacing one sampling-based estimator with another generally does
not improve the situation - it fixes one dataset and breaks another.

The error is directly related to the size of the sample, so the
truth is that to get a good distinct estimate you need to scan a
significant portion of the table (almost all of it).

So the estimators based on tiny samples are a dead end.

2) there are very interesting stream-based estimators

A very interesting idea comes from the field of data streams. These
estimators are based on a single pass through the data, followed by
incremental updates. A very nice property is that these algorithms
don't need much space, as they don't store the actual values.

The idea behind this is similar to the Bloom filter, i.e. set bits
of a bitmap using a hash of the value. But in this case it's not
required to be able to answer questions like 'is this an element
of the set X?' so it's actually even more space efficient. For an
introduction see the first paper published in 1985 by Flajolet
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7100).

One of the recent articles (published in June 2010) actually presents
an optimal algorithm with O(e^-2 + log(n)) space complexity and O(1)
update complexity. Here "e" is the relative precision, i.e. the
estimate falls within [(1-e)F, (1+e)F], where F is the true value.

The paper on the self-learning bitmap states that to track 10^9 distinct
values with 1% relative error you need about 61 kilobits of space
(which is absolutely awesome). The optimal algorithm needs a bit more
space, I think.

A very interesting solution is "distinct sampling", which somehow
extends Wegman's adaptive sampling approach.
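To make the idea concrete, here is a tiny sketch (my illustration, not code from any of the papers) of Flajolet-Martin probabilistic counting: hash each value, remember the deepest run of trailing zero bits ever seen, and invert the expectation. The averaging over several hash functions and the 0.77351 bias constant follow the 1985 Flajolet paper; the function names and hash choice are made up for the example:

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in x (position of the lowest set bit)."""
    if x == 0:
        return 32
    return (x & -x).bit_length() - 1

def fm_estimate(values, num_sketches=32):
    """Flajolet-Martin sketch: for each of num_sketches hash functions,
    track the maximum number of trailing zeros seen; average the maxima
    and invert. State is just num_sketches small integers - no values
    are stored."""
    r = [0] * num_sketches
    for v in values:
        for i in range(num_sketches):
            h = int.from_bytes(
                hashlib.blake2b(f"{i}:{v}".encode(), digest_size=4).digest(),
                "big")
            r[i] = max(r[i], trailing_zeros(h))
    mean_r = sum(r) / num_sketches
    return (2 ** mean_r) / 0.77351  # 0.77351 corrects the systematic bias
```

Note that duplicates don't change the state at all (the max is already set), which is exactly why one pass plus incremental inserts is enough - and why deletes are hard: clearing state would require knowing that no other live value ever produced it.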

3) the stream-based estimators have some disadvantages

Not everything is perfect, though - the most serious disadvantage is
that those algorithms (usually) don't handle removal of elements.
Inserts are easy to handle, but deletes are hard (just as in case of
Bloom filters).

So using this stream-based approach might lead to degradation in
case of heavily updated tables :-(

I see two possible solutions here:

(a) use counters instead of bits, and track actual counts - but this
is tricky, especially with 'probabilistic' algorithms and needs
much more space (but still, 32x 61kb is just 250kB)

(b) count how many deletes/updates were performed, and if needed
rebuild the whole bitmap
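For option (a), a minimal sketch of what "counters instead of bits" could look like (all names and sizes are mine, and the estimate is plain linear counting rather than one of the fancier algorithms): each slot counts how many live rows hash into it, so a delete simply decrements, and a slot only returns to zero when the last row mapping there is gone:

```python
import hashlib
import math

class CountingDistinctSketch:
    """Linear counting with a counter per slot instead of a bit, so
    deletes can be handled by decrementing (option (a) above)."""
    def __init__(self, slots=1 << 16):
        self.n = slots
        self.counts = [0] * slots

    def _slot(self, v):
        h = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.n

    def insert(self, v):
        self.counts[self._slot(v)] += 1

    def delete(self, v):
        s = self._slot(v)
        if self.counts[s] > 0:
            self.counts[s] -= 1

    def estimate(self):
        zero = self.counts.count(0)
        if zero == 0:
            return float(self.n)  # sketch saturated, needs more slots
        # linear counting: E[zero slots] = n * exp(-D/n), solve for D
        return -self.n * math.log(zero / self.n)
```

This also shows the space cost mentioned above: one counter per slot instead of one bit, i.e. 32x more space with 32-bit counters.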

But despite these disadvantages, there really is no other
way to improve the estimates. I don't think this should be the
default behavior - just as with cross-column stats, this should
be optional, for cases where the current estimator does not work well.

regards
Tomas

#2Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#1)
Re: estimating # of distinct values

2010/12/27 Tomas Vondra <tv@fuzzy.cz>:

  But despite these disadvantages, there really is no other
  way to improve the estimates. I don't think this should be the
  default behavior - just as with cross-column stats, this should
  be optional, for cases where the current estimator does not work well.

This is going to be a lot of work to implement, so before you do it,
we should try to reach a consensus that (a) it's part of an overall
strategy that the community generally supports and (b) we have
consensus on the design for this part.

With respect to (a), I have to admit I've found the discussion on
cross-column stats to be quite difficult to follow. I'd like to see a
very simple description of exactly what information we're going to
store, under what circumstances we'll store it, and how we'll use it
to compute selectivity estimates.

With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my head
there seem to be some pretty serious feasibility problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#2)
Re: estimating # of distinct values

Robert Haas <robertmhaas@gmail.com> wrote:

With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my
head there seems to be some pretty serious feasibility problems.

I had one random thought on that -- it seemed like a large concern
was that there would need to be at least an occasional scan of the
entire table to rebuild the distinct value information. Don't we
already require an occasional scan of the entire table for freezing
transaction IDs? Could this be part of any vacuum of the entire
table?

-Kevin

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#3)
Re: estimating # of distinct values

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Robert Haas <robertmhaas@gmail.com> wrote:

With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my
head there seems to be some pretty serious feasibility problems.

I had one random thought on that -- it seemed like a large concern
was that there would need to be at least an occasional scan of the
entire table to rebuild the distinct value information. Don't we
already require an occasional scan of the entire table for freezing
transaction IDs? Could this be part of any vacuum of the entire
table?

Well, first, those scans occur only once every few hundred million
transactions, which is not likely a suitable timescale for maintaining
statistics. And second, we keep on having discussions about rejiggering
the whole tuple-freezing strategy. Even if piggybacking on those scans
looked useful, it'd be unwise to assume it'll continue to work the same
way it does now.

regards, tom lane

#5Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#4)
Re: estimating # of distinct values

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, first, those scans occur only once every few hundred million
transactions, which is not likely a suitable timescale for
maintaining statistics.

I was assuming that the pass of the entire table was priming for the
incremental updates described at the start of this thread. I'm not
clear on how often the base needs to be updated for the incremental
updates to keep the numbers "close enough".

And second, we keep on having discussions about rejiggering
the whole tuple-freezing strategy. Even if piggybacking on those
scans looked useful, it'd be unwise to assume it'll continue to
work the same way it does now.

Sure, it might need to trigger its own scan in the face of heavy
deletes anyway, since the original post points out that the
algorithm handles inserts better than deletes, but as long as we
currently have some sequential pass of the data, it seemed sane to
piggyback on it when possible. And maybe we should be considering
things like this when we weigh the pros and cons of rejiggering.
This issue of correlated values comes up pretty often....

-Kevin

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#2)
Re: estimating # of distinct values

On 27.12.2010 22:46, Robert Haas wrote:

2010/12/27 Tomas Vondra <tv@fuzzy.cz>:

But despite these disadvantages, there really is no other
way to improve the estimates. I don't think this should be the
default behavior - just as with cross-column stats, this should
be optional, for cases where the current estimator does not work well.

This is going to be a lot of work to implement, so before you do it,
we should try to reach a consensus that (a) it's part of an overall
strategy that the community generally supports and (b) we have
consensus on the design for this part.

Yes, it is going to be a lot of work. But I don't think we need a
consensus of the whole community prior to building something that works.
I plan to build a very simple prototype soon, so let's talk about this
later.

I've started these two threads mainly as a 'brainstorming' - to present
what I've learned from the various papers, propose a possible solution,
highlight issues I see, and get ideas on how to solve them.

With respect to (a), I have to admit I've found the discussion on
cross-column stats to be quite difficult to follow. I'd like to see a
very simple description of exactly what information we're going to
store, under what circumstances we'll store it, and how we'll use it
to compute selectivity estimates.

Yes, I know it was difficult to follow that discussion. That's why I
created a 'summary page' on wiki

http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Generally we need to gather two types of stats:

(a) multi-dimensional histogram - this can be done about the same way
we create single-dimensional histograms, that's not a big problem
(although more data might be necessary to get accurate histograms)

(b) # of distinct values - this is the tricky part, as described at
the beginning of this thread

With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my head
there seems to be some pretty serious feasibility problems.

Well, it's not going to be easy, that's for sure. My "big plan" is
something like this:

(a) Build a simple 'contrib-like' module that allows manual collection
of stats and computing of estimates (not integrated with the
optimizer in any way).

This should help us better understand what stats do we need etc.

(b) Minimal integration with the core - still manual collection of
stats, with the optimizer using the stats if available.

(c) Enhancements - automatic updates of the stats, etc.

And all this should be implemented so that you don't have to pay unless
you want to use the new stats.

regards
Tomas

#7Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kevin Grittner (#5)
Re: estimating # of distinct values

On 28.12.2010 00:04, Kevin Grittner wrote:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, first, those scans occur only once every few hundred million
transactions, which is not likely a suitable timescale for
maintaining statistics.

I was assuming that the pass of the entire table was priming for the
incremental updates described at the start of this thread. I'm not
clear on how often the base needs to be updated for the incremental
updates to keep the numbers "close enough".

Well, that really depends on the workload. If you never remove all
occurrences of a given value (e.g. a ZIP code), you don't need to rescan
the table at all.

All you need is to build the stats once and then update them
incrementally - and I believe this could be handled by autovacuum.

And second, we keep on having discussions about rejiggering
the whole tuple-freezing strategy. Even if piggybacking on those
scans looked useful, it'd be unwise to assume it'll continue to
work the same way it does now.

Sure, it might need to trigger its own scan in the face of heavy
deletes anyway, since the original post points out that the
algorithm handles inserts better than deletes, but as long as we
currently have some sequential pass of the data, it seemed sane to
piggyback on it when possible. And maybe we should be considering
things like this when we weigh the pros and cons of rejiggering.
This issue of correlated values comes up pretty often....

Again, there are two types of stats - one of them needs to scan the
whole table (estimates of distinct values), the other one does not
(multi-dimensional histograms).

These two cases are independent - you don't necessarily need both.

Better ndistinct estimates would actually solve some of the current
issues, it's not just a matter of cross-column stats.

regards
Tomas

#8Josh Berkus
josh@agliodbs.com
In reply to: Tomas Vondra (#1)
Re: estimating # of distinct values

The simple truth is

1) sampling-based estimators are a dead-end

While I don't want to discourage you from working on stream-based
estimators ... I'd love to see you implement a proof-of-concept for
PostgreSQL, and test it ... the above is a non-argument. It requires us
to accept that sample-based estimates cannot ever be made to work,
simply because you say so.

The Charikar and Chaudhuri paper does not, in fact, say that it is
impossible to improve sampling-based estimators as you claim it does. In
fact, the authors offer several ways to improve sampling-based
estimators. Further, 2000 was hardly the end of sampling-estimation
paper publication; there are later papers with newer ideas.

For example, I still think we could tremendously improve our current
sampling-based estimator without increasing I/O by moving to block-based
estimation*. The accuracy statistics for block-based samples of 5% of
the table look quite good.

I would agree that it's impossible to get a decent estimate of
n-distinct from a 1% sample. But there's a huge difference between 5%
or 10% and "a majority of the table".

Again, don't let this discourage you from attempting to write a
stream-based estimator. But do realize that you'll need to *prove* its
superiority, head-to-head, against sampling-based estimators.

[* http://www.jstor.org/pss/1391058 (unfortunately, no longer
public-access)]

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#9Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#8)
Re: estimating # of distinct values

On Tue, Dec 28, 2010 at 1:39 AM, Josh Berkus <josh@agliodbs.com> wrote:

While I don't want to discourage you from working on stream-based
estimators ... I'd love to see you implement a proof-of-concept for
PostgreSQL, and test it ... the above is a non-argument.  It requires us
to accept that sample-based estimates cannot ever be made to work,
simply because you say so.

This argument has been made on this mailing list and on
pgsql-performance many times, so I have been assuming that this is
settled mathematics. Admittedly, I don't have a citation for this...

I would agree that it's impossible to get a decent estimate of
n-distinct from a 1% sample.  But there's a huge difference between 5%
or 10% and "a majority of the table".

Considering the way that disks work, it's not that huge. If the table
is small enough to fit in memory conveniently, it probably wouldn't be
unacceptably costly even to read the whole thing. But if it's a
terabyte on disk, reading every tenth block is probably going to carry
most of the I/O cost of reading every block. Even reading every
twentieth or hundredth block is going to be significantly more
expensive than what we do now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Josh Berkus (#8)
Re: estimating # of distinct values

The simple truth is

1) sampling-based estimators are a dead-end

The Charikar and Chaudhuri paper does not, in fact, say that it is
impossible to improve sampling-based estimators as you claim it does. In
fact, the authors offer several ways to improve sampling-based
estimators. Further, 2000 was hardly the end of sampling-estimation
paper publication; there are later papers with newer ideas.

Well, the paper states that there is a lower bound on the possible error
of a sampling-based estimator, depending on the sample size. The actual
inequality is

error(d) >= sqrt( (n-r)/(2r) * 1/q )

where error(d) is the "ratio error" (d - estimated number of distinct
values, D - actual number of distinct values):

error(d) = max{ D/d, d/D }

And for any estimator there is a data set on which, with probability at
least q (which you can choose), the error reaches this bound.

Say you have a table with 1,000,000 rows and you use a sample of 1,000
rows to do the estimate. In that case you get

error(d) >= 99 with q = 0.05
error(d) >= 70 with q = 0.1
error(d) >= 31 with q = 0.5

If you scan 10% of the table, you get this:

error(d) >= 9.5 with q = 0.05
error(d) >= 6.7 with q = 0.1
error(d) >= 3 with q = 0.5

So even with 10% of the table, there's a 10% probability of getting an
estimate that's at least 7x over- or underestimated. With lower
probability the interval is much wider.
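These numbers are easy to re-derive from the inequality above; a throwaway script (mine, just to check the arithmetic):

```python
import math

def ratio_error_bound(n, r, q):
    """Charikar & Chaudhuri lower bound: for a table of n rows and a
    sample of r rows, some dataset forces a ratio error of at least
    sqrt((n - r)/(2r) * 1/q), with probability at least q."""
    return math.sqrt((n - r) / (2 * r) / q)

n = 1_000_000
for r in (1_000, 100_000):               # 0.1% sample vs. 10% sample
    for q in (0.05, 0.1, 0.5):
        print(f"r={r:>7}, q={q:<4}: error(d) >= "
              f"{ratio_error_bound(n, r, q):.1f}")
```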

For example, I still think we could tremendously improve our current
sampling-based estimator without increasing I/O by moving to block-based
estimation*. The accuracy statistics for block-based samples of 5% of
the table look quite good.

Well, that's certainly possible. But at best you can achieve the
error(d) lower bound consistently (for all datasets) - you can't do
better. And they've already presented an estimator that does exactly
that (called AE - Adaptive Estimator - in the paper).

I would agree that it's impossible to get a decent estimate of
n-distinct from a 1% sample. But there's a huge difference between 5%
or 10% and "a majority of the table".

Sure we can do better. But there's a limit we can't cross no matter what
estimator we choose and how large the sample will be.

I've seen several post-2000 papers on sample-based estimators, but those
are mostly hybrid estimators, i.e. estimators composed of several simple
estimators. So in the end it's pretty complicated, you need to gather a
lot of different stats, and still you can't beat the lower bound on the
error :-(

So yes, we can improve the current estimator (making it more complex),
but it does not actually solve the problem.

Again, don't let this discourage you from attempting to write a
stream-based estimator. But do realize that you'll need to *prove* its
superiority, head-to-head, against sampling-based estimators.

[* http://www.jstor.org/pss/1391058 (unfortunately, no longer
public-access)]

It's available here http://www.stat.washington.edu/research/reports/1990s/

But it does not present an estimator contradicting the Charikar and
Chaudhuri paper - it's still 'just' a sample-based estimator, although
they draw the sample at the block level. But yes, that's a good idea and
I've already mentioned it in the cross-column stats thread, I think.

The question is whether a sample obtained this way will be as good as
the current samples. You could get quite separate 'clusters' of values,
one cluster per block, even though in reality the values are uniformly
distributed. And I guess the resulting histogram would be crippled by
this too.

But if you know about interesting papers on sample-based estimators
(especially post-2000), let me know. I've searched for them, but those I
found were not very interesting IMHO.

Tomas

#11Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tomas Vondra (#10)
Re: estimating # of distinct values

<tv@fuzzy.cz> wrote:

So even with 10% of the table, there's a 10% probability to get an
estimate that's 7x overestimated or underestimated. With lower
probability the interval is much wider.

Hmmm... Currently I generally feel I'm doing OK when the estimated
rows for a step are in the right order of magnitude -- a 7% error
would be a big improvement in most cases. Let's not lose track of
the fact that these estimates are useful even when they are not
dead-on accurate.

-Kevin

#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kevin Grittner (#11)
Re: estimating # of distinct values

<tv@fuzzy.cz> wrote:

So even with 10% of the table, there's a 10% probability to get an
estimate that's 7x overestimated or underestimated. With lower
probability the interval is much wider.

Hmmm... Currently I generally feel I'm doing OK when the estimated
rows for a step are in the right order of magnitude -- a 7% error
would be a big improvement in most cases. Let's not lose track of
the fact that these estimates are useful even when they are not
dead-on accurate.

Well, but that's not 7%, that's 7x! And the theorem says 'greater or
equal', so this is actually the minimum - you can get a much bigger
difference with lower probability. So you can easily get an estimate
that is a few orders of magnitude off.

Anyway, I really don't want precise values, just a reasonable estimate.
As I said, we could use the AE estimator proposed in the paper. It has
the nice feature that it actually reaches the lower bound (thus the
inequality turns into an equality). The downside is that there are
estimators with better behavior on some datasets.

Tomas

#13Josh Berkus
josh@agliodbs.com
In reply to: Tomas Vondra (#12)
Re: estimating # of distinct values

Well, but that's not 7%, that's 7x! And the theorem says 'greater or
equal', so this is actually the minimum - you can get a much bigger
difference with lower probability. So you can easily get an estimate
that is a few orders of magnitude off.

FWIW, based on query performance, estimates which are up to 5X off are
tolerable, and anything within 3X is considered "accurate". Above 5X
the probability of bad query plans becomes problematically high.

Of course, if you're doing cross-column stats, the accuracy of each
individual column becomes critical, since estimation errors could be
multiplicative in the worst case (i.e. if colA is 3X off and colB is
3X off then colA<->colB could be 9X off).

Anyway, I look forward to your experiments with stream-based estimators.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#14Florian Pflug
fgp@phlo.org
In reply to: Kevin Grittner (#3)
Re: estimating # of distinct values

On Dec 27, 2010, at 23:49, Kevin Grittner wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my
head there seems to be some pretty serious feasibility problems.

I had one random thought on that -- it seemed like a large concern
was that there would need to be at least an occasional scan of the
entire table to rebuild the distinct value information.

I believe we could actually avoid that.

First, the paper "An Optimal Algorithm for the Distinct Elements Problem"
actually contains an algorithm which *does* handle deletes - it's called
the "L_0" estimate there.

Second, as Tomas pointed out, the stream-based estimator is essentially a
simplified version of a bloom filter. It starts out with a field of
N zero bits, and sets K of them to 1 for each value v in the stream.
Which bits are set to 1 depends on some hash function(s) H_i(v). It's
then easy to compute how many 1-bits you'd expect to find in the bit
field after seeing D distinct values, and by reversing that you can
estimate D from the number of 1-bits in the bit field.

To avoid having to rescan large tables, instead of storing one such
bit field, we'd store one per B pages of data. We'd then only need
to scan a range of B pages around every updated or deleted tuple,
and could afterwards compute a new global estimate of D by combining
the individual bit fields with bitwise OR.
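A sketch of that combination step (my illustration, with one hash per value, a linear-counting style inversion, and arbitrarily chosen sizes): each block range keeps its own bit field, the fields are merged with bitwise OR, and D is estimated by inverting the expected count of zero bits:

```python
import hashlib
import math

N = 1 << 12  # bits per field (illustrative size)

def bitfield(values):
    """Set one hash-chosen bit per value - one field per block range."""
    bf = 0
    for v in values:
        h = int.from_bytes(hashlib.blake2b(str(v).encode(),
                                           digest_size=8).digest(), "big") % N
        bf |= 1 << h
    return bf

def estimate_distinct(bf):
    """Invert E[zero bits] = N * (1 - 1/N)^D to recover D."""
    zeros = N - bin(bf).count("1")
    if zeros == 0:
        return float(N)  # field saturated; a bigger N is needed
    return math.log(zeros / N) / math.log(1 - 1 / N)

# two overlapping block ranges; OR-ing the fields estimates the union
combined = bitfield(range(0, 600)) | bitfield(range(400, 1000))
```

After an update to a block, only that range's field has to be rebuilt from its B pages; the global estimate is then recomputed from the OR of all fields.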

Since the need to regularly VACUUM tables hit by updates and deletes
won't go away any time soon, we could piggy-back the bit field
rebuilding onto VACUUM to avoid a second scan.

A good value for B would probably be around
1000*<size of bitfield>/<page size>. If the bitfield needs ~100k, that'd
make B ~= 12000 pages ~= 100MB.
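The arithmetic, assuming PostgreSQL's default 8kB page (the ~12000 above is presumably the same figure, rounded):

```python
page_size = 8192            # PostgreSQL default page size in bytes
bitfield_size = 100 * 1024  # ~100kB per bit field, as above

B = 1000 * bitfield_size // page_size
span_mb = B * page_size / (1024 * 1024)
print(f"B = {B} pages, i.e. one bit field per ~{span_mb:.0f} MB of table")
```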

best regards,
Florian Pflug

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Florian Pflug (#14)
Re: estimating # of distinct values

On 30.12.2010 15:43, Florian Pflug wrote:

On Dec 27, 2010, at 23:49, Kevin Grittner wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work. Off the top of my
head there seems to be some pretty serious feasibility problems.

I had one random thought on that -- it seemed like a large concern
was that there would need to be at least an occasional scan of the
entire table to rebuild the distinct value information.

I believe we could actually avoid that.

First, the paper "An Optimal Algorithm for the Distinct Elements Problem"
actually contains an algorithm which *does* handle deletes - it's called
"L_0" estimate there.

Hmmm, that's interesting. I know there's a part about L_0 estimation,
but that's about estimating the Hamming norm of a vector - so I ignored
it, as I thought we couldn't use it to estimate the number of distinct
values. But if it really handles deletions, and if we can use it, then
it's really interesting.

Second, as Tomas pointed out, the stream-based estimator is essentially a
simplified version of a bloom filter. It starts out with a field of
N zero bits, and sets K of them to 1 for each value v in the stream.
Which bits are set to 1 depends on some hash function(s) H_i(v). It's
then easy to compute how many 1-bits you'd expect to find in the bit
field after seeing D distinct values, and by reversing that you can
estimate D from the number of 1-bits in the bit field.

No, I haven't said the stream-based estimators are simplified versions
of a Bloom filter. I said the approach is very similar - all the
algorithms use bitmaps and hash functions - but the algorithms (Bloom
filter vs. probabilistic counting and adaptive sampling) are very
different.

The Bloom filter is much more straightforward. The other algorithms are
much more sophisticated, which allows them to use less space.

To avoid having to rescan large tables, instead of storing one such
bit field, we'd store one per B pages of data. We'd then only need
to scan a range of B pages around every updated or deleted tuple,
and could afterwards compute a new global estimate of D by combining
the individual bit fields with bitwise OR.

I don't think this could help.

1) This works just with the Bloom filters, not with the other
algorithms (you can't combine the segments using bitmap OR).

2) With heavily modified tables the updates are usually 'spread'
through the whole table, so you'll have to rebuild all the
segments anyway.

Since the need to regularly VACUUM tables hit by updates and deletes
won't go away any time soon, we could piggy-back the bit field
rebuilding onto VACUUM to avoid a second scan.

Well, I guess it's a bit more complicated. First of all, there's a local
VACUUM when doing HOT updates. Second, you need to handle inserts too
(what if the table just grows?).

But I'm not a VACUUM expert, so maybe I'm wrong and this is the right
place to handle rebuilds of distinct stats.

I was thinking about something else - we could 'attach' the rebuild to
an actual seq scan if the amount of changes reaches some threshold
(since the last rebuild). Only if the amount of changes reaches a
higher threshold would we rebuild the stats on our own.

Something like

IF (# of updates + deletes > 5%)  THEN wait for a seq scan
IF (# of updates + deletes > 10%) THEN rebuild the stats
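Sketched as code (the thresholds are just the example numbers from above, not anything PostgreSQL actually exposes):

```python
def ndistinct_maintenance(total_rows: int, changed_rows: int) -> str:
    """Two-threshold policy: moderate churn waits for a seq scan to
    piggyback on; heavy churn triggers a rebuild of its own."""
    churn = changed_rows / total_rows
    if churn > 0.10:
        return "rebuild the stats now"
    if churn > 0.05:
        return "piggyback rebuild on the next seq scan"
    return "no action"
```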

I've found a nice ppt describing how Oracle does this:

http://www.oraclegeek.net/downloads/One_Pass_Distinct_Sampling.ppt

and there's PDF version

http://www.oraclegeek.net/downloads/OnePassDistinctSampling.pdf

According to this, Oracle is using the probabilistic counting approach
(see slide 26). And once they find out there were too many changes in
the table, they rebuild the whole thing.

I'm not saying we should do exactly the same, just that this might be a
good direction.

regards
Tomas

#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#15)
Re: estimating # of distinct values

Excerpts from Tomas Vondra's message of Thu Dec 30 16:38:03 -0300 2010:

Since the need to regularly VACUUM tables hit by updates and deletes
won't go away any time soon, we could piggy-back the bit field
rebuilding onto VACUUM to avoid a second scan.

Well, I guess it's a bit more complicated. First of all, there's a local
VACUUM when doing HOT updates. Second, you need to handle inserts too
(what if the table just grows?).

But I'm not a VACUUM expert, so maybe I'm wrong and this is the right
place to handle rebuilds of distinct stats.

I was thinking that we could have two different ANALYZE modes, one
"full" and one "incremental"; autovacuum could be modified to use one or
the other depending on how many changes there are (of course, the user
could request one or the other, too; not sure what should be the default
behavior). So the incremental one wouldn't worry about deletes, only
inserts, and could be called very frequently. The other one would
trigger a full table scan (or nearly so) to produce a better estimate in
the face of many deletions.

I haven't followed this discussion closely so I'm not sure that this
would be workable.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#16)
Re: estimating # of distinct values

Alvaro Herrera <alvherre@commandprompt.com> writes:

I was thinking that we could have two different ANALYZE modes, one
"full" and one "incremental"; autovacuum could be modified to use one or
the other depending on how many changes there are (of course, the user
could request one or the other, too; not sure what should be the default
behavior).

How is an incremental ANALYZE going to work at all? It has no way to
find out the recent changes in the table, for *either* inserts or
deletes. Unless you want to seqscan the whole table looking for tuples
with xmin later than something-or-other ... which more or less defeats
the purpose.

regards, tom lane

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#17)
Re: estimating # of distinct values

Excerpts from Tom Lane's message of Thu Dec 30 23:02:04 -0300 2010:

Alvaro Herrera <alvherre@commandprompt.com> writes:

I was thinking that we could have two different ANALYZE modes, one
"full" and one "incremental"; autovacuum could be modified to use one or
the other depending on how many changes there are (of course, the user
could request one or the other, too; not sure what should be the default
behavior).

How is an incremental ANALYZE going to work at all? It has no way to
find out the recent changes in the table, for *either* inserts or
deletes. Unless you want to seqscan the whole table looking for tuples
with xmin later than something-or-other ... which more or less defeats
the purpose.

Yeah, I was thinking that this incremental ANALYZE would be the stream
in the "stream-based estimator" but evidently it doesn't work that way.
The stream that needs to be passed to the estimator consists of new
tuples as they are being inserted into the table, so this would need to
be done by the inserter process ... or it'd need to transmit the CTIDs
for someone else to stream them ... not an easy thing, in itself.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#19Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#18)
Re: estimating # of distinct values

On Dec 31, 2010, at 7:34 AM, Alvaro Herrera wrote:

Excerpts from Tom Lane's message of Thu Dec 30 23:02:04 -0300 2010:

Alvaro Herrera <alvherre@commandprompt.com> writes:

I was thinking that we could have two different ANALYZE modes, one
"full" and one "incremental"; autovacuum could be modified to use one or
the other depending on how many changes there are (of course, the user
could request one or the other, too; not sure what should be the default
behavior).

How is an incremental ANALYZE going to work at all? It has no way to
find out the recent changes in the table, for *either* inserts or
deletes. Unless you want to seqscan the whole table looking for tuples
with xmin later than something-or-other ... which more or less defeats
the purpose.

Yeah, I was thinking that this incremental ANALYZE would be the stream
in the "stream-based estimator" but evidently it doesn't work that way.
The stream that needs to be passed to the estimator consists of new
tuples as they are being inserted into the table, so this would need to
be done by the inserter process ... or it'd need to transmit the CTIDs
for someone else to stream them ... not an easy thing, in itself.

Perhaps listen/notify could be used for this, now that it allows passing a payload.

BTW, if we reduce the frequency at which full scans of large tables are needed, then presumably the cost of those scans can be largely ignored. If we don't need to scan frequently, we shouldn't care much about how long a scan takes, which means we could throttle it heavily. Presumably even a heavily used system can spare 500 kB/s of I/O for background scanning...
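As a rough sanity check on that 500 kB/s budget (the table sizes below are illustrative, not from the thread), the time per throttled full pass works out to:

```python
# Back-of-the-envelope: duration of one sequential pass over a table
# when the background scan is throttled to 500 kB/s.
RATE = 500 * 1024  # bytes per second

for size_gb in (1, 10, 100):
    size_bytes = size_gb * 1024**3
    hours = size_bytes / RATE / 3600
    print(f"{size_gb:>4} GB table: ~{hours:.1f} hours per full pass")
```

So a 10 GB table costs on the order of six hours per pass at that rate; whether that is acceptable depends entirely on how rarely the full scans are needed, which is the point being made above.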
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#20Csaba Nagy
ncslists@googlemail.com
In reply to: Tom Lane (#17)
Re: estimating # of distinct values

On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote:

How is an incremental ANALYZE going to work at all?

How about a kind of continuous ANALYZE?

Instead of analyzing just once and then dropping the intermediate results,
keep them on disk for all tables, and piggyback on the background writer
(or have a dedicated process if that's not algorithmically feasible):
before writing out dirty buffers, update the statistics based on the
values found in the modified pages. It could probably take a random
sample of buffers to minimize the overhead, but if this is done by a
background thread the overhead should be minimal anyway on multi-core
systems.
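One property that might make this piggybacking workable (a hypothetical illustration; the sketch size and function names are made up, and this is not PostgreSQL code) is that hash-bitmap sketches are order-independent and mergeable: each pass over a batch of modified buffers can build a local sketch, and combining two sketches is just a bitwise OR, so no coordination is needed about which process saw which tuples first.

```python
import hashlib

M = 1 << 12  # sketch size in bits (illustrative)

def sketch(values):
    """Build a hash bitmap over one batch of values (e.g. one buffer's tuples)."""
    bits = 0
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big") % M
        bits |= 1 << h
    return bits

# Two batches built independently (different buffers, different order)...
a = sketch(range(0, 100))
b = sketch(reversed(range(50, 150)))
# ...merge with a single bitwise OR; the result is identical to a sketch
# built over the union of both batches in one pass.
assert (a | b) == sketch(range(0, 150))
```

This is why a background process that only sees a random subset of dirty buffers at a time could still fold its observations into one table-wide sketch without locking or ordering concerns.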

Not sure this makes sense at all, but if it does, it would deliver the
most up-to-date statistics you can think of.

Cheers,
Csaba.

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Csaba Nagy (#20)
#22Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tomas Vondra (#21)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jim Nasby (#22)
#24Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tomas Vondra (#23)
#25Csaba Nagy
ncslists@googlemail.com
In reply to: Tomas Vondra (#21)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Csaba Nagy (#25)
#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#1)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jim Nasby (#24)
#29Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tomas Vondra (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#29)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jim Nasby (#29)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#31)
#33Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#30)
#34Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#34)
#36Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#34)
#37Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#35)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#36)
#39Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#32)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#39)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#40)
#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#41)
#44Nathan Boley
npboley@gmail.com
In reply to: Tomas Vondra (#42)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Nathan Boley (#44)
#46Florian Pflug
fgp@phlo.org
In reply to: Nathan Boley (#44)
#47Nathan Boley
npboley@gmail.com
In reply to: Florian Pflug (#46)
#48Nathan Boley
npboley@gmail.com
In reply to: Tomas Vondra (#45)
#49Florian Pflug
fgp@phlo.org
In reply to: Nathan Boley (#47)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#42)
#51Robert Haas
robertmhaas@gmail.com
In reply to: Florian Pflug (#49)
#52Nathan Boley
npboley@gmail.com
In reply to: Florian Pflug (#49)
#53Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#50)
#54Csaba Nagy
ncslists@googlemail.com
In reply to: Tomas Vondra (#42)
#55Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Nathan Boley (#48)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#50)
#57Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#53)
#58Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Csaba Nagy (#54)