ANALYZE sampling is too good
At multiple conferences I've heard about people trying all sorts of
gymnastics to avoid ANALYZE, which they expect to take too long and
consume too much I/O. This is an especially big complaint after upgrades:
people run pg_upgrade to avoid an extended downtime, find that the new
database performs poorly until fresh statistics are in, and then complain
that ANALYZE takes hours.
I always gave the party line that ANALYZE only takes a small
constant-sized sample, so even very large tables should be very quick.
But after hearing the same story again at Heroku I looked into it a
bit further. I was kind of shocked by the numbers.
ANALYZE takes a sample of 300 * statistics_target rows. That sounds
pretty reasonable, but with default_statistics_target set to 100 that's
30,000 rows. If I'm reading the code right, it takes this sample by
sampling 30,000 blocks and then (if the table is large enough) taking
an average of one row per block. Each block is 8192 bytes, so that
means it's reading 240MB of each table. That's a lot more than I
realized.
It means if your table is anywhere up to 240MB you're effectively
doing a full table scan and then throwing out nearly all the data
read.
Worse, my experience with the posix_fadvise benchmarking is that on
spinning media reading one out of every 16 blocks takes about the same
time as reading them all. Presumably this is because the seek time
between tracks dominates and reading one out of every 16 blocks is
still reading every track. So in fact if your table is up to about
3-4G ANALYZE is still effectively going to do a full table scan, at
least as far as I/O time goes.
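To make the arithmetic concrete, here is a throwaway C calculation of the
numbers above (my own sketch of the figures quoted in this mail, not code
from analyze.c):

/* Back-of-the-envelope check of the figures above.  The constants mirror
 * default_statistics_target, BLCKSZ, and the 1-in-16 observation from the
 * posix_fadvise tests; this is not code from analyze.c. */
#include <stdio.h>

int main(void)
{
    int    stats_target = 100;                 /* default_statistics_target */
    long   sample_rows  = 300L * stats_target; /* 30,000 rows */
    long   blocks       = sample_rows;         /* ~1 row per block on big tables */
    double block_bytes  = 8192.0;              /* BLCKSZ */

    printf("rows sampled:   %ld\n", sample_rows);
    printf("blocks touched: %ld (about %.0f MB read)\n",
           blocks, blocks * block_bytes / (1024 * 1024));
    printf("spinning-media break-even table size: about %.1f GB\n",
           blocks * 16 * block_bytes / (1024 * 1024 * 1024));
    return 0;
}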
The current algorithm seems like it was designed with a 100G+ table in
mind, but the consequences for the more common 100M-100G tables weren't
really considered. Consider what this means for partitioned tables: if
someone partitions their terabyte table into 10 partitions, ANALYZE will
suddenly want to use 10x as much I/O, which seems like a perverse
consequence.
I'm not sure I have a prescription, but my general feeling is that
we're spending an awful lot of resources going after a statistically
valid sample when we could spend a lot less and get something 90% as
good. Or, if we're really going to read that much data, we might as
well use more of the rows we find.
--
greg
On 12/03/2013 03:30 PM, Greg Stark wrote:
It means if your table is anywhere up to 240MB you're effectively
doing a full table scan and then throwing out nearly all the data
read.
There are lots of issues with our random sampling approach for ANALYZE.
This is why, back in our Greenplum days, Simon proposed changing to a
block-based sampling approach, where we would sample random *pages*
instead of random *rows*. That would allow us to do things like sample
5% of the table, but only read 5% of the table, although we might have
to play some with OS-FS operations to make sure of that. In addition to
solving the issue you cite above, it would let us get MUCH more accurate
estimates for very large tables, where currently we sample only about
0.1% of the table.
There are fairly well researched algorithms for block-based sampling
which estimate for the skew introduced by looking at consecutive rows in
a block. In general, a minimum sample size of 5% is required, and the
error is no worse than our current system. However, the idea was shot
down at the time, partly because I think other hackers didn't get the math.
I believe that both Oracle and MSSQL use block-based sampling, but of
course, I don't know which specific algo they use.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Dec 3, 2013 at 8:30 PM, Greg Stark <stark@mit.edu> wrote:
Worse, my experience with the posix_fadvise benchmarking is that on
spinning media reading one out of every 16 blocks takes about the same
time as reading them all. Presumably this is because the seek time
between tracks dominates and reading one out of every 16 blocks is
still reading every track. So in fact if your table is up to about
3-4G ANALYZE is still effectively going to do a full table scan, at
least as far as I/O time goes.
Actually, it's rotational latency that's the dominant cost there.
On Thu, Dec 5, 2013 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
There are fairly well researched algorithms for block-based sampling
which estimate for the skew introduced by looking at consecutive rows in
a block. In general, a minimum sample size of 5% is required, and the
error is no worse than our current system. However, the idea was shot
down at the time, partly because I think other hackers didn't get the math.
I think that this certainly warrants revisiting. The benefits would be
considerable.
Has anyone ever thought about opportunistic ANALYZE piggy-backing on
other full-table scans? That doesn't really help Greg, because his
complaint is mostly that a fresh ANALYZE is too expensive, but it
could be an interesting, albeit risky approach.
Opportunistically/unpredictably acquiring a ShareUpdateExclusiveLock
would be kind of weird, for one thing, but if a full table scan really
is very expensive, would it be so unreasonable to attempt to amortize
that cost?
--
Peter Geoghegan
On Fri, Dec 6, 2013 at 7:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Dec 5, 2013 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
There are fairly well researched algorithms for block-based sampling
which estimate for the skew introduced by looking at consecutive rows in
a block. In general, a minimum sample size of 5% is required, and the
error is no worse than our current system. However, the idea was shot
down at the time, partly because I think other hackers didn't get the math.

I think that this certainly warrants revisiting. The benefits would be
considerable.

Has anyone ever thought about opportunistic ANALYZE piggy-backing on
other full-table scans? That doesn't really help Greg, because his
complaint is mostly that a fresh ANALYZE is too expensive, but it
could be an interesting, albeit risky approach.
Is only a fresh ANALYZE costly, or are subsequent ones equally costly?
Doing it in some background operation might not be a bad idea, but doing it
during backend query execution (a seq scan) could add overhead to query
response time, especially if part or most of the table's data is in RAM:
the overhead of the actual reads might not be very high, but the
computation for analyze (such as sorting) would make it costly.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
Has anyone ever thought about opportunistic ANALYZE piggy-backing on
other full-table scans? That doesn't really help Greg, because his
complaint is mostly that a fresh ANALYZE is too expensive, but it
could be an interesting, albeit risky approach.
What I've been thinking of is

a) making it piggyback on the scans vacuum is doing instead of doing
separate ones all the time (if possible, analyze needs to be more
frequent). Currently, with quite some likelihood, the cache will be gone
again by the time we revisit.

b) making analyze incremental. In lots of bigger tables most of the table
is static - and we actually *do* know that, thanks to the vm. So keep a
rawer form of what ends up in the catalogs around somewhere, chunked by the
region of the table the statistic is from. Every time a part of the table
changes, re-sample only that part. Then recompute the aggregate.
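To sketch what item b) could look like - a hypothetical per-region
partial-statistics struct plus a merge step; nothing like this exists in
PostgreSQL today:

/* Hypothetical per-region partial statistics and the aggregate recompute,
 * as a rough sketch of the incremental idea above -- none of these structs
 * or functions exist in PostgreSQL. */
#include <stdio.h>

typedef struct
{
    long   sample_rows;  /* rows sampled from this region of the table */
    long   null_rows;    /* of which the column was NULL */
    double sum_width;    /* total datum width, to derive avg_width */
} RegionStats;

/* recompute the table-level aggregate from the per-region pieces */
static void merge_regions(const RegionStats *r, int n,
                          double *null_frac, double *avg_width)
{
    long   rows = 0, nulls = 0;
    double width = 0.0;

    for (int i = 0; i < n; i++)
    {
        rows  += r[i].sample_rows;
        nulls += r[i].null_rows;
        width += r[i].sum_width;
    }
    *null_frac = rows ? (double) nulls / rows : 0.0;
    *avg_width = (rows - nulls) ? width / (rows - nulls) : 0.0;
}

int main(void)
{
    /* region 0 is static and was sampled long ago; region 1 just changed
     * and was re-sampled, so only its entry gets replaced */
    RegionStats regions[2] = {{20000, 200, 160000.0}, {5000, 500, 42000.0}};
    double null_frac, avg_width;

    merge_regions(regions, 2, &null_frac, &avg_width);
    printf("null_frac = %.4f, avg_width = %.1f\n", null_frac, avg_width);
    return 0;
}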
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
It looks like this is a fairly well understood problem, because in the
real world it's often also cheaper to speak to people in a small
geographic area or time interval. These Wikipedia pages sound
interesting and have some external references:
http://en.wikipedia.org/wiki/Cluster_sampling
http://en.wikipedia.org/wiki/Multistage_sampling
I suspect the hard part will be characterising the nature of the
non-uniformity in the sample generated by taking a whole block. Some
of it may come from how the rows were loaded (e.g. older rows were
loaded by pg_restore but newer rows were inserted retail) or from the
way Postgres works (e.g. hotter rows are on blocks with fewer rows in
them and colder rows are more densely packed).
I've felt for a long time that Postgres would make an excellent test
bed for some aspiring statistics research group.
--
greg
http://en.wikipedia.org/wiki/Cluster_sampling
http://en.wikipedia.org/wiki/Multistage_sampling

I suspect the hard part will be characterising the nature of the
non-uniformity in the sample generated by taking a whole block. Some
of it may come from how the rows were loaded (e.g. older rows were
loaded by pg_restore but newer rows were inserted retail) or from the
way Postgres works (e.g. hotter rows are on blocks with fewer rows in
them and colder rows are more densely packed).
I would have thought that as VACUUM reclaims space it levels that issue in
the long run and on average, so that it could be simply ignored?
I've felt for a long time that Postgres would make an excellent test
bed for some aspiring statistics research group.
I would say "applied statistics" rather than "research". Nevertheless I
can ask my research statistician colleagues next door about their opinion
on this sampling question.
--
Fabien.
On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark@mit.edu> wrote:
I always gave the party line that ANALYZE only takes a small
constant-sized sample, so even very large tables should be very quick.
But after hearing the same story again at Heroku I looked into it a
bit further. I was kind of shocked by the numbers.

ANALYZE takes a sample of 300 * statistics_target rows. That sounds
pretty reasonable, but with default_statistics_target set to 100 that's
30,000 rows. If I'm reading the code right, it takes this sample by
sampling 30,000 blocks and then (if the table is large enough) taking
an average of one row per block. Each block is 8192 bytes, so that
means it's reading 240MB of each table. That's a lot more than I
realized.
That is a lot. On the other hand, I question the subject line:
sometimes, our ANALYZE sampling is not good enough. Before we raised
the default statistics target from 10 to 100, complaints about bad
plan choices due to insufficiently-precise statistics were legion --
and we still have people periodically proposing to sample a fixed
percentage of the table instead of a fixed amount of the table, even
on large tables, which is going the opposite direction. I think this
is because they're getting really bad n_distinct estimates, and no
fixed-size sample can reliably give a good one.
More generally, I think the basic problem that people are trying to
solve by raising the statistics target is avoid index scans on
gigantic tables. Obviously, there are a lot of other situations where
inadequate statistics can cause problems, but that's a pretty
easy-to-understand one that we do not always get right. We know that
an equality search looking for some_column = 'some_constant', where
some_constant is an MCV, must be more selective than a search for the
least-frequent MCV. If you store more and more MCVs for a table,
eventually you'll have enough that the least-frequent one is pretty
infrequent, and then things work a lot better.
This is more of a problem for big tables than for small tables. MCV
#100 can't have a frequency of greater than 1/100 = 0.01, but that's a
lot more rows on a big table than on a small one. On a table with 10
million rows we might estimate something close to 100,000 rows when
the real number is just a handful; when the table has only 10,000
rows, we just can't be off by as many orders of magnitude. Things
don't always work out that badly, but in the worst case they do.
Maybe there's some highly-principled statistical approach which could
be taken here, and if so that's fine, but I suspect not. So what I
think we should do is auto-tune the statistics target based on the
table size. If, say, we think that the generally useful range for the
statistics target is something like 10 to 400, then let's come up with
a formula based on table size that outputs 10 for small tables, 400
for really big tables, and intermediate values for tables in the
middle.
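For illustration only, one possible shape for such a curve; the 10-to-400
range is from the paragraph above, while the log-based interpolation and
the breakpoints are just assumptions of mine:

/* Hypothetical auto-tuning curve: scale the statistics target with log10 of
 * the table's row count, clamped to [10, 400].  The constants are arbitrary
 * illustrations, not a concrete proposal. */
#include <math.h>
#include <stdio.h>

static int auto_stats_target(double reltuples)
{
    /* 10 at ~10^4 rows, 400 at ~10^9 rows, log-linear in between */
    double t = 10.0 + 390.0 * (log10(reltuples) - 4.0) / 5.0;

    if (t < 10.0)
        t = 10.0;
    if (t > 400.0)
        t = 400.0;
    return (int) t;
}

int main(void)
{
    double sizes[] = {1e3, 1e5, 1e7, 1e9, 1e11};

    for (int i = 0; i < 5; i++)
        printf("%.0e rows -> statistics target %d\n",
               sizes[i], auto_stats_target(sizes[i]));
    return 0;
}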
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/07/2013 11:46 AM, Robert Haas wrote:
Maybe there's some highly-principled statistical approach which could
be taken here, and if so that's fine, but I suspect not. So what I
think we should do is auto-tune the statistics target based on the
table size. If, say, we think that the generally useful range for the
statistics target is something like 10 to 400, then let's come up with
a formula based on table size that outputs 10 for small tables, 400
for really big tables, and intermediate values for tables in the
middle.
The only approach which makes sense is to base it on a % of the table.
In fact, pretty much every paper which has examined statistics
estimation for database tables has determined that any estimate based on
a less-than-5% sample is going to be wildly inaccurate. Not that 5%
samples are 100% accurate, but at least they fit the 80/20 rule.
This is the reason why implementing block-based sampling is critical;
using our current "take one row out of every page" method, sampling 5%
of the table means scanning the whole thing in most tables. We also
need to decouple the number of MCVs we keep from the sample size.
Certainly our existing sampling algo seems designed to maximize IO for
the sample size.
There are other qualitative improvements we could make, which Nathan Boley
has spoken on. For example, our stats code has no way to recognize a
normal or exponential distribution -- it assumes that all columns are
randomly distributed. If we could recognize common distribution
patterns, then not only could we have better query estimates, those
would require keeping *fewer* stats, since all you need for a normal
distribution are the end points and the variance.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
The only approach which makes sense is to base it on a % of the table.
In fact, pretty much every paper which has examined statistics
estimation for database tables has determined that any estimate based on
a less-than-5% sample is going to be wildly inaccurate. Not that 5%
samples are 100% accurate, but at least they fit the 80/20 rule.
This is nonsense. The math behind the deductions the planner makes is
well understood and we know how to estimate the precision based on the
sample size. There are some questions that really need a proportional
sample such as n_distinct (and as Robert points out the MCV list) but
the main motivation of the sample size is the histograms and those are
used to answer questions which very clearly do not need a proportional
sample. The statistics is very clear there. Using a proportional
sample for the histograms would just be silly. It would be
substituting a gut feeling for real statistics.
The problem with using a proportional sample for things like
n_distinct or the MCV list is that you're talking about sorting or
hashing an unboundedly large set of values and storing an unboundedly
large array in the statistics table. I don't think that would be
tenable without dramatically changing the way we process and store that
data to be more scalable. Using a lossy counting algorithm and
something more scalable than toasted arrays would be prerequisites, I
think.
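For what it's worth, here is a minimal standalone sketch of the
lossy-counting idea (in the style of Manku and Motwani) over a toy integer
stream. Nothing like this exists in analyze.c, and a real version would
need proper hashing and datum handling:

/* Toy lossy counting over a stream of int keys: keep (key, freq, delta)
 * entries, and at each bucket boundary drop entries whose counts can no
 * longer reach the reporting threshold.  Illustration only. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ENTRIES 1024

typedef struct { int key; long freq; long delta; int used; } Entry;

static Entry table[MAX_ENTRIES];

static Entry *find_or_free(int key)
{
    Entry *freeslot = NULL;

    for (int i = 0; i < MAX_ENTRIES; i++)
    {
        if (table[i].used && table[i].key == key)
            return &table[i];
        if (!table[i].used && freeslot == NULL)
            freeslot = &table[i];
    }
    return freeslot;            /* NULL if the table is full */
}

int main(void)
{
    double epsilon = 0.01;                   /* max error as a fraction of the stream */
    long   width   = (long) ceil(1.0 / epsilon);
    long   n = 0, bucket = 1;

    for (long i = 0; i < 200000; i++)
    {
        /* skewed toy stream: mostly values in [0,10), occasionally [0,100) */
        int    key = (rand() % 10 == 0) ? rand() % 100 : rand() % 10;
        Entry *e   = find_or_free(key);

        if (e && e->used)
            e->freq++;
        else if (e)
        {
            e->used = 1; e->key = key; e->freq = 1; e->delta = bucket - 1;
        }

        if (++n % width == 0)                /* end of bucket: prune */
        {
            for (int j = 0; j < MAX_ENTRIES; j++)
                if (table[j].used && table[j].freq + table[j].delta <= bucket)
                    table[j].used = 0;
            bucket++;
        }
    }

    /* report keys whose estimated frequency is at least 2% of the stream */
    for (int j = 0; j < MAX_ENTRIES; j++)
        if (table[j].used && table[j].freq >= 0.02 * n)
            printf("key %d: freq >= %ld (undercount at most %ld)\n",
                   table[j].key, table[j].freq, table[j].delta);
    return 0;
}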
And as Robert mentions even if we solved those problems it wouldn't
help n_distinct. You really need to read all the values to deal with
n_distinct.
This is the reason why implementing block-based sampling is critical;
using our current "take one row out of every page" method, sampling 5%
of the table means scanning the whole thing in most tables. We also
need to decouple the number of MCVs we keep from the sample size.
Certainly our existing sampling algo seems designed to maximize IO for
the sample size.
It would be helpful if you could specify what you mean by
"block-based sampling". The reason these previous pleas didn't go
anywhere is not because we can't do math. It's because of the lack of
specifics here. What we do *today* is called "block-based sampling" by
the literature.
What I'm saying is we need to change the "block-based sampling" that
we do to extract more rows per block. We currently extract the
"correct" number of rows to get a strong guarantee of uniformity. If
we could get a weaker guarantee of being "close enough" to uniform
samples for the deductions we want to make and get many more rows per
block then we could read a lot fewer blocks.
Or to put it another way people could increase the statistics target
dramatically and still be reading the same number of blocks as today.
In an ideal world perhaps we could have people reading 1% of the
blocks they read today and get statistics targets 10x better than
today.
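Rough numbers for that trade-off (just the obvious division, assuming the
blocks-to-read count is sample_rows / rows_used_per_block):

/* blocks read ~= (300 * statistics_target) / rows_used_per_block, so using
 * more rows per block lets the target grow without reading more blocks. */
#include <stdio.h>

int main(void)
{
    int targets[]        = {100, 400, 1000};
    int rows_per_block[] = {1, 4, 16, 64};

    for (int t = 0; t < 3; t++)
        for (int r = 0; r < 4; r++)
            printf("target %4d, %2d rows/block -> ~%d blocks read\n",
                   targets[t], rows_per_block[r],
                   300 * targets[t] / rows_per_block[r]);
    return 0;
}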
--
greg
On 08/12/13 09:25, Josh Berkus wrote:
The only approach which makes sense is to base it on a % of the table.
In fact, pretty much every paper which has examined statistics
estimation for database tables has determined that any estimate based on
a less-than-5% sample is going to be wildly inaccurate. Not that 5%
samples are 100% accurate, but at least they fit the 80/20 rule.

This is the reason why implementing block-based sampling is critical;
using our current "take one row out of every page" method, sampling 5%
of the table means scanning the whole thing in most tables. We also
need to decouple the number of MCVs we keep from the sample size.
Certainly our existing sampling algo seems designed to maximize IO for
the sample size.

There are other qualitative improvements we could make, which Nathan Boley
has spoken on. For example, our stats code has no way to recognize a
normal or exponential distribution -- it assumes that all columns are
randomly distributed. If we could recognize common distribution
patterns, then not only could we have better query estimates, those
would require keeping *fewer* stats, since all you need for a normal
distribution are the end points and the variance.
From src/backend/commands/analyze.c
* As of May 2004 we use a new two-stage method: Stage one selects up
* to targrows random blocks (or all blocks, if there aren't so many).
* Stage two scans these blocks and uses the Vitter algorithm to create
* a random sample of targrows rows (or less, if there are less in the
* sample of blocks). The two stages are executed simultaneously: each
* block is processed as soon as stage one returns its number and while
* the rows are read stage two controls which ones are to be inserted
* into the sample.
I don't think we always read every block (been a while since I looked at
this code, so I'll recheck). From what I understand this stuff is based
on reasonable research (Vitter algorithm). Not to say it
couldn't/shouldn't be looked at again to improve it - but it is not just
dreamed up with no valid research!
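As a toy illustration of the row-level stage, here is the simple
"Algorithm R" relative of the Vitter method that comment refers to (a
sketch of mine, not the actual analyze.c implementation, which uses
Vitter's more efficient skip-ahead variant):

/* Keep a fixed-size uniform sample while streaming rows out of the chosen
 * blocks: fill the reservoir, then replace entries with decreasing
 * probability.  The slight modulo bias of random() is ignored here. */
#include <stdio.h>
#include <stdlib.h>

#define TARGROWS 30000

int main(void)
{
    static long reservoir[TARGROWS];
    long rows_seen = 0;

    /* pretend these row numbers are streaming out of the sampled blocks */
    for (long row = 0; row < 10000000; row++)
    {
        if (rows_seen < TARGROWS)
            reservoir[rows_seen] = row;           /* fill the reservoir first */
        else
        {
            long k = random() % (rows_seen + 1);  /* keep row with prob TARGROWS/(rows_seen+1) */
            if (k < TARGROWS)
                reservoir[k] = row;
        }
        rows_seen++;
    }
    printf("kept %d of %ld rows\n", TARGROWS, rows_seen);
    return 0;
}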
Cheers
Mark
On 08/12/13 12:27, Mark Kirkwood wrote:
I don't think we always read every block (been a while since I looked at
this code, so I'll recheck). From what I understand this stuff is based
on reasonable research (Vitter algorithm). Not to say it
couldn't/shouldn't be looked at again to improve it - but it is not just
dreamed up with no valid research!
Since I volunteered to recheck :-)
from analyze.c again:
/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
* bin size error with probability 0.99. So there's no real need to
* scale for n, which is a good thing because we don't necessarily
* know it at this point.
*--------------------
*/
stats->minrows = 300 * attr->attstattarget;
So for typical settings (default_statistics_target = 100), we try to get
30000 rows. This means we will sample about 30000 blocks.
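As a quick cross-check of the numbers in that comment (a throwaway
calculation of mine, not part of the patch below):

/* Check the Chaudhuri et al. bound quoted above: r = 4*k*ln(2n/gamma)/f^2,
 * and the claim that a 300*k sample keeps the bin-size error around 0.66
 * even at n = 10^12. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double gamma = 0.01, f = 0.5;

    printf("r/k at n = 10^6:  %.2f\n",
           4.0 * log(2.0 * 1e6 / gamma) / (f * f));
    printf("f at n = 10^12 with r = 300*k: %.2f\n",
           sqrt(4.0 * log(2.0 * 1e12 / gamma) / 300.0));
    return 0;
}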
Indeed quickly checking with a scale 100 pgbench db and a simple patch
to make the block sampler say how many blocks it reads (note
pgbench_accounts has 163935 blocks):
bench=# ANALYZE pgbench_branches;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 1 blocks
ANALYZE
Time: 16.935 ms
bench=# ANALYZE pgbench_accounts;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 30000 blocks
ANALYZE
Time: 10059.446 ms
bench=# \q
Regards
Mark
On 08/12/13 10:27, Greg Stark wrote:
Or to put it another way people could increase the statistics target
dramatically and still be reading the same number of blocks as today.
In an ideal world perhaps we could have people reading 1% of the
blocks they read today and get statistics targets 10x better than
today.
I suspect that the number of rows to sample should be something of the form:
    n = N                     where N <= a * 100
    n = (a * N) / 100         where a * 100 < N <= 10000
    n = (a * SQRT(N)) / 100   where N > 10000

n: the number of rows to sample
N: total number of rows
a: configurable coefficient (1 <= a <= 100)
For very large numbers of rows, like 10^8, taking a constant fraction
will over sample for most purposes, hence the square root.
      a        N        n   sampled%
      1    10000      100         1%
    100    10000    10000       100%
      1     10^8    10000      0.01%
    100     10^8     10^6         1%
      1    10^10     10^5     0.001%
    100    10^10     10^7      0.01%
For very large values of N, you can still get a representative
sample with a very small fraction of rows sampled.

Hmm... Yes, I can see some issues, once I tried out some test values!
However, it might inspire a more useful and practical approach!
Cheers,
Gavin
On Sun, Dec 8, 2013 at 12:06 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
bench=# ANALYZE pgbench_accounts;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 30000 blocks
ANALYZE
Time: 10059.446 ms
bench=# \q
I did some experimenting here as well.
I hacked up a version of analyze.c that has a guc for rows_per_block
to sample. Setting it to 1 doesn't modify the behaviour at all whereas
setting it to 4 divides the number of blocks to sample by 4 which
causes it to do less I/O and use more rows from each block.
I then initialized pgbench with scale factor 100 but modified the code
to run the actual pgbench with scale factor 1. In other words I ran a
lot of updates on 1% of the database but left the other 99% untouched
from the initial load.
Then I ran "ANALYZE VERBOSE accounts" with rows_per_block set to 1, 4,
16, and 64. The latter is slightly larger than the average number of
tuples per block so the resulting sample is actually slightly short.
The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):
stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 30000 of 158756 pages, containing
1889701 live rows and 0 dead rows; 30000 rows in sample, 10000036
estimated total rows
ANALYZE
Time: 19468.987 ms
With rows_per_block=4 it reads only a quarter as many blocks but it's
not much faster:
stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 7501 of 158756 pages, containing
472489 live rows and 0 dead rows; 30000 rows in sample, 10000037
estimated total rows
ANALYZE
Time: 17062.331 ms
But with rows_per_block=16 it's much faster, 6.7s
stark=# set statistics_rows_per_block = 16;
SET
Time: 1.583 ms
stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 1876 of 158756 pages, containing
118163 live rows and 0 dead rows; 30000 rows in sample, 10000031
estimated total rows
ANALYZE
Time: 6694.331 ms
And with rows_per_block=64 it's under 2s:
stark=# set statistics_rows_per_block = 64;
SET
Time: 0.693 ms
stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 469 of 158756 pages, containing
29544 live rows and 0 dead rows; 29544 rows in sample, 10000033
estimated total rows
ANALYZE
Time: 1937.055 ms
The estimate for the total rows is just as accurate in every case.
(It seems to be consistently slightly high though, which is a bit
disconcerting.)
However looking at the actual pg_stats entries the stats are
noticeably less accurate for the "blockier" samples. The "bid" column
actually has 100 distinct values and so with a statistics_target of
100 each value should appear in the MCV list with a frequency of about
.01.
With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
I'm not really sure if this is due to the blocky sample combined with
the skewed pgbench run or not. It doesn't seem to be consistently
biasing towards or against bid 1 which I believe are the only rows
that would have been touched by pgbench. Still it's suspicious that
they seem to be consistently getting less accurate as the blockiness
increases.
I've attached the results of pg_stats following the analyze with the
various levels of "blockiness".
--
greg
Attachment: pgbench_stats.txt (text/plain)
On 12/08/2013 10:14 AM, Greg Stark wrote:
With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213

I'm not really sure if this is due to the blocky sample combined with
the skewed pgbench run or not. It doesn't seem to be consistently
biasing towards or against bid 1 which I believe are the only rows
that would have been touched by pgbench. Still it's suspicious that
they seem to be consistently getting less accurate as the blockiness
increases.
They will certainly do so if you don't apply any statistical adjustments
for selecting more rows from the same pages.
So there's a set of math designed to calculate for the skew introduced
by reading *all* of the rows in each block. That's what I meant by
"block-based sampling"; you read, say, 400 pages, you compile statistics
on *all* of the rows on those pages, you apply some algorithms to adjust
for groupings of rows based on how grouped they are. And you have a
pretty good estimate of how grouped they are, because you just looked at
complete sets of rows on a bunch of nonadjacent pages.
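To be concrete about the page-selection half of that (the skew-adjustment
math is the hard part and isn't shown), a sketch might look like the
following; the caller would then read every row on the chosen pages:

/* Sketch of the page-selection half of the idea: pick a fixed number of
 * distinct random pages and let the caller read every row on them. */
#include <stdio.h>
#include <stdlib.h>

static void pick_random_pages(long npages, int npick, long *pages)
{
    for (int i = 0; i < npick; i++)
    {
        long candidate;
        int  dup;

        do                                   /* reject duplicates; fine for small npick */
        {
            candidate = random() % npages;
            dup = 0;
            for (int j = 0; j < i; j++)
                if (pages[j] == candidate)
                    dup = 1;
        } while (dup);
        pages[i] = candidate;
    }
}

int main(void)
{
    long pages[400];

    pick_random_pages(158756, 400, pages);   /* table size from the pgbench run above */
    for (int i = 0; i < 10; i++)
        printf("will read all rows on page %ld\n", pages[i]);
    return 0;
}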
Obviously, you need to look at more rows than you would with a
pure-random sample. Like I said, the 80%+ accurate point in the papers
seemed to be at a 5% sample. However, since those rows come from the
same pages, the cost of looking at more rows is quite small, compared to
the cost of looking at 64 times as many disk pages.
My ACM subscription has lapsed, though; someone with a current ACM
subscription could search for this; there are several published papers,
with math and pseudocode.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 12/08/2013 08:14 PM, Greg Stark wrote:
The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):
One simple thing we could do, without or in addition to changing the
algorithm, is to issue posix_fadvise() calls for the blocks we're going
to read. It should at least be possible to match the speed of a plain
sequential scan that way.
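A minimal sketch of that, assuming we already have the chosen block numbers
and a raw file descriptor (in reality this would go through PostgreSQL's
smgr/fd layer rather than a bare open()):

/* Prefetch the sample's blocks with posix_fadvise() before reading them.
 * File handling is simplified for the sketch. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>

#define BLCKSZ 8192

static void prefetch_sample_blocks(int fd, const long *blocks, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
        (void) posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
}

int main(int argc, char **argv)
{
    long blocks[] = {7, 1900, 35002, 90210, 158000};   /* made-up sample */
    int  fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
    {
        perror("open");
        return 1;
    }
    prefetch_sample_blocks(fd, blocks, 5);
    /* ...then read and process the blocks as usual... */
    return 0;
}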
- Heikki
On Sun, Dec 8, 2013 at 7:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
They will certainly do so if you don't apply any statistical adjustments
for selecting more rows from the same pages.

So there's a set of math designed to calculate for the skew introduced
by reading *all* of the rows in each block.
I just think this is an oversimplification. There's no skew introduced
just by reading all the rows in each block unless there's some kind of
dependency between the block a row is placed on and the data in it. So
I don't believe there can be some single set of math that
removes any skew automatically. The math will depend on
what the dependency is.
Just to be clear, you have to think pretty hard about the way Postgres
internals work to see what kinds of skew might be appearing here. Due
to the way Postgres updates work and HOT cleanups work "hot" tuples
will be weighted less than "cold" tuples. That's not going to be
something someone in ACM knew to design into their maths.
I do have access to ACM or other academic articles if you remember any
author names or any keywords but if it's a database journal I would
worry about patent issues. Do you remember if it was over 17 years
old?
Obviously, you need to look at more rows than you would with a
pure-random sample. Like I said, the 80%+ accurate point in the papers
seemed to be at a 5% sample.
I really don't believe the 5% thing. It's not enough for n_distinct
and it's *far* too high a value for linear properties like histograms
or nullfrac etc. From a computational point of view it's too high to be
worth bothering with. If we have to read 5% of the table we might as well
do a full scan anyway; it'll be marginally slower but give much better
quality results.
--
greg
On 09/12/13 08:03, Josh Berkus wrote:
So there's a set of math designed to calculate for the skew introduced
by reading *all* of the rows in each block. That's what I meant by
"block-based sampling"; you read, say, 400 pages, you compile statistics
on *all* of the rows on those pages, you apply some algorithms to adjust
for groupings of rows based on how grouped they are. And you have a
pretty good estimate of how grouped they are, because you just looked at
complete sets of rows on a bunch of nonadjacent pages.

Obviously, you need to look at more rows than you would with a
pure-random sample. Like I said, the 80%+ accurate point in the papers
seemed to be at a 5% sample. However, since those rows come from the
same pages, the cost of looking at more rows is quite small, compared to
the cost of looking at 64 times as many disk pages.

My ACM subscription has lapsed, though; someone with a current ACM
subscription could search for this; there are several published papers,
with math and pseudocode.
This makes more sense! Unfortunately you had confused everyone (well me
at least) by decreeing that we needed block based sampling - when we
started using that in 2004 - there is more than one way to sample blocks
it seems :-)
What you are actually suggesting is a way to do block based sampling
that requires reading fewer blocks, and that is a great idea! Of course
as Greg has just suggested - the details are gonna be the clincher. Can
you supply authors or titles of papers?
Also, with reference to Greenplum - their use of postgres is a fairly
special case, and an approach that works well for them is not always
suitable for more general purpose use.
Regards
Mark
Greg,
I really don't believe the 5% thing. It's not enough for n_distinct
and it's *far* too high a value for linear properties like histograms
or nullfrac etc.
Actually, it is enough for n_distinct, or more properly, 5% is as good
as you can get for n_distinct unless you're going to jump to scanning
50% or more.
It's also applicable for the other stats; histogram buckets constructed
from a 5% sample are more likely to be accurate than those constructed
from a 0.1% sample. Same with nullfrac. The degree of improved
accuracy would, of course, require some math to determine.
From a computational point of view it's too high to be
worth bothering with. If we have to read 5% of the table we might as well
do a full scan anyway; it'll be marginally slower but give much better
quality results.
Reading 5% of a 200GB table is going to be considerably faster than
reading the whole thing, if that 5% is being scanned in a way that the
FS understands.
Also, we can optimize this significantly by using the VM, as Robert (I
think) suggested.
In the advanced approaches section, there's also the idea of collecting
analyze data from table pages while they're in memory anyway for other
reasons.
You do seem kind of hostile to the idea of full-page-sampling, going
pretty far beyond the "I'd need to see the math". Why?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com