Query optimizer 8.0.1 (and 8.0)
Here's one:
I have the USA census TIGER database loaded, the WHOLE THING, the whole
country. It isn't the biggest database, but it is about 40G before
indexes. Every table is over a few million rows. I can quite honestly say,
a sequential scan is almost never the right thing to do.
Info below.
Here is the query:
select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or
zipl=2186);
Well, when you look at the explain, obviously it looks like the sequential
scan is better, but trust me, it takes a good few minutes for the query to
respond with sequential scan enabled. I've run analyze and everything.
I suspect that analyze only samples a very small amount of the database
and gets the wrong idea about it. Is there a way to force analyze to
sample more rows?
tiger=# analyze verbose rt2 ;
INFO: analyzing "public.rt2"
INFO: "rt2": scanned 3000 of 1139825 pages, containing 84160 live rows
and 0 dead rows; 3000 rows in sample, 31975891 estimated total rows
ANALYZE
tiger=# analyze verbose rt1 ;
INFO: analyzing "public.rt1"
INFO: "rt1": scanned 3000 of 1527360 pages, containing 90951 live rows
and 0 dead rows; 3000 rows in sample, 46304973 estimated total rows
ANALYZE
tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr
= 2186 or zipl=2186);
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Hash Join (cost=121978.81..3996190.89 rows=21118 width=520)
Hash Cond: ("outer".tlid = "inner".tlid)
-> Seq Scan on rt2 (cost=0.00..1459291.36 rows=31946636 width=218)
-> Hash (cost=120662.36..120662.36 rows=30579 width=302)
-> Index Scan using rt1_zipr, rt1_zipl on rt1
(cost=0.00..120662.36 rows=30579 width=302)
Index Cond: ((zipr = 2186) OR (zipl = 2186))
(6 rows)
tiger=# set enable_seqscan=no;
SET
tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr
= 2186 or zipl=2186);
QUERY PLAN
-----------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..46868256.00 rows=21118 width=520)
-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..120662.36
rows=30579 width=302)
Index Cond: ((zipr = 2186) OR (zipl = 2186))
-> Index Scan using rt2_tlid on rt2 (cost=0.00..1523.80 rows=396
width=218)
Index Cond: ("outer".tlid = rt2.tlid)
Table "public.rt1"
Column | Type | Modifiers
-----------+-------------------+-----------
tlid | integer |
side1 | character varying |
source | character varying |
fedirp | character varying |
fename | character varying |
fetype | character varying |
fedirs | character varying |
cfcc | character varying |
fraddl | character varying |
toaddl | character varying |
fraddr | character varying |
toaddr | character varying |
friaddl | character varying |
toiaddl | character varying |
friaddr | character varying |
toiaddr | character varying |
zipl | integer |
zipr | integer |
aianhhfpl | character varying |
aianhhfpr | character varying |
aihhtlil | character varying |
aihhtlir | character varying |
statel | character varying |
stater | character varying |
countyl | character varying |
countyr | character varying |
cousubl | character varying |
cousubr | character varying |
submcdl | character varying |
submcdr | character varying |
placel | character varying |
placer | character varying |
tractl | character varying |
tractr | character varying |
blockl | character varying |
blockr | character varying |
frlong | numeric |
frlat | numeric |
tolong | numeric |
tolat | numeric |
Indexes:
"rt1_fename" btree (fename)
"rt1_frlat" btree (frlat)
"rt1_frlong" btree (frlong)
"rt1_tlid" btree (tlid)
"rt1_tolat" btree (tolat)
"rt1_tolong" btree (tolong)
"rt1_zipl" btree (zipl)
"rt1_zipr" btree (zipr)
Table "public.rt2"
Column | Type | Modifiers
--------+---------+-----------
tlid | integer |
rtsq | integer |
long1 | numeric |
lat1 | numeric |
long2 | numeric |
lat2 | numeric |
long3 | numeric |
lat3 | numeric |
long4 | numeric |
lat4 | numeric |
long5 | numeric |
lat5 | numeric |
long6 | numeric |
lat6 | numeric |
long7 | numeric |
lat7 | numeric |
long8 | numeric |
lat8 | numeric |
long9 | numeric |
lat9 | numeric |
long10 | numeric |
lat10 | numeric |
Indexes:
"rt2_tlid" btree (tlid)
pgsql@mohawksoft.com writes:
I suspect that analyze only samples a very small amount of the database
and gets the wrong idea about it. Is there a way to force analyze to
sample more rows?
default_statistics_target. But let's see the pg_stats rows for these
columns before assuming that analyze is getting it wrong.
regards, tom lane
* pgsql@mohawksoft.com (pgsql@mohawksoft.com) wrote:
I have the USA census TIGER database loaded, the WHOLE THING, the whole
country. It isn't the biggest database, but it is about 40G before
indexes. Every table is over a few million rows. I can quite honestly say,
a sequential scan is almost never the right thing to do.
Cool, I've got it loaded here too w/ PostGIS. It's good stuff. :) I'm
curious what all you're doing with it and especially if you're using
mapserver on it or have found a way to identify roads at a more
street/city/state/country level, I've had problems doing that.
I suspect that analyze only samples a very small amount of the database
and gets the wrong idea about it. Is there a way to force analyze to
sample more rows?
Yup:
alter table <blah> alter column <blah> set statistics <blah>
As I recall, the default is 100 for statistics, I'd say increase that
number to like 200 or 300 or more and see if it helps.
Stephen
pgsql@mohawksoft.com writes:
I suspect that analyze only samples a very small amount of the database
and gets the wrong idea about it. Is there a way to force analyze to
sample more rows?
default_statistics_target. But let's see the pg_stats rows for these
columns before assuming that analyze is getting it wrong.
Some more info:
I did a select count(distinct(tlid)) from rt2, and updated the statistics
with the result:
tiger=# update pg_statistic set stadistinct = 23656799 where starelid =
17236 and staattnum = 1;
UPDATE 1
tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr
tiger(# = 2186 or zipl=2186);
QUERY PLAN
-----------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..425517.95 rows=21315 width=520)
-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
rows=30835 width=302)
Index Cond: ((zipr = 2186) OR (zipl = 2186))
-> Index Scan using rt2_tlid on rt2 (cost=0.00..9.82 rows=2 width=218)
Index Cond: ("outer".tlid = rt2.tlid)
(5 rows)
tiger=# SELECT attname, n_distinct, most_common_vals FROM pg_stats
WHERE tablename = 'rt1';
attname | n_distinct |
most_common_vals
-----------+------------+-----------------------------------------------------------------------------------------------------------------------
tlid | -1 |
side1 | 1 | {1}
source | 9 | {B,J,A,K,L,N,O,M,C}
fedirp | 8 | {N,E,S,W,SW,NE,NW,SE}
fename | 2590 | {Main,1st,Oak,2nd,9th,"Burlington Northern Santa
Fe R",Park,4th,11th,8th}
fetype | 26 | {Rd,St,Ave,Dr}
fedirs | 9 | {NE,SE,N,E,NW,SW,W,S,O}
cfcc | 51 | {A41,H12,H11,F10,A74,A31,H01}
fraddl | 767 | {1,101,2,201,301,401,100,701,298,500}
toaddl | 805 | {199,399,299,499,98,198,99,1,100,2}
fraddr | 765 | {2,1,100,200,400,300,700,101,501,299}
toaddr | 772 | {198,398,298,498,99,98,199,101,1,1098}
friaddl | 3 | {0,1,2}
toiaddl | 3 | {0,1,2}
friaddr | 3 | {0}
toiaddr | 3 | {0}
zipl | 925 |
zipr | 899 |
aianhhfpl | 42 |
aianhhfpr | 43 |
aihhtlil | 2 |
aihhtlir | 2 |
statel | 55 | {48,06,12,37,29,42,17,36,13,39}
stater | 55 | {48,06,12,37,29,42,17,36,13,39}
countyl | 189 | {005,059,013,029,003,001,017,009,031,043}
countyr | 191 | {005,059,013,001,003,029,017,025,031,043}
cousubl | 2568 |
{92601,91800,90270,90240,90468,90572,91508,91750,60000,90324}
cousubr | 2598 |
{92601,91800,90270,90240,90468,90572,91248,91750,60000,90324}
submcdl | -1 |
submcdr | -1 |
placel | 778 |
{51000,65000,44000,12000,38000,60000,63460,07000,04000,22000}
placer | 787 |
{51000,65000,12000,44000,60000,07000,38000,55000,63460,04000}
tractl | 1370 |
{950200,950100,950300,000100,970100,960100,980100,950700,970300,990100}
tractr | 1354 |
{950200,950100,950300,000100,970100,960100,990100,950700,980100,970300}
blockl | 1050 | {1000,2000,1001,1003,1005,2001,1009,2006,1002,1004}
blockr | 1055 | {1000,2000,1001,1002,1003,1005,1004,1009,2004,2002}
frlong | 134476 |
{-120.214657,-113.074100,-106.494480,-103.306945,-100.184470,-100.083614,-99.476994,-98.420248,-97.325498,-93.349865}
frlat | 143222 |
{27.759896,29.251454,29.898585,30.093247,31.814071,31.950913,32.055726,32.377503,32.523607,32.607387}
tolong | 317744 |
{-123.330861,-111.673035,-107.596898,-103.164000,-100.945693,-100.080307,-99.576886,-99.492719,-97.743722,-93.870222}
tolat | 278079 |
{27.493816,27.904316,29.691644,32.731410,33.350429,34.490563,35.551053,35.868297,39.139185,40.068098}
(40 rows)
pgsql@mohawksoft.com writes:
-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
rows=30835 width=302)
Index Cond: ((zipr = 2186) OR (zipl = 2186))
zipl | 925 |
zipr | 899 |
Those n_distinct values for zipl and zipr seem aberrant --- too low
compared to the estimated rowcount you're showing. What are the
true figures? Also, how about some EXPLAIN ANALYZEs and not just
EXPLAINs?
regards, tom lane
On Fri, Feb 04, 2005 at 05:23:27PM -0500, Stephen Frost wrote:
As I recall, the default is 100 for statistics, I'd say increase that
number to like 200 or 300 or more and see if it helps.
No, it's 10.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
pgsql@mohawksoft.com writes:
-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
rows=30835 width=302)
Index Cond: ((zipr = 2186) OR (zipl = 2186))
zipl | 925 |
zipr | 899 |
Those n_distinct values for zipl and zipr seem aberrant --- too low
compared to the estimated rowcount you're showing. What are the
true figures? Also, how about some EXPLAIN ANALYZEs and not just
EXPLAINs?
tiger=# select count(distinct(zipr)), count(distinct(zipl)) from rt1 ;
count | count
-------+-------
35133 | 35393
(1 row)
Short summary:
I had the same problem - since the sort order of zip-codes,
counties, city names, and states don't match, the optimizer
grossly overestimated the number of pages that would be read.
I bet doing a CLUSTER by ZIP would solve that particular
query, but would break similar queries by county or by
city or by state.
I think
select attname,correlation from pg_stats where tablename = 'rt1';
will show you the same problem I had. My pg_stats
numbers are shown below.
I had a couple ugly hacks that worked around the problem
for myself.
Longer version:
One interesting property of the TIGER data (and most geospatial
databases) is that the data for most of the columns is
"locally correlated" but not globally. What I mean by that
is that even though any given zip-code's data probably
resides in only a few disk pages, the data was probably loaded
in order of TIGER file number (state, county), so the "correlation"
in pg_stats for zip-code was very low. With the low correlation
the optimizer couldn't see that any given zip code's
data was all together on disk, and so it probably
vastly overestimated the number of pages that would be read.
Let me try a concrete example:
Zip | City | State | County | Street
------+---------------+----------+---------------+-------------
99501 | Anchorage | AK | Anchorage | 1st st.
[...]
94105 | San Francisco | CA | San Francisco | 1st St
94105 | San Francisco | CA | San Francisco | 2nd St
[... a few more disk pages for 94105 ...]
[... tens more disk pages for San Francisco ...]
[... thousands more disk pages for CA ...]
94501 | Alameda | CA | Alameda | 1st St
94501 | Alameda | CA | Alameda | 2nd St
[...]
02101 | Boston | MA | Suffolk | 1st St.
Note, that all the data for any geographical region (zip,
or city, or county) is located close together on disk.
If I do a query by City, or by State, or by Zip, I should
probably do an index scan.
But the correlation statistic only looks at the total
ordering: if we order the table by one column, that
column's statistic will look very good; but since
the other columns have a different sort order, their
correlation statistics will be very poor.
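This effect is easy to demonstrate numerically. Below is a toy illustration in plain Python (made-up data, and nothing taken from analyze.c — the planner's statistic is computed differently, but it measures roughly this): rows perfectly clustered by zip on disk still show near-zero linear correlation between row position and value, because the clusters themselves aren't in sorted order.

```python
# Toy illustration: zips clustered on disk but loaded in
# state/county order rather than zip order, like the TIGER data.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Three zip "clusters", each stored contiguously, but unsorted overall.
zips = [94105] * 100 + [2186] * 100 + [99501] * 100
positions = list(range(len(zips)))

r = pearson(positions, zips)
print(round(r, 3))  # near zero, despite perfect local clustering
```

Sorting the same values first pushes the correlation above 0.8, which is the only case the current statistic rewards.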
In my copy of the TIGER data (loaded in order of
TIGER-file-number (which is ordered by state, and then
by county)), you can see the various relevant correlation
values in my database.
fl=# select attname,correlation
from pg_stats
where tablename = 'tgr_rt1';
attname | correlation
-----------+--------------
tigerfile | 1
rt | 1
version | 0.968349
tlid | 0.151139
[...]
zipl | 0.381979
zipr | 0.376332
[...]
statel | 0.998373
stater | 0.99855
countyl | 0.111207
countyr | 0.115345
cousubl | -0.0450375
cousubr | -0.0486589
[...]
placel | 0.304117
placer | 0.306714
tractl | 0.141538
tractr | 0.134357
blockl | 0.00599286
blockr | -8.73298e-05
frlong | 0.0857337
frlat | 0.174396
tolong | 0.0857399
tolat | 0.174434
(45 rows)
Note that even though the TIGER data is sorted by
State/County, "countyl" and "countyr" have some of the
worst correlations (in the stats table) because of
the way the FIPS codes work: every state re-cycles
the county codes, starting with 1 and going up.
STATE FIPS CODE | County FIPS code
------------------+-----------------
06 (california) | 001 (alameda)
06 (california) | 003 (alpine)
...
25 (massachusetts) | 001 (Barnstable)
I have a hack for 7.4 that sets the numbers hidden
in pg_statistic used for correlation to 0.75 for these
columns; and the planner started making more reasonable
guesses.
In the end, though, I just resorted to uglier hacks
that make the planner favor index scans like setting
random_page_cost artificially low.
What I think I'd like to see is for there to be
another statistic similar to "correlation" but rather
than looking at the total-ordering of the table, to
look how correlated values within any single page are.
If someone pointed me in the right direction, I might
try doing this.
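One possible shape for that statistic — purely a sketch of the idea in Python, with made-up names and data, nothing that exists in analyze.c — is to compare the value spread within each page-sized run of rows against the global spread:

```python
# Sketch of a hypothetical "local correlation" statistic: instead of
# correlating value against row position over the whole table, ask how
# tightly the values within each disk-page-sized run of rows cluster.

def local_clustering(values, rows_per_page=100):
    """Average per-page value spread divided by the global spread.
    0.0 = each page holds one value; 1.0 = pages span the whole range."""
    global_spread = max(values) - min(values)
    if global_spread == 0:
        return 0.0
    spreads = []
    for i in range(0, len(values), rows_per_page):
        page = values[i:i + rows_per_page]
        spreads.append(max(page) - min(page))
    return sum(spreads) / len(spreads) / global_spread

# Zips clustered on disk but globally unsorted: locally tight.
clustered = [94105] * 100 + [2186] * 100 + [99501] * 100
# Same zips interleaved row-by-row: every page spans the whole range.
interleaved = [94105, 2186, 99501] * 100

print(local_clustering(clustered))    # 0.0 -> index scan touches few pages
print(local_clustering(interleaved))  # 1.0 -> index scan really is expensive
```

A locally clustered table scores near 0 even when its global correlation is poor, which is exactly the case the planner currently misprices.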
Ron
PS:
I think lots of other data has the same issues.
A very large name database ordered by
"lastname,firstname" will have all people
of a given "firstname" on a relatively small
set of pages, but the current correlation value
wouldn't see that.
Firstname | Lastname | Middlename
Adam | Brown | Albert
Adam | Brown | Alex
Adam | Brown | Bob
Bill | Brown | ....
...
Adam | Smith | Albert
Adam | Smith | Alex
Adam | Smith | Bob
...
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
What I think I'd like to see is for there to be
another statistic similar to "correlation" but rather
than looking at the total-ordering of the table, to
look how correlated values within any single page are.
Our current approach to correlation is surely far too simplistic; it was
just a quick hack to get *something* in there for this consideration.
I don't personally know enough about statistics to do much better :-(
regards, tom lane
pgsql@mohawksoft.com writes:
One of the things that is disturbing to me about the analyze settings is
that it wants to sample the same number of records from a table regardless
of the size of that table.
The papers that I looked at say that this rule has a good solid
statistical foundation, at least for the case of estimating histograms.
See the references cited in analyze.c.
regards, tom lane
pgsql@mohawksoft.com writes:
One of the things that is disturbing to me about the analyze settings is
that it wants to sample the same number of records from a table
regardless of the size of that table.
The papers that I looked at say that this rule has a good solid
statistical foundation, at least for the case of estimating histograms.
See the references cited in analyze.c.
Any and all random sampling assumes a degree of uniform distribution. This
is the basis of the model. It assumes that chunks of the whole will be
representative of the whole (to some degree). This works when normal
variations are more or less distributed uniformly. As variations and
trends become less uniformly distributed, more samples are required to
characterize them.
Douglas Adams had a great device called the "Total Perspective Vortex"
which inferred the whole of the universe from a piece of fairy cake. It was
a subtle play on the absurd notion that a very small sample could lead to
an understanding of an infinitely larger whole.
On a very basic level, why bother sampling the whole table at all? Why not
check one block and infer all information from that? Because we know that
isn't enough data. In a table of 4.6 million rows, can you say with any
mathematical certainty that a sample of 100 points can be, in any way,
representative?
Another problem with random sampling is trend analysis. Oftentimes there
are minor trends in data. Ron pointed out the lastname/firstname trend:
although there seems to be no correlation between firstnames in the table,
there are clearly groups or clusters of ordered data, an ordering
that is missed by too small a sample.
I understand why you chose the Vitter algorithm, because it provides a
basically sound methodology for sampling without knowledge of the size of
the whole, but I think we can do better. I would suggest using the current
algorithm the first time through, then adjust the number of samples [n]
based on the previous estimate of the size of the table [N]. Each
successive ANALYZE will become more accurate. The Vitter algorithm is
still useful as [N] will always be an estimate.
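For reference, the row-selection stage under discussion is reservoir sampling in Vitter's style. A minimal sketch of the simple variant (Algorithm R; as I recall, analyze.c actually uses the optimized Algorithm Z, with block-level sampling on top — this is only to show the shape of the technique):

```python
import random

def reservoir_sample(rows, n):
    """Vitter-style reservoir sampling (simple Algorithm R): a uniform
    n-row sample from a stream of unknown length, in one pass."""
    sample = []
    for t, row in enumerate(rows):
        if t < n:
            sample.append(row)
        else:
            # Keep row t with probability n/(t+1), displacing a
            # uniformly chosen current occupant of the reservoir.
            j = random.randrange(t + 1)
            if j < n:
                sample[j] = row
    return sample

random.seed(0)
s = reservoir_sample(range(1_000_000), 100)
print(len(s))  # always exactly 100, no matter how big the table
```

Note the sample size n is fixed up front; the stream length never changes it, which is precisely the property being debated here.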
pgsql@mohawksoft.com writes:
On a very basic level, why bother sampling the whole table at all? Why not
check one block and infer all information from that? Because we know that
isn't enough data. In a table of 4.6 million rows, can you say with any
mathematical certainty that a sample of 100 points can be, in any way,
representative?
This is a statistical argument, not a rhetorical one, and I'm not going
to bother answering handwaving. Show me some mathematical arguments for
a specific sampling rule and I'll listen.
regards, tom lane
pgsql@mohawksoft.com writes:
On a very basic level, why bother sampling the whole table at all? Why not
check one block and infer all information from that? Because we know that
isn't enough data. In a table of 4.6 million rows, can you say with any
mathematical certainty that a sample of 100 points can be, in any way,
representative?
This is a statistical argument, not a rhetorical one, and I'm not going
to bother answering handwaving. Show me some mathematical arguments for
a specific sampling rule and I'll listen.
Tom, I am floored by this response, I am shaking my head in disbelief.
It is inarguable that increasing the sample size increases the accuracy of
a study, especially when the diversity of the subject is unknown. It is known
that reducing a sample size increases the probability of error in any poll or
study. The required sample size depends on the variance of the whole. It
is mathematically unsound to ASSUME any sample size is valid without
understanding the standard deviation of the set.
http://geographyfieldwork.com/MinimumSampleSize.htm
Again, I understand why you used the Vitter algorithm, but it has been
proven insufficient (as used) with the US Census TIGER database. We
understand this because we have seen that the random sampling as
implemented has insufficient information to properly characterize the
variance in the data.
On Mon, Feb 07, 2005 at 11:27:59 -0500,
pgsql@mohawksoft.com wrote:
It is inarguable that increasing the sample size increases the accuracy of
a study, especially when diversity of the subject is unknown. It is known
that reducing a sample size increases probability of error in any poll or
study. The required sample size depends on the variance of the whole. It
is mathmatically unsound to ASSUME any sample size is valid without
understanding the standard deviation of the set.
For large populations, the accuracy of estimates of statistics based on random
samples from that population is not very sensitive to population size and
depends primarily on the sample size. So you would not expect to need
to use larger sample sizes on larger data sets, for data sets over some
minimum size.
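The textbook version of this claim can be made concrete (standard survey-sampling formulas, nothing PostgreSQL-specific): the population size N enters the standard error of a sample estimate only through the finite population correction, which is essentially 1 once N is large.

```python
import math

def stderr_proportion(p, n, N):
    """Standard error of a sample proportion: sqrt(p(1-p)/n) times the
    finite population correction -- the only place table size N enters."""
    fpc = math.sqrt((N - n) / (N - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

# Same 3000-row sample against very different table sizes:
for N in (10_000, 1_000_000, 46_000_000):
    print(N, round(stderr_proportion(0.5, 3000, N), 5))
# The error barely moves between the 1M-row and 46M-row cases.
```

The correction only matters when the sample is a large fraction of the table (the 10,000-row case), where it makes the estimate better, not worse.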
On Mon, Feb 07, 2005 at 11:27:59 -0500,
pgsql@mohawksoft.com wrote:
It is inarguable that increasing the sample size increases the accuracy of
a study, especially when the diversity of the subject is unknown. It is known
that reducing a sample size increases the probability of error in any poll or
study. The required sample size depends on the variance of the whole. It
is mathematically unsound to ASSUME any sample size is valid without
understanding the standard deviation of the set.
For large populations, the accuracy of estimates of statistics based on
random samples from that population is not very sensitive to population size
and depends primarily on the sample size. So you would not expect to need
to use larger sample sizes on larger data sets, for data sets over some
minimum size.
That assumes a fairly low standard deviation. If the standard deviation is
low, then a minimal sample size works fine. If there was zero deviation in
the data, then a sample of one works fine.
If the standard deviation is high, then you need more samples. If you have
a high standard deviation and a large data set, you need more samples than
you would need for a smaller data set.
In the current implementation of analyze.c, the default is 100 samples. On
a table of 10,000 rows, that is probably a good number to characterize the
data well enough for the query optimizer (a 1% sample). For a table with 4.6
million rows, that's about 0.002%.
Think about an irregularly occurring event, unevenly distributed throughout
the data set. A randomized sample strategy normalized across the whole
data set with too few samples will mischaracterize the event or even miss
it altogether.
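That miss risk is easy to put a number on. A back-of-the-envelope sketch (my own arithmetic under a simplifying with-replacement assumption; the 1000-row cluster size is made up):

```python
def miss_probability(f, n):
    """Chance that a uniform sample of n rows sees no row at all from a
    feature occupying fraction f of the table (sampling w/ replacement)."""
    return (1 - f) ** n

# Hypothetical: a 1000-row zip code in a ~46M-row table.
f = 1000 / 46_000_000
for n in (100, 3000, 100_000):
    print(n, round(miss_probability(f, n), 4))
```

With 3000 sampled rows the cluster is missed over 90% of the time; even 100,000 rows still miss it more than a tenth of the time. Whether that matters depends on whether the histogram needs to see such clusters at all.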
On Mon, Feb 07, 2005 at 13:28:04 -0500,
For large populations, the accuracy of estimates of statistics based on
random samples from that population is not very sensitive to population size
and depends primarily on the sample size. So you would not expect to need
to use larger sample sizes on larger data sets, for data sets over some
minimum size.
That assumes a fairly low standard deviation. If the standard deviation is
low, then a minimal sample size works fine. If there was zero deviation in
the data, then a sample of one works fine.
This doesn't assume a low standard deviation. That wouldn't make sense
anyway, since the standard deviation depends on the units used for
the measurements.
In the current implementation of analyze.c, the default is 100 samples. On
a table of 10,000 rows, that is probably a good number to characterize the
data well enough for the query optimizer (a 1% sample). For a table with 4.6
million rows, that's about 0.002%.
The fraction of rows isn't relevant unless it is a large fraction of the
total, in which case you can use a smaller sample than you might otherwise.
Think about an irregularly occurring event, unevenly distributed throughout
the data set. A randomized sample strategy normalized across the whole
data set with too few samples will mischaracterize the event or even miss
it altogether.
What you are saying here is that if you want more accurate statistics, you
need to sample more rows. That is true. However, the size of the sample
is essentially only dependent on the accuracy you need and not the size
of the population, for large populations.
Maybe I am missing something - ISTM that you can increase your
statistics target for those larger tables to obtain a larger (i.e.
better) sample.
regards
Mark
On Mon, Feb 07, 2005 at 13:28:04 -0500,
What you are saying here is that if you want more accurate statistics, you
need to sample more rows. That is true. However, the size of the sample
is essentially only dependent on the accuracy you need and not the size
of the population, for large populations.
That's nonsense.
If your total data size is 100 elements in a set, then a sample size of
100 elements will cover 100% of your data.
If your total data size is 10,000 elements in a set, then a sample size of
100 elements will cover 1% of your data.
In the case of the TIGER database, with the base of 100 samples about .002%
of the data is sampled. Think about that: that is an average of 1 sample for
about every 50,000 records. You could have substantial but irregular
trends in the data that may never get detected, and this is EXACTLY what
we see. If we increase the sample size (targrows), the statistics suddenly
work better.
For instance, look at the data below.
The first analyze / select from pg_stats is with an analyze of 3000
samples. The zipl and zipr columns get calculated poorly and can cause the
planner to use a table scan instead of an index scan.
The second analyze / select from pg_stats is with an analyze of 10000
samples. The zipl and zipr n_distinct values are still off by a factor of
10, but close enough for the planner to deal.
If the premise is that sample size doesn't make a difference, I think
we've proved that this is not true.
tiger=# analyze verbose rt1;
INFO: analyzing "public.rt1"
INFO: "rt1": scanned 3000 of 1527360 pages, containing 90978 live rows
and 0 dead rows; 3000 rows in sample, 46318719 estimated total rows
ANALYZE
tiger=# select * from pg_stats where tablename = 'rt1' and attname like
'zip%';
schemaname | tablename | attname | null_frac | avg_width | n_distinct |
most_common_vals |
most_common_freqs
| histogram_bounds
| correlation
------------+-----------+---------+-----------+-----------+------------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+-------------
public | rt1 | zipl | 0.672 | 4 | 960 |
{76240,52601,55746,71730,74604,92705,93117,95818} |
{0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
|
{1085,16652,28206,33412,43147,49428,58801,68110,77515,91340,99006} |
-0.119519
public | rt1 | zipr | 0.677 | 4 | 960 |
{76240,52601,55746,71730,74604,78577,92705,93117,95818} |
{0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
| {962,15613,28572,33606,43545,49428,60423,68064,77040,91340,99006} |
-0.104158
(2 rows)
Now this:
tiger=# analyze verbose rt1;
INFO: analyzing "public.rt1"
INFO: "rt1": scanned 10000 of 1527360 pages, containing 303419 live rows
and 0 dead rows; 10000 rows in sample, 46343004 estimated total rows
ANALYZE
tiger=# select * from pg_stats where tablename = 'rt1' and attname like
'zip%';
schemaname | tablename | attname | null_frac | avg_width | n_distinct |
most_common_vals |
most_common_freqs |
histogram_bounds | correlation
------------+-----------+---------+-----------+-----------+------------+---------------------------------------------------------------+-------------------------------------------------------------------------+-------------------------------------------------------------------+-------------
public | rt1 | zipl | 0.6807 | 4 | 2942 |
{61832,13090,17404,30907,31204,45342,47714,63050,80918,93726} |
{0.0008,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} |
{654,15018,28208,33870,43006,49008,59741,68803,78640,92105,99687} |
-0.137744
public | rt1 | zipr | 0.684 | 4 | 2921 |
{13090,61832,30907,31204,45342,47714,63050,70122,80918,93726} |
{0.0006,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} |
{731,14824,27871,33324,42276,48895,58401,68338,78575,92105,99654} |
-0.140663
(2 rows)
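The n_distinct underestimates above are about what you'd expect from a sample-based distinct estimator when nearly every sampled zip appears only once or twice. If I'm reading analyze.c correctly, it uses the "Duj1" estimator of Haas and Stokes; a sketch with hypothetical counts (d and f1 below are made up, merely shaped like the 3000-row sample):

```python
def haas_stokes_duj1(n, N, d, f1):
    """Duj1 estimator (Haas & Stokes): n = sample rows, N = estimated
    total rows, d = distinct values in sample, f1 = values seen once."""
    return n * d / (n - f1 + f1 * n / N)

# Hypothetical counts: 900 distinct zips in a 3000-row sample,
# 800 of them seen exactly once.
est = haas_stokes_duj1(n=3000, N=46_000_000, d=900, f1=800)
print(round(est))  # well under the ~35,000 true distinct zips
```

When almost everything in the sample is a singleton, the estimator stays near the sample's own distinct count, so a bigger sample helps directly — consistent with the 3000-row vs 10000-row results shown.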
On Mon, Feb 07, 2005 at 05:16:56PM -0500, pgsql@mohawksoft.com wrote:
On Mon, Feb 07, 2005 at 13:28:04 -0500,
What you are saying here is that if you want more accurate statistics, you
need to sample more rows. That is true. However, the size of the sample
is essentially only dependent on the accuracy you need and not the size
of the population, for large populations.
That's nonsense.
Huh, have you studied any statistics?
--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"At least to kernel hackers, who really are human, despite occasional
rumors to the contrary" (LWN.net)
Maybe I am missing something - ISTM that you can increase your
statistics target for those larger tables to obtain a larger (i.e.
better) sample.
No one is arguing that you can't manually do things, but I am not the
first to notice this. I saw the query planner doing something completely
stupid and set off to discover why.
Think about the person using PostgreSQL for the first time. He/she does
not know about this stuff. Even if they've read the FAQs and the manual
cover to cover, it will take them some time to figure out how it all works
together. PostgreSQL is a big system, and this is exactly why MySQL gets
better marks from newbies.
In this case, the behavior observed could be changed by altering the
sample size for a table. I submit that an arbitrary fixed sample size is
not a good base for the analyzer, but that the sample size should be based
on the size of the table or some calculation of its deviation.
There is no reason why old stats can't be used to create more accurate
stats. Using successive analyze operations, we could create better
statistics for the planner. We can increase the sample size based on the
table size. We could, I suppose, also calculate some sort of deviation
statistic so that "n_distinct" can be calculated better with a smaller
sample set.
The basic problem, though, is that PostgreSQL performed incorrectly on a
simple query after indexes were created and analyze performed. Yes, it can
be corrected, that's what led me to my conclusions, but shouldn't we try
to devise a better system in the future to improve PostgreSQL so it does
not need this sort of tuning?