Increasing statistics results in worse estimates
I'm having a problem with analyze in 8.0.2 that is
really bothering me. I have a table that contains
address, city and state with about 7.8m rows in it.
On that table I have two non-unique indexes for city
and zipcode respectively. This table was loaded in a
manner such that it is essentially sorted by zipcode.
Sorting by zipcode implies that there should exist a
pretty strong correlation on the city column as well.
With a statistics target of 50 on city I'm getting
good estimates for row counts for arbitrary cities
(ie: explain select count(*) from addresses where city
= 'DALLAS' estimates 474k rows out of 500k actual) but
a poor estimate for the correlation (0.13 according to
pg_stats). This seems to be causing the planner to
pick a table scan for "select count(*) from test_zipc
where city = 'DALLAS' (est 474k rows)" vs picking an
index scan for "select count(*) from test_zipc where
zipcode like '75%' (est 2m rows)".
Increasing the statistics target on the city column
and re-analyzing the table seems to make the
correlation estimate better (shows about 0.5) but the
row count estimates are thrown off by 2 orders of
magnitude in some cases. Repeating the above queries
I get a row estimate of 8k for "select count(*) from
test_zipc where city = 'DALLAS'" and a row estimate of
6m for the "select count(*) from test_zipc where
zipcode like '75%'". In this case, the planner picked
an index scan for the city = 'X' condition but for
what I feel are the wrong reasons because it under
estimated the row count. Re-analyzing the table
multiple times always shows about an 8k estimate.
Is my data set that promblematic? Has anyone seen
similar behavior? Any suggestions on how to improve
these stats?
Regards,
Shelby Cain
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Shelby Cain <alyandon@yahoo.com> writes:
Increasing the statistics target on the city column
and re-analyzing the table seems to make the
correlation estimate better (shows about 0.5) but the
row count estimates are thrown off by 2 orders of
magnitude in some cases.
What did you increase it to, exactly? Could we see the contents of
pg_stats for these two columns at both target settings?
regards, tom lane
Shelby Cain <alyandon@yahoo.com> writes:
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:What did you increase it to, exactly? Could we see
the contents of
pg_stats for these two columns at both target
settings?
Generally, the more I increased the stats target the
better the correlation estimate and the worse the row estimate.
Hm. I should have realized why correlation wouldn't be high for the
city name: given the ordering by zipcode, city name values may be
pretty well clumped, but they aren't in any kind of alphabetical
order --- and it's the overall ordering, not the clumping, that
correlation measures.
However, there is something absolutely wacko about the stats collection
process here ... you've got fairly reasonable looking results for
most-common-values of city name at the lower end of the stats settings
(HOUSTON and DALLAS are the most common, sounds about right) ... but at
the higher settings the ordering of most-common entries just goes nuts.
We've got some kind of bug there.
What exactly are you changing in the different cases ---
default_statistics_target, or are you doing an ALTER TABLE on some
of the columns (if so which)?
It might be easier to debug this if you could send me the test case.
Any problem with sending just the city name and zipcode columns
of the table (offlist of course)? COPY TO with a column list can
extract that for you.
regards, tom lane
Import Notes
Reply to msg id not found: 20050429143457.61978.qmail@web50102.mail.yahoo.comReference msg id not found: 20050429143457.61978.qmail@web50102.mail.yahoo.com | Resolved by subject fallback
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm. I should have realized why correlation wouldn't
be high for the
city name: given the ordering by zipcode, city name
values may be
pretty well clumped, but they aren't in any kind of
alphabetical
order --- and it's the overall ordering, not the
clumping, that
correlation measures.
Ah. Localized clumping != Overall ordering. Thanks
for the clarification.
However, there is something absolutely wacko about
the stats collection
process here ... you've got fairly reasonable
looking results for
most-common-values of city name at the lower end of
the stats settings
(HOUSTON and DALLAS are the most common, sounds
about right) ... but at
the higher settings the ordering of most-common
entries just goes nuts.
We've got some kind of bug there.
I had noticed that as well but wasn't sure about the
whether MCV really meant what I thought it did.
What exactly are you changing in the different cases
---
default_statistics_target, or are you doing an ALTER
TABLE on some
of the columns (if so which)?
I have a setting of 30 for default_statistics_target
and I am manipulating the statistics target for city
by alter table.
It might be easier to debug this if you could send
me the test case.
Any problem with sending just the city name and
zipcode columns
of the table (offlist of course)? COPY TO with a
column list can
extract that for you.
I had already removed proprietary data to try and
whittle down the number of columns I needed to
demonstrate the weirdness so I can host a dump of the
table. However, before I take that step I should
mention that this is the native Windows port so if
that changes anything let me know.
Regards,
Shelby Cain
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Import Notes
Reply to msg id not found: 6667 | Resolved by subject fallback
Shelby Cain <alyandon@yahoo.com> writes:
I had already removed proprietary data to try and
whittle down the number of columns I needed to
demonstrate the weirdness so I can host a dump of the
table. However, before I take that step I should
mention that this is the native Windows port so if
that changes anything let me know.
Well, if I can't reproduce the misbehavior on a Unix machine then
there'll be reason to suspect a platform-specific bug ... but it's
still a bug.
It'll be important to know the exact datatypes of the columns,
as well as the database locale and encoding you are using.
regards, tom lane
[ redirecting to pgsql-hackers-win32 ]
Shelby Cain <alyandon@yahoo.com> writes:
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:However, there is something absolutely wacko about
the stats collection
process here ... you've got fairly reasonable
looking results for
most-common-values of city name at the lower end of
the stats settings
(HOUSTON and DALLAS are the most common, sounds
about right) ... but at
the higher settings the ordering of most-common
entries just goes nuts.
We've got some kind of bug there.
I had noticed that as well but wasn't sure about the
whether MCV really meant what I thought it did.
It might be easier to debug this if you could send
me the test case.
I had already removed proprietary data to try and
whittle down the number of columns I needed to
demonstrate the weirdness so I can host a dump of the
table. However, before I take that step I should
mention that this is the native Windows port so if
that changes anything let me know.
Thanks for sending me the test data. The bad news is that I can't
reproduce any strange behavior here: the stats get marginally more
accurate as the target goes up, just as you'd expect. So it would
seem there is something broken about ANALYZE on Windows. There's
not anything magic about this particular dataset, AFAICS.
Which Windows build are you using, exactly?
Can anyone else reproduce a problem with ANALYZE producing silly
most-common-values stats at higher statistics targets? The original
thread is here:
http://archives.postgresql.org/pgsql-general/2005-04/msg01368.php
regards, tom lane