Index usage vs large repetitions of key
I have a table with a "year" column.
A count grouped by year is as follows:
=> select year, count(*) from ystats group by year;
year | count
------+--------
1992 | 367191
1993 | 341324
1994 | 324867
1995 | 314674
1996 | 307149
1997 | 303540
1998 | 298040
1999 | 300269
2000 | 302648
2001 | 305797
2002 | 127445
(11 rows)
I created an index by year, but it never seems to get used, not even with
"set enable_seqscan to off;"
I have numerous queries against this table that only need to access
one or two years.
Am I making the wrong assumption that an index scan would be
significantly more efficient? If so, how do I help the optimizer see this?
On Sat, 4 May 2002 16:25:47 -0400 (EDT)
"Francisco Reyes" <lists@natserv.com> wrote:
I have numerous queries against this table that only need to access
one or two years.
Can you post some of the queries that are problematic, as well as the
output of EXPLAIN ANALYZE for them?
Cheers,
Neil
--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
On Sat, 4 May 2002, Neil Conway wrote:
On Sat, 4 May 2002 16:25:47 -0400 (EDT)
"Francisco Reyes" <lists@natserv.com> wrote:
I have numerous queries against this table that only need to access
one or two years.
Can you post some of the queries that are problematic, as well as the
output of EXPLAIN ANALYZE for them?
output of EXPLAIN ANALYZE for them?
Most of the queries are large and ugly.
Following is a simple one that is very much like the common queries we have.
-- Query
explain analyze
select record_key, ystart, cstart
from ystats, hearn
where year = 2002 and
ystats.record_key = hearn.horse_key and
cstat_date > '1-1-2002'
and ystart <> cstart ;
--
-- Explain Analyze
Hash Join (cost=528.26..101381.81 rows=14 width=12) (actual
time=5237.61..16835.63 rows=69552 loops=1)
-> Seq Scan on ystats (cost=0.00..99960.80 rows=178505 width=6)
(actual time=2049.91..13066.82 rows=127445 loops=1)
-> Hash (cost=527.88..527.88 rows=154 width=6)
(actual time=833.22..833.22 rows=0 loops=1)
-> Index Scan using he_cstat_date on hearn
(cost=0.00..527.88 rows=154 width=6)
(actual time=0.47..568.92 rows=40821 loops=1)
Total runtime: 17525.13 msec
--
The estimate for ystats comes out to 178505, which is not far from the
actual 127445. This is MUCH smaller than the 3 Million + records on that
table, yet the optimizer insists on doing a sequential scan.
The estimate for hearn was also pretty bad. :-(
The optimizer estimated 154 rows and 40,821 were returned.
On Sun, 5 May 2002 felix@crowfix.com wrote:
This sequential scan bugged the heck out of me, until I finally
understood what is going on. Usually records on disk are scattered
all over, so many per data page. I think the standard is 8K. Suppose
your table has 10 records per data page. Roughly 1/20 of the records
will be selected, so it is going to hit about half the data pages. In
this case, it is better to simply march through all data pages
sequentially than to read half the data pages randomly, and also hit a
lot of index pages, also randomly scattered around the disk. It's
better for the OS to read twice as much sequentially compared to half
as much randomly. Throw in more random reading for index pages, and
the sequential scan is a big win.
Thanks for the explanation. So I guess it is a factor of how big each row
is, the percentage of records to be selected and the page size.
The particular queries I am doing would return about 10% of the records,
and using 8K pages there would be about 200 rows per page. Now I better
understand why the index is not being used.
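The arithmetic above can be sketched as a quick back-of-the-envelope model. This is only an idealized estimate assuming rows are scattered uniformly across pages; PostgreSQL's actual cost estimator is more involved:

```python
# Expected fraction of data pages touched by an index scan, assuming
# matching rows are scattered uniformly across all pages (an idealized
# model, not PostgreSQL's real cost function).
def fraction_of_pages_touched(selectivity, rows_per_page):
    # A page is left untouched only if none of its rows match,
    # which happens with probability (1 - selectivity) ** rows_per_page.
    return 1.0 - (1.0 - selectivity) ** rows_per_page

# felix's example: 1/20 of rows selected, 10 rows per page
print(fraction_of_pages_touched(1 / 20, 10))   # roughly 0.40 -- "about half the data pages"

# Francisco's case: 10% of rows selected, ~200 rows per page
print(fraction_of_pages_touched(0.10, 200))    # essentially 1.0 -- nearly every page is hit
```

With ~200 rows per page and 10% selectivity, virtually every data page contains at least one matching row, so a sequential scan reads no more pages than an index scan would, and reads them in order.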
I think there is some way to force an indexed read, but I have
forgotten what little I knew about that. If there is, you could try
both ways and compare timings.
Based on this info it may make sense to let it do the sequential scan.
In the coming months the table in question is going to grow to 3 to 4 times
its current number of records, so at that point the index may make more sense.
Is there a drawback on having the index right now?
I guess it would give the optimizer more work, even though it would
likely not choose the index anyway.
On Tue, 7 May 2002 09:48:13 -0400 (EDT)
"Francisco Reyes" <lists@natserv.com> wrote:
On Sun, 5 May 2002 felix@crowfix.com wrote:
I think there is some way to force an indexed read, but I have
forgotten what little I knew about that. If there is, you could try
both ways and compare timings.
Based on this info it may make sense to let it do the sequential scan.
You can easily test this hypothesis by disabling sequential scans (SET
enable_seqscan = off;), and using EXPLAIN ANALYZE to compare the performance
of the resulting query plan with the one chosen by the planner to
begin with.
In the coming months the table in question is going to grow to 3 to 4 times
its current number of records, so at that point the index may make more sense.
Is there a drawback on having the index right now?
Yes; inserts and updates will need to update the index. Depending on
your queries, this can be a significant performance hit.
I guess it would give the optimizer more work, even though it would
likely not choose the index anyway.
My guess would be that this wouldn't be a very significant factor.
Cheers,
Neil
--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
On Tue, 7 May 2002, Neil Conway wrote:
On Tue, 7 May 2002 09:48:13 -0400 (EDT)
"Francisco Reyes" <lists@natserv.com> wrote:
On Sun, 5 May 2002 felix@crowfix.com wrote:
I think there is some way to force an indexed read, but I have
forgotten what little I knew about that. If there is, you could try
both ways and compare timings.
Based on this info it may make sense to let it do the sequential scan.
You can easily test this hypothesis by disabling sequential scans (SET
enable_seqscan = off;), and using EXPLAIN ANALYZE to compare the performance
of the resulting query plan with the one chosen by the planner to
begin with.
I tried to set enable_seqscan = off and it still did a sequential scan.
Is there a drawback on having the index right now?
Yes; inserts and updates will need to update the index. Depending on
your queries, this can be a significant performance hit.
This is a "reporting" server; I do a set of "copy" jobs once a day,
followed by a vacuum analyze, and a nightly "vacuum full".