Avoiding seq scan over 3.2 million rows

Started by Andrus over 17 years ago · 3 messages · general
#1 Andrus
kobruleht2@hot.ee

explain analyze SELECT sum(xxx)
FROM dok JOIN rid USING (dokumnr)
WHERE dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30'

Hash Join  (cost=29584.84..308259.32 rows=142686 width=0) (actual time=68510.748..96932.174 rows=117883 loops=1)
  Hash Cond: ("outer".dokumnr = "inner".dokumnr)
  ->  Seq Scan on rid  (cost=0.00..195599.68 rows=3217368 width=4) (actual time=17.130..56572.857 rows=3247363 loops=1)
  ->  Hash  (cost=29243.76..29243.76 rows=53231 width=4) (actual time=15878.782..15878.782 rows=44685 loops=1)
        ->  Index Scan using dok_kuupaev_idx on dok  (cost=0.00..29243.76 rows=53231 width=4) (actual time=47.102..15651.059 rows=44685 loops=1)
              Index Cond: ((kuupaev >= '2008-04-01'::date) AND (kuupaev <= '2008-04-30'::date))
Total runtime: 97364.282 ms

The query performs a seq scan over 3.2 million rows.

dokumnr is of type int, kuupaev is of type DATE.
There are regular indexes on

rid(dokumnr)
dok(dokumnr)
dok(kuupaev)

Vacuum is running automatically.
How can I speed up this query?

Andrus.

"PostgreSQL 8.1.4 on i686-pc-linux-gnu, compiled by GCC
i686-pc-linux-gnu-gcc (GCC) 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0,
pie-8.7.9)"

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrus (#1)
Re: Avoiding seq scan over 3.2 million rows

"Andrus" <kobruleht2@hot.ee> writes:

> explain analyze SELECT sum(xxx)
> FROM dok JOIN rid USING (dokumnr)
> WHERE dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30'
>
> Hash Join  (cost=29584.84..308259.32 rows=142686 width=0) (actual time=68510.748..96932.174 rows=117883 loops=1)
>   Hash Cond: ("outer".dokumnr = "inner".dokumnr)
>   ->  Seq Scan on rid  (cost=0.00..195599.68 rows=3217368 width=4) (actual time=17.130..56572.857 rows=3247363 loops=1)
>   ->  Hash  (cost=29243.76..29243.76 rows=53231 width=4) (actual time=15878.782..15878.782 rows=44685 loops=1)
>         ->  Index Scan using dok_kuupaev_idx on dok  (cost=0.00..29243.76 rows=53231 width=4) (actual time=47.102..15651.059 rows=44685 loops=1)
>               Index Cond: ((kuupaev >= '2008-04-01'::date) AND (kuupaev <= '2008-04-30'::date))
> Total runtime: 97364.282 ms
>
> The query performs a seq scan over 3.2 million rows.

There isn't anything particularly wrong with that plan. The alternative
that you seem to be wishing for would involve ~50000 index probes into
"rid", which is hardly going to be free.

You could try reducing random_page_cost to push the planner in the
direction of preferring the indexscan solution, but whether this is
actually better in your situation remains to be seen.
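Tom's suggestion can be tried per-session without touching postgresql.conf; the value 2.0 below is only an illustration and the right setting depends on your hardware:

```sql
-- Lower random_page_cost for this session only (the default is 4.0),
-- which makes index probes look cheaper relative to sequential reads.
SET random_page_cost = 2.0;

EXPLAIN ANALYZE SELECT sum(xxx)
FROM dok JOIN rid USING (dokumnr)
WHERE dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30';

-- Revert to the configured value if it doesn't help:
RESET random_page_cost;
```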

regards, tom lane

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Avoiding seq scan over 3.2 million rows

> "Andrus" <kobruleht2@hot.ee> writes:
>
>> explain analyze SELECT sum(xxx)
>> FROM dok JOIN rid USING (dokumnr)
>> WHERE dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30'
>>
>> Hash Join  (cost=29584.84..308259.32 rows=142686 width=0) (actual time=68510.748..96932.174 rows=117883 loops=1)
>>   Hash Cond: ("outer".dokumnr = "inner".dokumnr)
>>   ->  Seq Scan on rid  (cost=0.00..195599.68 rows=3217368 width=4) (actual time=17.130..56572.857 rows=3247363 loops=1)
>>   ->  Hash  (cost=29243.76..29243.76 rows=53231 width=4) (actual time=15878.782..15878.782 rows=44685 loops=1)
>>         ->  Index Scan using dok_kuupaev_idx on dok  (cost=0.00..29243.76 rows=53231 width=4) (actual time=47.102..15651.059 rows=44685 loops=1)
>>               Index Cond: ((kuupaev >= '2008-04-01'::date) AND (kuupaev <= '2008-04-30'::date))
>> Total runtime: 97364.282 ms
>>
>> The query performs a seq scan over 3.2 million rows.
>
> There isn't anything particularly wrong with that plan. The alternative
> that you seem to be wishing for would involve ~50000 index probes into
> "rid", which is hardly going to be free.
>
> You could try reducing random_page_cost to push the planner in the
> direction of preferring the indexscan solution, but whether this is
> actually better in your situation remains to be seen.

Or just use "enable_seqscan = off" - that should penalize the seq scan and
use an index scan instead. You can then execute the EXPLAIN ANALYZE again
and you'll see how fast the index scan is in this case.
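A sketch of that experiment, scoped to one session so the planner isn't crippled globally:

```sql
-- Disable seq scans for this session only, forcing the planner
-- toward the index-scan plan so its real cost can be measured.
SET enable_seqscan = off;

EXPLAIN ANALYZE SELECT sum(xxx)
FROM dok JOIN rid USING (dokumnr)
WHERE dok.kuupaev BETWEEN '2008-04-01' AND '2008-04-30';

RESET enable_seqscan;
```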

You may also try some "standard" optimizations. I'd try clustering the 'rid'
table on the "dokumnr" column, but I doubt it will outperform the seq scan
in this case. If the rows are 'wide', you can 'compact' the table so that it
reads faster during a seq scan. Try something like

CREATE TABLE rid2 AS SELECT dokumnr, xxx FROM rid;
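The clustering idea above could look like this; the index name rid_dokumnr_idx is a guess, and the CLUSTER syntax shown is the pre-8.3 form matching the 8.1.4 server in question:

```sql
-- Rewrite rid in dokumnr order (hypothetical index name).
-- Note: CLUSTER takes an exclusive lock and the ordering decays
-- as new rows are inserted, so it must be repeated periodically.
CLUSTER rid_dokumnr_idx ON rid;
ANALYZE rid;
```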

regards
Tomas