select distinct, index not used

Started by Thomas Guettleralmost 17 years ago6 messagesgeneral
Jump to latest
#1Thomas Guettler
hv@tbz-pariv.de

Hi,

why does the statement take so long? The column 'lieferant' is indexed. But
a sequential scan gets done.

foo_egs_foo=# explain analyze SELECT DISTINCT "foo_abc_abc"."lieferant" FROM "foo_abc_abc";
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=3361064.73..3438087.78 rows=7 width=8) (actual time=127133.435..127491.937 rows=34 loops=1)
-> Sort (cost=3361064.73..3399576.26 rows=15404611 width=8) (actual time=127133.429..127322.101 rows=115830 loops=1)
Sort Key: lieferant
-> Seq Scan on foo_abc_abc (cost=0.00..468944.11 rows=15404611 width=8) (actual time=0.029..125458.870
rows=115830 loops=1)
Total runtime: 127609.737 ms
(5 Zeilen)

foo_egs_foo=# \d foo_abc_abc
Tabelle »public.foo_abc_abc«
Spalte | Typ | Attribute
-------------------------+------------------------+--------------------------------------------------------------
id | integer | not null default nextval('foo_abc_abc_id_seq'::regclass)
...
lieferant | character varying(32) | not null

Indexe:
»foo_abc_abc_pkey« PRIMARY KEY, btree (id)
»foo_abc_abc_lieferant« btree (lieferant)
..

version
------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.2.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (SUSE Linux)
(1 Zeile)

--
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de

#2Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Thomas Guettler (#1)
Re: select distinct, index not used

wasn't that improved now in 8.4, and before that on 8.3 ?
still there are some funny things with distinct/group by .
for instance, try
select count(distinct foo) from bar; vs
select count(1) from (select distinct foo from bar) f;

:)

I am not sure it pg was able to use index for that in 8.2.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Guettler (#1)
Re: select distinct, index not used

Thomas Guettler <hv@tbz-pariv.de> writes:

why does the statement take so long? The column 'lieferant' is indexed. But
a sequential scan gets done.

It might have something to do with the fact that the planner's idea of
the size of the table is off by a factor of more than 100:

-> Seq Scan on foo_abc_abc (cost=0.00..468944.11 rows=15404611 width=8) (actual time=0.029..125458.870 rows=115830 loops=1)

You might need to review your vacuuming policy.

(However, a full table indexscan isn't going to be particularly fast in
any case; it's often the case that seqscan-and-sort is the right
decision. I'm not sure this choice was wrong.)

regards, tom lane

#4Sam Mason
sam@samason.me.uk
In reply to: Tom Lane (#3)
Re: select distinct, index not used

On Thu, Apr 16, 2009 at 11:29:25AM -0400, Tom Lane wrote:

, a full table indexscan isn't going to be particularly fast in
any case; it's often the case that seqscan-and-sort is the right
decision.

Is PG capable of "skipping" over duplicate values using an index?

For example, if I've got a table like:

CREATE TABLE foo (
id INTEGER PRIMARY KEY,
v1 BOOLEAN
);

that contains several million rows and I do a query like:

SELECT DISTINCT v1 FROM foo;

PG should only need to read three tuples from the table (assuming there
are no dead rows). I've had a look in the TODO, but haven't found
anything similar. This is obviously only a win when there are few
distinct values from compared to the number of rows.

--
Sam http://samason.me.uk/

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sam Mason (#4)
Re: select distinct, index not used

Sam Mason <sam@samason.me.uk> writes:

On Thu, Apr 16, 2009 at 11:29:25AM -0400, Tom Lane wrote:

, a full table indexscan isn't going to be particularly fast in
any case; it's often the case that seqscan-and-sort is the right
decision.

Is PG capable of "skipping" over duplicate values using an index?

No, not at present. It's on the TODO list.

regards, tom lane

#6Thomas Guettler
hv@tbz-pariv.de
In reply to: Tom Lane (#3)
Re: select distinct, index not used

Thank you Tom. The cron job for vacuum+analyze was not installed on the host.
(I had this idea some seconds after posting)

After vacuum+analyze the performance is good. I am happy.

Nevertheless, on a different host with nearly the same data, a index scan is used.

foo_hostone_foo=# explain analyze SELECT DISTINCT "foo_abc_abc"."lieferant" FROM "foo_abc_abc";
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=15241.56..15820.71 rows=15 width=8) (actual time=1878.213..2393.550 rows=34 loops=1)
-> Sort (cost=15241.56..15531.13 rows=115830 width=8) (actual time=1878.207..2227.478 rows=115830 loops=1)
Sort Key: lieferant
-> Seq Scan on foo_abc_abc (cost=0.00..3518.30 rows=115830 width=8) (actual time=0.042..226.883 rows=115830
loops=1)
Total runtime: 2394.960 ms
(5 Zeilen)

foo_hostone_foo=# select version();
version
------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.2.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (SUSE Linux)
(1 Zeile)

foo_hosttwo_foo=# explain analyze SELECT DISTINCT "foo_abc_abc"."lieferant" FROM "foo_abc_abc";
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..65641.70 rows=19 width=18) (actual time=0.163..1490.106 rows=68 loops=1)
-> Index Scan using foo_abc_abc_lieferant on foo_abc_abc (cost=0.00..64536.38 rows=442127 width=18) (actual
time=0.155..955.844 rows=227600 loops=1)
Total runtime: 1490.481 ms
(3 Zeilen)

foo_hosttwo_foo=# select version();
version
--------------------------------------------------------------------------------------------
PostgreSQL 8.2.6 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.2.1 (SUSE Linux)
(1 Zeile)

Tom Lane schrieb:

Thomas Guettler <hv@tbz-pariv.de> writes:

why does the statement take so long? The column 'lieferant' is indexed. But
a sequential scan gets done.

It might have something to do with the fact that the planner's idea of
the size of the table is off by a factor of more than 100:

-> Seq Scan on foo_abc_abc (cost=0.00..468944.11 rows=15404611 width=8) (actual time=0.029..125458.870 rows=115830 loops=1)

You might need to review your vacuuming policy.

(However, a full table indexscan isn't going to be particularly fast in
any case; it's often the case that seqscan-and-sort is the right
decision. I'm not sure this choice was wrong.)

regards, tom lane

--
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de