select distinct, index not used

Started by Thomas Guettlerabout 17 years ago6 messagesgeneral

Jump to latest

Thomas Guettler

hv@tbz-pariv.de

about 17 years ago

Hi,

why does the statement take so long? The column 'lieferant' is indexed. But
a sequential scan gets done.

foo_egs_foo=# explain analyze SELECT DISTINCT "foo_abc_abc"."lieferant" FROM "foo_abc_abc";
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=3361064.73..3438087.78 rows=7 width=8) (actual time=127133.435..127491.937 rows=34 loops=1)
-> Sort (cost=3361064.73..3399576.26 rows=15404611 width=8) (actual time=127133.429..127322.101 rows=115830 loops=1)
Sort Key: lieferant
-> Seq Scan on foo_abc_abc (cost=0.00..468944.11 rows=15404611 width=8) (actual time=0.029..125458.870
rows=115830 loops=1)
Total runtime: 127609.737 ms
(5 Zeilen)

foo_egs_foo=# \d foo_abc_abc
Tabelle »public.foo_abc_abc«
Spalte | Typ | Attribute
-------------------------+------------------------+--------------------------------------------------------------
id | integer | not null default nextval('foo_abc_abc_id_seq'::regclass)
...
lieferant | character varying(32) | not null

Indexe:
»foo_abc_abc_pkey« PRIMARY KEY, btree (id)
»foo_abc_abc_lieferant« btree (lieferant)
..

version
------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.2.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (SUSE Linux)
(1 Zeile)

--
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de

Grzegorz Jaśkiewicz

gryzman@gmail.com

about 17 years ago

In reply to: Thomas Guettler (#1)

Re: select distinct, index not used

wasn't that improved now in 8.4, and before that on 8.3 ?
still there are some funny things with distinct/group by .
for instance, try
select count(distinct foo) from bar; vs
select count(1) from (select distinct foo from bar) f;

I am not sure it pg was able to use index for that in 8.2.

Tom Lane

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Thomas Guettler (#1)

Re: select distinct, index not used

Thomas Guettler <hv@tbz-pariv.de> writes:

why does the statement take so long? The column 'lieferant' is indexed. But
a sequential scan gets done.

It might have something to do with the fact that the planner's idea of
the size of the table is off by a factor of more than 100:

-> Seq Scan on foo_abc_abc (cost=0.00..468944.11 rows=15404611 width=8) (actual time=0.029..125458.870 rows=115830 loops=1)

You might need to review your vacuuming policy.

(However, a full table indexscan isn't going to be particularly fast in
any case; it's often the case that seqscan-and-sort is the right
decision. I'm not sure this choice was wrong.)

regards, tom lane

Sam Mason

sam@samason.me.uk

about 17 years ago

In reply to: Tom Lane (#3)

Re: select distinct, index not used

On Thu, Apr 16, 2009 at 11:29:25AM -0400, Tom Lane wrote:

, a full table indexscan isn't going to be particularly fast in
any case; it's often the case that seqscan-and-sort is the right
decision.

Is PG capable of "skipping" over duplicate values using an index?

For example, if I've got a table like:

CREATE TABLE foo (
id INTEGER PRIMARY KEY,
v1 BOOLEAN
);

that contains several million rows and I do a query like:

SELECT DISTINCT v1 FROM foo;

PG should only need to read three tuples from the table (assuming there
are no dead rows). I've had a look in the TODO, but haven't found
anything similar. This is obviously only a win when there are few
distinct values from compared to the number of rows.

--
Sam http://samason.me.uk/

Tom Lane

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Sam Mason (#4)

Re: select distinct, index not used

Sam Mason <sam@samason.me.uk> writes:

On Thu, Apr 16, 2009 at 11:29:25AM -0400, Tom Lane wrote:

, a full table indexscan isn't going to be particularly fast in
any case; it's often the case that seqscan-and-sort is the right
decision.

Is PG capable of "skipping" over duplicate values using an index?

No, not at present. It's on the TODO list.

regards, tom lane

Thomas Guettler

hv@tbz-pariv.de

about 17 years ago

In reply to: Tom Lane (#3)

Re: select distinct, index not used

Thank you Tom. The cron job for vacuum+analyze was not installed on the host.
(I had this idea some seconds after posting)

After vacuum+analyze the performance is good. I am happy.

Nevertheless, on a different host with nearly the same data, a index scan is used.

foo_hostone_foo=# explain analyze SELECT DISTINCT "foo_abc_abc"."lieferant" FROM "foo_abc_abc";
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=15241.56..15820.71 rows=15 width=8) (actual time=1878.213..2393.550 rows=34 loops=1)
-> Sort (cost=15241.56..15531.13 rows=115830 width=8) (actual time=1878.207..2227.478 rows=115830 loops=1)
Sort Key: lieferant
-> Seq Scan on foo_abc_abc (cost=0.00..3518.30 rows=115830 width=8) (actual time=0.042..226.883 rows=115830
loops=1)
Total runtime: 2394.960 ms
(5 Zeilen)

foo_hostone_foo=# select version();
version
------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.2.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (SUSE Linux)
(1 Zeile)

foo_hosttwo_foo=# explain analyze SELECT DISTINCT "foo_abc_abc"."lieferant" FROM "foo_abc_abc";
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..65641.70 rows=19 width=18) (actual time=0.163..1490.106 rows=68 loops=1)
-> Index Scan using foo_abc_abc_lieferant on foo_abc_abc (cost=0.00..64536.38 rows=442127 width=18) (actual
time=0.155..955.844 rows=227600 loops=1)
Total runtime: 1490.481 ms
(3 Zeilen)

foo_hosttwo_foo=# select version();
version
--------------------------------------------------------------------------------------------
PostgreSQL 8.2.6 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.2.1 (SUSE Linux)
(1 Zeile)

Tom Lane schrieb:

Thomas Guettler <hv@tbz-pariv.de> writes:

why does the statement take so long? The column 'lieferant' is indexed. But
a sequential scan gets done.

It might have something to do with the fact that the planner's idea of
the size of the table is off by a factor of more than 100:

-> Seq Scan on foo_abc_abc (cost=0.00..468944.11 rows=15404611 width=8) (actual time=0.029..125458.870 rows=115830 loops=1)

You might need to review your vacuuming policy.

(However, a full table indexscan isn't going to be particularly fast in
any case; it's often the case that seqscan-and-sort is the right
decision. I'm not sure this choice was wrong.)

regards, tom lane

--
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de