count distinct slow?

Started by Roger Packover 11 years ago2 messagesgeneral
Jump to latest
#1Roger Pack
rogerdpack2@gmail.com

Hello.

As a note, I ran into the following today (doing a select distinct is fast,
doing a count distinct is significantly slower?)

assume a table "issue" with a COLUMN nodename character varying(64);, 7.5M
rows...

select distinct substring(nodename from 1 for 9) from issue;

-- 5.8s

select count(distinct substring(nodename from 1 for 9)) from issue;

-- 190s

SELECT COUNT(*) FROM (SELECT DISTINCT substring(nodename from 1 for 9) FROM
issue) as temp;

-- 5.5s

I have an index on nodename's substring 1 for 9.

It struck me as odd that a count distinct would be far slower than
selecting distinct rows themselves. Apparently there are other workarounds
people have come up with as well [1]http://stackoverflow.com/questions/11250253/postgresql-countdistinct-very-slow/14732410#14732410. Just mentioning in case it's helpful.
Cheers!
-roger-

[1]: http://stackoverflow.com/questions/11250253/postgresql-countdistinct-very-slow/14732410#14732410
http://stackoverflow.com/questions/11250253/postgresql-countdistinct-very-slow/14732410#14732410

explains:

explain analyze select count(distinct substring(nodename from 1 for 9))
from issue;

Aggregate (cost=222791.77..222791.78 rows=1 width=16) (actual
time=190641.069..190641.071 rows=1 loops=1)
-> Seq Scan on issue (cost=0.00..185321.51 rows=7494051 width=16)
(actual time=0.016..3487.694 rows=7495551 loops=1)
Total runtime: 190641.182 ms

explain analyze select distinct substring(nodename from 1 for 9) from
issue;

HashAggregate (cost=222791.77..222846.45 rows=4375 width=16) (actual
time=6276.578..6278.004 rows=6192 loops=1)
-> Seq Scan on issue (cost=0.00..204056.64 rows=7494051 width=16)
(actual time=0.058..4293.976 rows=7495551 loops=1)
Total runtime: 6278.564 ms

explain analyze SELECT COUNT(*) FROM (SELECT DISTINCT substring(nodename
from 1 for 9) FROM issue) as temp;

Aggregate (cost=222901.14..222901.15 rows=1 width=0) (actual
time=5195.025..5195.025 rows=1 loops=1)
-> HashAggregate (cost=222791.77..222846.45 rows=4375 width=16) (actual
time=5193.121..5194.454 rows=6192 loops=1)
-> Seq Scan on issue (cost=0.00..204056.64 rows=7494051 width=16)
(actual time=0.035..3402.834 rows=7495551 loops=1)
Total runtime: 5195.160 ms

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Roger Pack (#1)
Re: count distinct slow?

Roger Pack <rogerdpack2@gmail.com> writes:

As a note, I ran into the following today (doing a select distinct is fast,
doing a count distinct is significantly slower?)

The planner appears to prefer hash aggregation for the variants of
your query wherein the DISTINCT becomes a separate plan step. This
is evidently a good choice, with only 6192 distinct values (hence
just that many hash table entries) in 7495551 input rows. However,
COUNT(DISTINCT), or any other aggregate with a DISTINCT tag, uses
sort-then-remove-adjacent-duplicates logic for DISTINCT. That's
evidently a good deal slower for your data set; most likely the data
doesn't fit in your work_mem setting so the sort spills to disk.

The reason DISTINCT aggregates don't consider hash aggregation is
partly lack of round tuits but mostly that an aggregate needs to
execute in a fairly limited amount of memory, and we can't be sure
that the hash table wouldn't get unreasonably large.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general