How to optimize select count(*)..group by?

Started by Bryan Field-Elliot · over 20 years ago · 6 messages · general
#1 Bryan Field-Elliot
bryan_lists@netmeme.org

We have this simple query:

select status, count(*) from customer group by status;

There is already a btree index on status, but, the customer table is
huge, and this query must be executed very frequently... an "explain" on
this query shows that it is quite costly (and we notice it runs
slowly)...

Can someone recommend the best technique to optimize this? We can create
new indices, we can re-write this query.. But we'd rather not add new
tables or columns if possible (not just to solve this problem).

Thank you,

Bryan

#2 David Fetter
david@fetter.org
In reply to: Bryan Field-Elliot (#1)
Re: How to optimize select count(*)..group by?

On Thu, Jul 28, 2005 at 09:19:49AM -0700, Bryan Field-Elliot wrote:

> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an
> "explain" on this query shows that it is quite costly (and we notice
> it runs slowly)...
>
> Can someone recommend the best technique to optimize this? We can
> create new indices, we can re-write this query.. But we'd rather not
> add new tables or columns if possible (not just to solve this
> problem).

You're pretty much stuck with either writing triggers that modify a
cache table or having your performance the way it is now.

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100 mobile: +1 415 235 3778

Remember to vote!

#3 Richard Huxton
dev@archonet.com
In reply to: Bryan Field-Elliot (#1)
Re: How to optimize select count(*)..group by?

Bryan Field-Elliot wrote:

> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain" on
> this query shows that it is quite costly (and we notice it runs slowly)...
>
> Can someone recommend the best technique to optimize this? We can create
> new indices, we can re-write this query.. But we'd rather not add new
> tables or columns if possible (not just to solve this problem).

There's no other accurate solution. While PG's MVCC system means you
need less locking, it makes it tricky to determine whether a row is
visible without actually checking on disk.

Simplest solution - have a "customer_status_summary" table and add a
trigger to the "customer" table to keep it up to date. That way, you'll
have extremely cheap counts. Make sure you understand what locking
you'll need with your particular solution.
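
A minimal sketch of what that could look like; the table, function, and trigger names are all illustrative, not from the original thread. The trigger keeps one counter row per status value, and concurrent writers touching the same status will serialize on the summary row's lock, which is the locking concern mentioned above:

```sql
-- Hypothetical summary table; one row per distinct status.
CREATE TABLE customer_status_summary (
    status  text PRIMARY KEY,
    cnt     bigint NOT NULL DEFAULT 0
);

-- Seed it once from the existing data.
INSERT INTO customer_status_summary
SELECT status, count(*) FROM customer GROUP BY status;

CREATE OR REPLACE FUNCTION customer_status_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        -- Bump the counter for the new status, creating the row if needed.
        UPDATE customer_status_summary SET cnt = cnt + 1
         WHERE status = NEW.status;
        IF NOT FOUND THEN
            INSERT INTO customer_status_summary VALUES (NEW.status, 1);
        END IF;
    END IF;
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        -- Decrement the counter for the old status.
        UPDATE customer_status_summary SET cnt = cnt - 1
         WHERE status = OLD.status;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- UPDATE OF status means the trigger only fires when status changes.
CREATE TRIGGER customer_status_count_trg
AFTER INSERT OR UPDATE OF status OR DELETE ON customer
FOR EACH ROW EXECUTE PROCEDURE customer_status_count();
```

After that, the original query becomes `SELECT status, cnt FROM customer_status_summary;`, which reads a handful of rows instead of scanning the whole table.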

--
Richard Huxton
Archonet Ltd

#4 Bricklen Anderson
banderson@presinet.com
In reply to: David Fetter (#2)
Re: How to optimize select count(*)..group by?

David Fetter wrote:

> On Thu, Jul 28, 2005 at 09:19:49AM -0700, Bryan Field-Elliot wrote:
>
>> We have this simple query:
>>
>> select status, count(*) from customer group by status;
>>
>> There is already a btree index on status, but, the customer table is
>> huge, and this query must be executed very frequently... an
>> "explain" on this query shows that it is quite costly (and we notice
>> it runs slowly)...
>>
>> Can someone recommend the best technique to optimize this? We can
>> create new indices, we can re-write this query.. But we'd rather not
>> add new tables or columns if possible (not just to solve this
>> problem).
>
> You're pretty much stuck with either writing triggers that modify a
> cache table or having your performance the way it is now.
>
> Cheers,
> D

How about the new bitmap index? I wonder if that'll result in better performance
for that type of query?


#5 Bruce Momjian
bruce@momjian.us
In reply to: Bryan Field-Elliot (#1)
Re: How to optimize select count(*)..group by?

Bryan Field-Elliot <bryan_lists@netmeme.org> writes:

> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain" on
> this query shows that it is quite costly (and we notice it runs
> slowly)...
>
> Can someone recommend the best technique to optimize this? We can create
> new indices, we can re-write this query.. But we'd rather not add new
> tables or columns if possible (not just to solve this problem).

Without creating a new table you have three options, none of which will be
very fast:

1) If the entire table fits in ram then you could try setting random_page_cost
close enough to 1 to cause the index to be used.

2) If it doesn't but there are a reasonably small number of distinct values of
status you would hope to see a sequential scan and a hash aggregate being
used. I would expect this to be what you would see with the default
configuration. If not you might have to increase work_mem (or sort_mem
depending on your version I think).

3) If you have a large table with a large set of status values then Postgres
may be compelled to resort the entire table. In which case you should
experiment with work_mem/sort_mem to get the largest value you can without
inducing swapping. You could also experiment with putting pgsql_tmp on
separate spindles.
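
The knobs from options 1-3 can be tried per-session before touching the server config; the numbers below are illustrative starting points only, not recommendations:

```sql
-- Option 1: lower random_page_cost so the planner favors the index.
SET random_page_cost = 1.5;

-- Options 2/3: give the hash aggregate / sort more memory to work with.
-- work_mem is measured in kilobytes here; on pre-8.0 versions the
-- equivalent setting is sort_mem.
SET work_mem = 65536;

-- Check which plan you actually get and how long it takes.
EXPLAIN ANALYZE
SELECT status, count(*) FROM customer GROUP BY status;
```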

None of these will be fast enough for OLTP access like a web page. You would
be best off performing this query periodically and stuffing the results in a
table for quick access.
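
A sketch of that periodic approach, with hypothetical names; the refresh could run from cron or any scheduler. Doing the delete and insert in one transaction means readers never observe an empty table:

```sql
-- Hypothetical snapshot table for the precomputed counts.
CREATE TABLE customer_status_counts (
    status     text,
    cnt        bigint,
    refreshed  timestamp DEFAULT now()
);

-- Periodic refresh: rebuild atomically inside a single transaction.
BEGIN;
DELETE FROM customer_status_counts;
INSERT INTO customer_status_counts (status, cnt)
SELECT status, count(*) FROM customer GROUP BY status;
COMMIT;
```

The web page then reads `customer_status_counts` directly, at the cost of results being as stale as the refresh interval.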

--
greg

#6 Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Bryan Field-Elliot (#1)
Re: How to optimize select count(*)..group by?

On Thu, 2005-07-28 at 11:19, Bryan Field-Elliot wrote:

> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain"
> on this query shows that it is quite costly (and we notice it runs
> slowly)...
>
> Can someone recommend the best technique to optimize this? We can
> create new indices, we can re-write this query.. But we'd rather not
> add new tables or columns if possible (not just to solve this
> problem).

With a query of the form:

select field,count([field|*]) from table WHERE somefield =
somecondition;

postgresql is going to have to scan every single row returned by that
where clause. There's no shortcut, because the visibility rules of
MVCC mean you have to look at every tuple IN THE TABLE, not in the
index (it's the way postgresql is built, and it isn't likely to change
soon, because putting the visibility information in indexes is
expensive, and would result in VERY slow updates and very large
indexes).

So, the best optimization is to use a selective where clause.

If you run the query with a where clause of something like:

where processdate between '01 july 2005' and '07 july 2005'

then you should get better performance.
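
Put together, the restricted query might look like this; `processdate` comes from the example above, and the index name is hypothetical:

```sql
-- Only rows matched by the WHERE clause need visibility checks now.
SELECT status, count(*)
  FROM customer
 WHERE processdate BETWEEN '2005-07-01' AND '2005-07-07'
 GROUP BY status;

-- An index on the restricting column lets the scan skip most of the table.
CREATE INDEX customer_processdate_idx ON customer (processdate);
```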