Yet Another COUNT(*)...WHERE...question
I'm grappling with a lot of reporting code for our app that relies on
queries such as:
SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
And I still do not find, from the discussions on this thread, any
truly viable solution for this. The one suggestion is to have a
separate counts table, which is fine for total aggregates related to,
say, an ID. E.g., a table with:
trader_id, trade_count
But this is an overall count for the trader (in my example). What if I
need a count of all his trades in the last week? Then I need a
timestamp condition in there as well. The number of such possibilities
for multiple WHERE conditions is infinite... how should we account for
all these avenues?
Would love to hear experiences of others and what compromises they
have made. From a reporting perspective, waiting for 10 minutes for a
simple count to return seems untenable.
TIA!
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
I'm grappling with a lot of reporting code for our app that relies on
queries such as: SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
...
The number of such possibilities for multiple WHERE conditions is
infinite...
Depends on the "conditions" bit. You can't solve all of the infinite
possibilities -- well you can, just run the query above -- but if you want to
do better it's all about understanding your data.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On 15/08/07, Gregory Stark <stark@enterprisedb.com> wrote:
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
I'm grappling with a lot of reporting code for our app that relies on
queries such as: SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
...
The number of such possibilities for multiple WHERE conditions is
infinite...

Depends on the "conditions" bit. You can't solve all of the infinite
possibilities -- well you can, just run the query above -- but if you
want to do better it's all about understanding your data.
I am not sure what the advice here is. The WHERE condition comes from
the indices. So if the query was not "COUNT(*)" but just a couple of
columns, the query executes in less than a second. Just that COUNT(*)
becomes horribly slow. And since the file system based query caching
feature of PG is unclear to me (I am just moving from MySQL where the
cache is quite powerful) I don't quite know what to do to speed up
these queries!
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
I'm grappling with a lot of reporting code for our app that relies on
queries such as: SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
And I still do not find, from the discussions on this thread, any
truly viable solution for this. The one suggestion is to have a
separate counts table, which is fine for total aggregates related to,
say, an ID. E.g., a table with: trader_id, trade_count
But this is an overall count for the trader (in my example). What if I
need a count of all his trades in the last one week. Then I need a
timestamp condition in there as well. The number of such possibilities
for multiple WHERE conditions is infinite...how should we account for
all these avenues?

Would love to hear experiences of others and what compromises they
have made. From a reporting perspective, waiting for 10 minutes for a
simple count to return seems untenable.
Generally, for these kinds of things it's often best to use
materialized views / rollup tables so that you aren't re-aggregating
the same data over and over.
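To make the rollup-table idea concrete, here is a minimal sketch for the trader example. The trades table and its trade_time/trader_id columns are assumptions for illustration, not anything from the original posts:

```sql
-- Hypothetical rollup table: per-trader, per-day trade counts.
CREATE TABLE trade_counts_daily (
    trader_id   integer NOT NULL,
    trade_day   date    NOT NULL,
    trade_count bigint  NOT NULL DEFAULT 0,
    PRIMARY KEY (trader_id, trade_day)
);

-- Refresh periodically (e.g. nightly) instead of re-aggregating per query:
INSERT INTO trade_counts_daily (trader_id, trade_day, trade_count)
SELECT trader_id, trade_time::date, count(*)
FROM trades
GROUP BY trader_id, trade_time::date;

-- "Trades in the last week" then becomes a cheap sum over ~7 rows:
SELECT sum(trade_count)
FROM trade_counts_daily
WHERE trader_id = 42
  AND trade_day >= current_date - 7;
```

The point is that any condition you can express as a sum over pre-aggregated buckets (days, weeks, categories) no longer needs to touch the detail table at all.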
Phoenix Kiula wrote:
SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
I am not sure what the advice here is. The WHERE condition comes from
the indices. So if the query was not "COUNT(*)" but just a couple of
columns, the query executes in less than a second. Just that COUNT(*)
becomes horribly slow.
The count(*) shouldn't slow things down compared to running the query to
fetch columns. It should be at least as fast, or faster if the columns
you fetch are large.
1. Do you have an example?
2. You're not running a query to get the columns, then a separate
count(*) to get a rowcount are you?
And since the file system based query caching
feature of PG is unclear to me (I am just moving from MySQL where the
cache is quite powerful) I don't quite know what to do to speed up
these queries!
There isn't a "file system based query caching" feature, there's your
operating-systems file-cache and PG's buffers. Neither of which cache
query-results, but cache disk pages instead.
--
Richard Huxton
Archonet Ltd
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
On 15/08/07, Gregory Stark <stark@enterprisedb.com> wrote:
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
I'm grappling with a lot of reporting code for our app that relies on
queries such as: SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
...
The number of such possibilities for multiple WHERE conditions is
infinite...

Depends on the "conditions" bit. You can't solve all of the infinite
possibilities -- well you can, just run the query above -- but if you
want to do better it's all about understanding your data.

I am not sure what the advice here is. The WHERE condition comes from
the indices. So if the query was not "COUNT(*)" but just a couple of
columns, the query executes in less than a second. Just that COUNT(*)
becomes horribly slow.
Sorry, but I don't believe you. If you're running a query on the
same dataset that returns in < 1 second, then a count(*) with the
same where clause will run in < 1 second. I haven't seen pgsql do
anything else.
And since the file system based query caching
feature of PG is unclear to me
There is no "query caching" in pgsql. There is data caching. Each
query has to get planned and executed, though (unless prepared, in
which case it is just executed).
(I am just moving from MySQL where the
cache is quite powerful)
As long as nothing is changing behind the query and invalidating the
query cache, that is. It is useful for reporting apps, but in a
constantly updating db it is pretty much useless.
I don't quite know what to do to speed up
these queries!
Post them with explain analyze output, i.e.:

explain analyze yourqueryhere

Cut and paste the query and the output, as well as the table schema.
--- Scott Marlowe <scott.marlowe@gmail.com> wrote:
Generally, for these kinds of things it's often best to use
materialized views / rollup tables so that you aren't re-aggregating
the same data over and over.
I don't know if this was already mentioned, but here is one of the links that
describes the method of implementing a materialized view:
http://www.jonathangardner.net/PostgreSQL/materialized_views/matviews.html
other useful docs like this one can be found here:
http://www.postgresql.org/docs/techdocs.2
Regards,
Richard Broersma Jr.
On 15/08/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
On 15/08/07, Gregory Stark <stark@enterprisedb.com> wrote:
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
I'm grappling with a lot of reporting code for our app that relies on
queries such as: SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
...
The number of such possibilities for multiple WHERE conditions is
infinite...

Depends on the "conditions" bit. You can't solve all of the infinite
possibilities -- well you can, just run the query above -- but if you
want to do better it's all about understanding your data.

I am not sure what the advice here is. The WHERE condition comes from
the indices. So if the query was not "COUNT(*)" but just a couple of
columns, the query executes in less than a second. Just that COUNT(*)
becomes horribly slow.

Sorry, but I don't believe you. If you're doing a count(*) on the
same dataset that returns in < 1 second, then the count(*) with the
same where clause will run in < 1 second. I haven't seen pgsql do
anything else.
Sorry I was not clear. Imagine an Amazon.com search results page. It
has about 15 results on Page 1, then it shows "Page 1 of 190".
To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
Page 1. However, to calculate the total number of pages, they probably
do a separate counts query, because doing a "select *" and then
counting the number of rows returned would be even more inefficient
than a count(*).
So, in reporting, two queries are fairly common I would think, unless
I am missing something?
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
On 15/08/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
...

Sorry I was not clear. Imagine an Amazon.com search results page. It
has about 15 results on Page 1, then it shows "Page 1 of 190".

To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
Page 1. However, to calculate the total number of pages, they probably
do a separate counts query, because doing a "select *" and then
counting the number of rows returned would be even more inefficient
than a count(*).
When I go to amazon.com I only ever get three pages of results. ever.
Because they know that returning 190 pages is not that useful, as
hardly anyone is going to wander through that many pages.
Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
blacksmith" i.e. it's guesstimating as well. no reason for google to
look at every single row for blacksmith to know that there's about 5.6
million.
So, in reporting, two queries are fairly common I would think, unless
I am missing something?
Yes, optimization. :) You don't need an exact count to tell someone
that there's more data and they can go to it. Note that if you are
planning on doing things google sized, you'll need to do what they
did, invent your own specialized database.
For us mere mortals, it's quite likely that you can do something like:
explain select * from table where field like 'abc%';
and then parse the explain output for an approximate number.
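One way to automate that parsing is a small PL/pgSQL helper that runs EXPLAIN and pulls the planner's rows= figure out of the first plan line. This is a sketch, assuming the top-line estimate is close enough for a "Page 1 of about N" display:

```sql
-- Returns the planner's row estimate for an arbitrary query (approximate!).
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec  record;
    rows integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        -- EXPLAIN output has a single column named "QUERY PLAN";
        -- grab the first "rows=NNN" we see.
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql;

-- Usage:
-- SELECT count_estimate('SELECT * FROM trades WHERE t_id = ''kXjha''');
```

The accuracy is only as good as the table's statistics, so it inherits all the caveats about ANALYZE discussed elsewhere in this thread.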
--- Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
Sorry I was not clear. Imagine an Amazon.com search results page. It
has about 15 results on Page 1, then it shows "Page 1 of 190".
I don't think that amazon or google really need to give an accurate count in determining an
estimated number of pages...
Couldn't you determine the number of pages quickly from postgresql?

  [ row count estimate ] / [ number of rows you want per page ]

The estimated row count is updated every time you vacuum your tables, and
getting the estimate takes very little time.
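That estimate can be read straight from the system catalog; a sketch, with a placeholder table name:

```sql
-- Planner's row estimate from the catalog: cheap, but approximate
-- (refreshed by VACUUM / ANALYZE, not by every insert).
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'your_table';

-- Approximate page count at 15 rows per page:
SELECT ceil(reltuples / 15)::int AS approx_pages
FROM pg_class
WHERE relname = 'your_table';
```

Note this is a whole-table figure; it doesn't account for any WHERE clause.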
To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
Page 1.
The "LIMIT 15 OFFSET 1500" technique can be a performance killer, since
OFFSET still has to scan and discard all of the skipped rows.

It is better to use the last entry of each page in the query for the next
page, so you can write your query this way:
SELECT *
FROM your_table
WHERE item_nbr > [: last item on previous page :]
ORDER BY item_nbr
LIMIT 15;
This method was discussed on the list a couple of months ago.
Regards,
Richard Broersma Jr.
On Aug 15, 2007, at 9:36 AM, Phoenix Kiula wrote:
I'm grappling with a lot of reporting code for our app that relies on
queries such as: SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
And I still do not find, from the discussions on this thread, any
truly viable solution for this. The one suggestion is to have a
separate counts table, which is fine for total aggregates related to,
say, an ID. E.g., a table with: trader_id, trade_count
But this is an overall count for the trader (in my example). What if I
need a count of all his trades in the last one week. Then I need a
timestamp condition in there as well. The number of such possibilities
for multiple WHERE conditions is infinite...how should we account for
all these avenues?
There is no general solution. While theoretically the multiple WHERE
conditions are infinite, in reality they're limited to your actual use
cases, and the solutions are thereby dictated by those. Using a
separate cache table is a viable option in situations that call for
constantly up-to-date realtime values. Another common option is smart
usage of indexes, i.e., remember that you can index on the results of
a function applied to row values, as well as create partial indexes.
Another is table partitioning. Asking how to optimize
"SELECT COUNT(*) FROM TABLE WHERE ... (conditions)" is not a good
question, as the solution is dependent on those conditions. Pick your
most common conditions and optimize for those.
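For what those two index tricks look like in practice, here are hedged sketches; the trades table, its columns, and the 'open' status value are all assumptions for illustration:

```sql
-- Expression index: speeds up conditions on a function of a column,
-- e.g. counting trades per calendar day.
-- (Assumes trade_time is timestamp WITHOUT time zone; the cast to date
-- is immutable there, which CREATE INDEX requires.)
CREATE INDEX trades_day_idx ON trades ((trade_time::date));

-- Partial index: indexes only the rows a common condition cares about,
-- making counts over that subset much cheaper and the index much smaller.
CREATE INDEX trades_open_idx ON trades (trader_id) WHERE status = 'open';
```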
Also, in many cases for reporting apps, 10 minutes is not long at
all. If you have reports that you can't make happen faster, schedule
and automate them.
Erik Jones
Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)
Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
Yes, optimization. :) You don't need an exact count to tell someone
that there's more data and they can go to it.
In general, I agree. But my example of Amazon was only to illustrate
the point about two queries and why they may be needed. I seem to see
many more pages than you do, but in any case, Google and Amazon can
afford to be less precise.
Thanks for the suggestion of using EXPLAIN and parsing an
approximation, but when you need to show a trader how many trades he
has made, for instance, then approximation is not a possibility at
all. Especially not if the numbers sway so wildly --
FIRSTDB=# explain select * from trades where t_id = 'kXjha';
QUERY PLAN
-----------------------------------------------------------------------------------
Bitmap Heap Scan on trades (cost=15.77..1447.12 rows=374 width=224)
Recheck Cond: ((t_id)::text = 'kXjha'::text)
-> Bitmap Index Scan on trades_tid_date (cost=0.00..15.67 rows=374 width=0)
Index Cond: ((t_id)::text = 'kXjha'::text)
(4 rows)
FIRSTDB=# select count(*) from trades where t_id = 'kXjha';
count
-------
3891
(1 row)
Could I do something so that the EXPLAIN showed up with slightly more
close-to-accurate stats? The above query is just after a "vacuum
analyze"!
Much appreciate the suggestions.
I don't know how PostgreSQL does it, but MySQL has an SQL_CALC_FOUND_ROWS
extension which allows the query to also return how many rows would exist
without the LIMIT clause. Perhaps there is something similar for
PostgreSQL (check the LIMIT docs?)
- Andrew
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Scott Marlowe
Sent: Thursday, 16 August 2007 1:24 AM
To: Phoenix Kiula
Cc: Gregory Stark; Postgres General
Subject: Re: [GENERAL] Yet Another COUNT(*)...WHERE...question
...

For us mere mortals, it's quite likely that you can do something like:
explain select * from table where field like 'abc%';
and then parse the explain output for an approximate number.
In response to "Phoenix Kiula" <phoenix.kiula@gmail.com>:
Yes, optimization. :) You don't need an exact count to tell someone
that there's more data and they can go to it.

In general, I agree. But my example of Amazon was only to illustrate
the point about two queries and why they may be needed. I seem to see
many more pages than you do, but in any case, Google and Amazon can
afford to be less precise.

Thanks for the suggestion of using EXPLAIN and parsing an
approximation, but when you need to show a trader how many trades he
has made, for instance, then approximation is not a possibility at
all. Especially not if the numbers sway so wildly --

FIRSTDB=# explain select * from trades where t_id = 'kXjha';
QUERY PLAN
-----------------------------------------------------------------------------------
Bitmap Heap Scan on trades (cost=15.77..1447.12 rows=374 width=224)
Recheck Cond: ((t_id)::text = 'kXjha'::text)
-> Bitmap Index Scan on trades_tid_date (cost=0.00..15.67 rows=374 width=0)
Index Cond: ((t_id)::text = 'kXjha'::text)
(4 rows)

FIRSTDB=# select count(*) from trades where t_id = 'kXjha';
count
-------
3891
(1 row)

Could I do something so that the EXPLAIN showed up with slightly more
close-to-accurate stats? The above query is just after a "vacuum
analyze"!
In the above case, you could probably materialize the data with a trigger
that updates a counter in a separate table every time a new trade is
added. This will give you 100% accurate results with _very_ fast
response time.
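Such a trigger could look something like the following sketch. The trades table and t_id column are borrowed from the EXPLAIN output earlier in the thread; everything else is an assumption:

```sql
-- Counter table, one row per t_id.
CREATE TABLE trade_counts (
    t_id        varchar PRIMARY KEY,
    trade_count bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION bump_trade_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE trade_counts SET trade_count = trade_count + 1
            WHERE t_id = NEW.t_id;
        IF NOT FOUND THEN
            INSERT INTO trade_counts (t_id, trade_count) VALUES (NEW.t_id, 1);
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE trade_counts SET trade_count = trade_count - 1
            WHERE t_id = OLD.t_id;
    END IF;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trades_count_trg
    AFTER INSERT OR DELETE ON trades
    FOR EACH ROW EXECUTE PROCEDURE bump_trade_count();
```

Be aware of the trade-off: every insert for a given t_id serializes on that counter row, so this buys fast, exact reads at the cost of write concurrency.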
Part of the problem is that there's no one answer to your question, there
are multiple approaches to solving it, depending on the details of the
problem and the acceptable time/accuracy of the answers. Some basic
approaches:
1) Materialize the data. MySQL actually does this automatically for you
with MyISAM tables, which is why count(*) is so fast. But if you
absolutely need fast, accurate counts, you can build your own
triggers in PG. This is unlikely to be practical with all queries.
2) Estimate. The accuracy of estimates can vary wildly by query and
how often the database is analyzed, etc. For something like,
"show results 1 - 10 of about 50,000", estimates are great and fast,
but for other cases, not acceptable. The good news is you can get
a fast estimate from any query with no up-front work.
3) Accept that sometimes to get accurate answers it's going to take
time. Around here, we call it the "Orbitz" technique, because when
we discuss it, everyone thinks of the "please wait while I process
your query" page you get from orbitz.com. You'd be surprised how
willing your users are to wait, as long as they know they have to
wait.
4) Throw more hardware at it. If you absolutely _must_have_ super-
accurate results faster, then you may need to buy more RAM, faster
disks and faster CPUs to accomplish it.
5) Come up with something revolutionary that nobody's ever thought of
before. Good luck with this one.
Of course, all of these ideas are only practical if you've already
ensured that your system is properly tuned. Crappy values for
shared_buffers and other tuning will lead you to waste time trying
to redesign something that should work just fine, so verify all your
configuration first. You may be able to get more acceptable estimates
by increasing your statistics targets, for example.
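For example (a sketch; the table and column names come from the EXPLAIN output earlier in the thread, and the target value is arbitrary):

```sql
-- Collect a much larger statistics sample for a skewed column,
-- then re-analyze so the planner's row estimates tighten up.
ALTER TABLE trades ALTER COLUMN t_id SET STATISTICS 1000;
ANALYZE trades;
```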
--
Bill Moran
http://www.potentialtech.com
"Scott Marlowe" wrote:
When I go to amazon.com I only ever get three pages of results. ever.
Because they know that returning 190 pages is not that useful, as
hardly anyone is going to wander through that many pages.

Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
blacksmith" i.e. it's guesstimating as well. no reason for google to
look at every single row for blacksmith to know that there's about 5.6
million.
But if you go to eBay, they always give you an accurate count. Even if the no.
of items found is pretty large (example: <http://search.ebay.com/new>).
Rainer
On Thu, Aug 16, 2007 at 12:12:03PM +0200, Rainer Bauer wrote:
"Scott Marlowe" wrote:
When I go to amazon.com I only ever get three pages of results. ever.
Because they know that returning 190 pages is not that useful, as
hardly anyone is going to wander through that many pages.

Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
blacksmith" i.e. it's guesstimating as well. no reason for google to
look at every single row for blacksmith to know that there's about 5.6
million.

But if you go to eBay, they always give you an accurate count. Even if the no.
of items found is pretty large (example: <http://search.ebay.com/new>).
And I'd bet money that they're using a full text search of some kind to
get those results, which isn't remotely close to the same thing as a
generic SELECT count(*).
--
Decibel!, aka Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Decibel! wrote:
On Thu, Aug 16, 2007 at 12:12:03PM +0200, Rainer Bauer wrote:
"Scott Marlowe" wrote:
When I go to amazon.com I only ever get three pages of results. ...

But if you go to eBay, they always give you an accurate count. Even if the no.
of items found is pretty large (example: <http://search.ebay.com/new>).

And I'd bet money that they're using a full text search of some kind to
get those results, which isn't remotely close to the same thing as a
generic SELECT count(*).
Without text search (but with a category restriction):
<http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList>
I only wanted to show a counter-example for a big site which uses pagination
to display result sets and still reports accurate counts.
Anyway, what Phoenix is trying to say is that 2 queries are required: one to
get the total count and one to get the tuples for the current page. I reckon
it would help if the query returning the result set could also report the
total no. of tuples found. Something like
SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
Or is there a way to do that?
Rainer
"Rainer Bauer" <usenet@munnin.com> writes:
Anyway, what Phoenix is trying to say is that 2 queries are required: One to
get the total count and one to get the tuples for the current page. I reckon
it would help, if the query returning the result set could also report the
total no. of tuples found. Something like

SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>

Or is there a way to do that?
Well, anything like the above would just report <l> as the count.
The only way to do it in Postgres currently is to create a temporary table.
Then you can populate it once, then select the count from the temporary table
in one query and the required page from it in the second query.
But temporary tables in Postgres are not really designed for this. In
particular they count as DDL so you have to grant privileges to create tables
to the application and it has to create and delete entries in pg_class for
every use.
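A sketch of that temporary-table pattern, with assumed names (and Gregory's caveats about DDL privileges and pg_class churn still apply):

```sql
BEGIN;

-- Populate once; the table vanishes automatically at COMMIT.
CREATE TEMP TABLE page_results ON COMMIT DROP AS
    SELECT * FROM trades WHERE t_id = 'kXjha';

-- Query 1: exact total for "Page 1 of N".
SELECT count(*) FROM page_results;

-- Query 2: the current page, without re-running the expensive WHERE.
SELECT * FROM page_results
ORDER BY trade_time
LIMIT 15 OFFSET 0;

COMMIT;
```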
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On 8/16/07, Rainer Bauer <usenet@munnin.com> wrote:
But if you go to eBay, they always give you an accurate count. ...

And I'd bet money that they're using a full text search of some kind to
get those results, which isn't remotely close to the same thing as a
generic SELECT count(*).

Without text search (but with a category restriction):
<http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList>

I only wanted to show a counter-example for a big site which uses pagination
to display result sets and still reports accurate counts.
Categories are still finite: you can simply store a count for
each category. Again it's just a case of knowing your data and
queries; it's not trying to solve a general infinite-possibilities
situation.
For instance, the OP mentioned wanting to get data on a particular
trader for the last week. Maintain a summary table that keeps counts
of each trader for each week, and ID bounds for the actual data table.
When you need to query the last 4 weeks, sum(). When you need to
query the last 30 days, sum() 4 weeks + a query on the master table
bounded by timestamp and ID range for the 5th week from the summary
table.
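A hedged sketch of such a summary table (all names assumed for illustration):

```sql
-- One row per trader per week, with ID bounds into the detail table.
CREATE TABLE trader_week_counts (
    trader_id   integer NOT NULL,
    week_start  date    NOT NULL,
    trade_count bigint  NOT NULL,
    min_id      bigint  NOT NULL,   -- bounds for drilling into the detail table
    max_id      bigint  NOT NULL,
    PRIMARY KEY (trader_id, week_start)
);

-- "Last 4 whole weeks" becomes a cheap sum over four summary rows:
SELECT sum(trade_count)
FROM trader_week_counts
WHERE trader_id = 42
  AND week_start >= date_trunc('week', now())::date - 28;
```

For "last 30 days", sum the whole weeks from the summary and add one bounded count(*) on the detail table for the partial week, using the stored ID range to keep that scan small.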
I'm sure there are sites out there that provide precise counts quickly
for extremely complex queries on gigantic datasets, but all the common
stuff is about specifics, not arbitrary queries. There are also
systems other than SQL RDBMS that can be used to drive such reporting.
On Thu, Aug 16, 2007 at 01:09:32PM +0200, Rainer Bauer wrote:
Anyway, what Phoenix is trying to say is that 2 queries are required: One to
get the total count and one to get the tuples for the current page. I reckon
it would help, if the query returning the result set could also report the
total no. of tuples found. Something like
SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
Well, there is another possibility: use cursors.

BEGIN;
DECLARE cur CURSOR FOR <query>;
FETCH 30 FROM cur;        -- or however many you want now
MOVE FORWARD ALL IN cur;  -- the command tag reports how many rows were skipped

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.