distinct values and count over window function

Started by Rodrigo Rosenfeld Rosas · 3 messages · pgsql-general
#1Rodrigo Rosenfeld Rosas
rr.rosas@gmail.com

Hello, this is my first post to this particular list (I've also posted
to the performance list a while ago to report a bug).

I generate complex dynamic queries and paginate them using "LIMIT 50". I
also need to compute some statistics over the results, and I'd like to
know whether I can avoid extra queries, or how to do this without much
of a performance hit.

The statistics calculation that bothers me most is counting distinct
values. For example, suppose a query would return more than 50 distinct
values while the paginated query fetches only the first 50 rows.

When the user asks for statistics, the application should show, for
certain columns, the counts per distinct value, including values not
present in the current results page.

I noticed that there is a plan for PG 9.2 to include a query_to_json
function that would indeed help me get this done when used together
with a window function.
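
For the total row count, a window aggregate can be attached to the paginated query itself, since window functions are evaluated before LIMIT. A minimal sketch with hypothetical table and column names (note that per-value counts for values absent from the fetched page still require a separate aggregation, which is the harder part of the problem above):

```sql
-- total_rows counts all rows matching the WHERE clause, not just
-- the 50 returned, because the window is evaluated before LIMIT.
SELECT t.*,
       count(*) OVER () AS total_rows
FROM transactions t            -- hypothetical table
WHERE t.status = 'open'        -- hypothetical filter
ORDER BY t.created_at
LIMIT 50;
```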

Our application is currently running PG 9.1 (it was rolled back from 9.2
due to a performance bug that seems to be fixed now but not yet
released). Is there any way I could avoid running another query to get
this kind of statistics?

The queries can become really complex and take almost 5 s to complete in
some cases, which means that performing two queries with those same
conditions would take about 10 s. I could perform them at different
times, but then the user would have to wait 5 s just to get the
statistics calculation, which is what I'd like to avoid.

This application is a clustered web application, so I'd rather avoid
client-side pagination using cursors or the like and stick with OFFSET
and LIMIT.

Any help is appreciated!

Thanks,
Rodrigo.

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Rodrigo Rosenfeld Rosas (#1)
Re: distinct values and count over window function


Conceptually, you could execute and cache (via a temporary or permanent
table) the results of the unlimited query, then use that cache both to
calculate your statistics and to perform your subsequent LIMIT/OFFSET
queries.

In 9.1+ you have the ability to incorporate INSERT inside a WITH/CTE, so
you can perform multiple actions like this without resorting to a
function - though a function may have value as well.
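
A rough sketch of the data-modifying CTE idea (9.1+), with hypothetical table and column names: the expensive query runs once, its rows are cached in a table, and the same statement returns the first page.

```sql
-- cache_results is a hypothetical (temporary) table created beforehand
-- with the same column layout as the expensive query.
WITH cached AS (
    INSERT INTO cache_results
    SELECT id, category, amount    -- stands in for the expensive dynamic query
    FROM big_table
    WHERE amount > 100
    RETURNING *
)
SELECT * FROM cached
ORDER BY id
LIMIT 50;

-- statistics can then come from the cache instead of re-running the query:
SELECT category, count(*) FROM cache_results GROUP BY category;
```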

David J.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Rodrigo Rosenfeld Rosas
rr.rosas@gmail.com
In reply to: David G. Johnston (#2)
Re: distinct values and count over window function

On 27-11-2012 19:50, David Johnston wrote:


Conceptually, you could execute and cache (via a temporary or permanent
table) the results of the unlimited query, then use that cache both to
calculate your statistics and to perform your subsequent LIMIT/OFFSET
queries.

In 9.1+ you have the ability to incorporate INSERT inside a WITH/CTE, so
you can perform multiple actions like this without resorting to a
function - though a function may have value as well.

David J.

Indeed David, thanks! Temporary tables seem like a good enough
solution. I tested with a complex query here, which takes about 300 ms
on my dev machine.

I was implementing pagination by issuing two similar queries for the
first page, one of them being a count(*) query. Together they took about
500 ms, while they complete in around 350 ms if I use a temporary table
within a transaction using "CREATE TEMP TABLE partial_results ON COMMIT DROP AS".
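
The transactional pattern above might look like this (the inner query and column names are hypothetical stand-ins); the temp table lives only for the duration of the transaction and is dropped automatically at COMMIT:

```sql
BEGIN;

CREATE TEMP TABLE partial_results ON COMMIT DROP AS
SELECT id, category, amount    -- stands in for the complex dynamic query
FROM big_table
WHERE amount > 100;

-- statistics over the full (unlimited) result set:
SELECT category, count(*) FROM partial_results GROUP BY category;

-- the requested page:
SELECT * FROM partial_results ORDER BY id LIMIT 50 OFFSET 0;

COMMIT;  -- partial_results is dropped here
```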

I'd love to store those partial results for a longer time, but this
would complicate things a lot. I'd need a way to ensure each query uses
a different name for its results table, and I'd also have to implement a
system that drops such tables once the memory used passes some caching
threshold or some timeout has elapsed.
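
Purely as an illustration of the cleanup side, with hypothetical names: each per-query cache table could be recorded in a registry with a timestamp, and a periodic job could drop the stale ones.

```sql
-- hypothetical registry of per-query cache tables
CREATE TABLE IF NOT EXISTS cache_registry (
    table_name text PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- periodic cleanup: drop caches older than 10 minutes
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN
        DELETE FROM cache_registry
        WHERE created_at < now() - interval '10 minutes'
        RETURNING table_name
    LOOP
        EXECUTE format('DROP TABLE IF EXISTS %I', r.table_name);
    END LOOP;
END $$;
```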

Are there any explicit caching mechanisms like the one described above
built into PostgreSQL?

Anyway, I won't have time to look into this until Monday at the
earliest, so I'll use the temp table strategy described above for now,
due to a tight deadline.

Thanks a lot!

Best,
Rodrigo.
