BUG #2658: Query not using index

Started by Graham Davisover 19 years ago25 messagesbugs
Jump to latest
#1Graham Davis
gdavis@refractions.net

The following bug has been logged online:

Bug reference: 2658
Logged by: Graham Davis
Email address: gdavis@refractions.net
PostgreSQL version: 8.1.4
Operating system: Linux
Description: Query not using index
Details:

I know that in version 8 you guys added support so that aggregate functions
can take advantage of indexes. However, I have a simple query that is not
taking advantage of an index where I believe it should.

I have a large table full of GPS positions. I want to query the table for
the most recent location of each asset (an asset is essentially a vehicle).
The ts column is the timestamp, so I am using this to figure out the most
recent position. I use the following query to do it:

SELECT assetid, max(ts) AS ts
FROM asset_positions
GROUP BY assetid;

I have an index on (ts), another index on (assetid) and a multikey index on
(assetid, ts). I know the assetid index is pointless since the multikey one
takes its place, but I put it there while testing just to make sure. The
ANALYZE EXPLAIN for this query is:

QUERY PLAN
----------------------------------------------------------------------------
-------------------------------------------------------------
HashAggregate (cost=125423.96..125424.21 rows=20 width=12) (actual
time=39693.995..39694.036 rows=20 loops=1)
-> Seq Scan on asset_positions (cost=0.00..116654.64 rows=1753864
width=12) (actual time=20002.362..34724.896 rows=1738693 loops=1)
Total runtime: 39694.245 ms
(3 rows)

You can see it is doing a sequential scan on the table when it should be
using the (assetid, ts) index, or at the very least the (ts) index. This
query takes about 40 seconds to complete with a table of 1.7 million rows.
I tested running the query without the group by as follows:

SELECT max(ts) AS ts
FROM asset_positions;

This query DOES use the (ts) index and takes less than 1 ms to complete. So
I'm not sure why my initial query is not using one of the indexes. I have
to use the GROUP BY in my query so that I get the max ts of EACH asset.

I've tried restructuring my query so that it will use an index, but nothing
seems to work. I tried this syntax for example:

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions
ORDER BY assetid, ts DESC;

It still does a sequential scan and takes 40+ seconds to complete. If I am
missing something here, please let me know, but I believe this is a bug that
needs addressing. If it is not a bug (and there just simply isn't support
for this with multikey indexes yet), please let me know so I can either try
restructuring the coding I am working on, or move on for now. The
documentation does not mention anything about this, but I know from reading
a list of changes in version 8 that this sort of support was added for
aggregate functions. If you need more information, please let me know,
thanks in advance.

#2Bruno Wolff III
bruno@wolff.to
In reply to: Graham Davis (#1)
Re: BUG #2658: Query not using index

This shouldn't have been submitted to the bugs list, as it isn't a bug.
The best spot for this kind of question is the performance list so I am
copying it there and redirecting followups there.

On Wed, Sep 27, 2006 at 20:56:32 +0000,
Graham Davis <gdavis@refractions.net> wrote:

SELECT assetid, max(ts) AS ts
FROM asset_positions
GROUP BY assetid;

I have an index on (ts), another index on (assetid) and a multikey index on
(assetid, ts). I know the assetid index is pointless since the multikey one
takes its place, but I put it there while testing just to make sure. The
ANALYZE EXPLAIN for this query is:

QUERY PLAN
----------------------------------------------------------------------------
-------------------------------------------------------------
HashAggregate (cost=125423.96..125424.21 rows=20 width=12) (actual
time=39693.995..39694.036 rows=20 loops=1)
-> Seq Scan on asset_positions (cost=0.00..116654.64 rows=1753864
width=12) (actual time=20002.362..34724.896 rows=1738693 loops=1)
Total runtime: 39694.245 ms
(3 rows)

You can see it is doing a sequential scan on the table when it should be
using the (assetid, ts) index, or at the very least the (ts) index. This
query takes about 40 seconds to complete with a table of 1.7 million rows.
I tested running the query without the group by as follows:

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions
ORDER BY assetid, ts DESC;

This is almost what you want to do to get an alternative plan. But you
need to ORDER BY assetid DESC, ts DESC to make use of the multicolumn
index. If you really need the other output order, reverse it in your
application or use the above as a subselect in another query that orders
by assetid ASC.

#3Graham Davis
gdavis@refractions.net
In reply to: Bruno Wolff III (#2)
Re: BUG #2658: Query not using index

Hi,

Adding DESC to both columns in the SORT BY did not make the query use
the multikey index. So both

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions
ORDER BY assetid, ts DESC;

and

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions
ORDER BY assetid DESC, ts DESC;

use the same query plans and both do sequential scans without using either the (assetid, ts) or (ts) indexes. Any other ideas on how to make this query use an index? Thanks,

--
Graham Davis
Refractions Research Inc.
gdavis@refractions.net

Show quoted text

On Wed, Sep 27, 2006 at 20:56:32 +0000,
Graham Davis <gdavis@refractions.net> wrote:

SELECT assetid, max(ts) AS ts
FROM asset_positions
GROUP BY assetid;

I have an index on (ts), another index on (assetid) and a multikey index on
(assetid, ts). I know the assetid index is pointless since the multikey one
takes its place, but I put it there while testing just to make sure. The
ANALYZE EXPLAIN for this query is:

QUERY PLAN
----------------------------------------------------------------------------
-------------------------------------------------------------
HashAggregate (cost=125423.96..125424.21 rows=20 width=12) (actual
time=39693.995..39694.036 rows=20 loops=1)
-> Seq Scan on asset_positions (cost=0.00..116654.64 rows=1753864
width=12) (actual time=20002.362..34724.896 rows=1738693 loops=1)
Total runtime: 39694.245 ms
(3 rows)

You can see it is doing a sequential scan on the table when it should be
using the (assetid, ts) index, or at the very least the (ts) index. This
query takes about 40 seconds to complete with a table of 1.7 million rows.
I tested running the query without the group by as follows:

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions
ORDER BY assetid, ts DESC;

This is almost what you want to do to get an alternative plan. But you
need to ORDER BY assetid DESC, ts DESC to make use of the multicolumn
index. If you really need the other output order, reverse it in your
application or use the above as a subselect in another query that orders
by assetid ASC.

#4Chris Browne
cbbrowne@acm.org
In reply to: Graham Davis (#1)
Re: BUG #2658: Query not using index

gdavis@refractions.net (Graham Davis) writes:

Adding DESC to both columns in the SORT BY did not make the query use
the multikey index. So both

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions ORDER BY assetid, ts DESC;

and

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions ORDER BY assetid DESC, ts DESC;

use the same query plans and both do sequential scans without using
either the (assetid, ts) or (ts) indexes. Any other ideas on how to
make this query use an index? Thanks,

Why do you want to worsen performance by forcing the use of an index?

You are reading through the entire table, after all, and doing so via
a sequential scan is normally the fastest way to do that. An index
scan would only be more efficient if you don't have enough space in
memory to store all assetid values.
--
(reverse (concatenate 'string "gro.mca" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/emacs.html
Expect the unexpected.
-- The Hitchhiker's Guide to the Galaxy, page 7023

#5Graham Davis
gdavis@refractions.net
In reply to: Chris Browne (#4)
Re: BUG #2658: Query not using index

The asset_positions table has about 1.7 million rows, and this query
takes over 40 seconds to do a sequential scan. Initially I was trying
to get the original query:

SELECT assetid, max(ts) AS ts
FROM asset_positions
GROUP BY assetid;

to use the multikey index since I read that PostgreSQL 8 added support
for aggregates to use indexes. However, the GROUP BY was causing the query
plan to not use any index (removing the GROUP by allowed the query to
use the ts index and it took only 50 ms to run). Since I need the query
to find the max time
for EACH asset, I can't just drop the GROUP BY from my query. So I was
trying some alternate ways of writing the query (as described in the
below email) to
force the use of one of these indexes.

40 seconds is much too slow for this query to run and I'm assuming that
the use of an index will make it much faster (as seen when I removed the
GROUP BY clause). Any tips?

Graham.

Chris Browne wrote:

gdavis@refractions.net (Graham Davis) writes:

Adding DESC to both columns in the SORT BY did not make the query use
the multikey index. So both

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions ORDER BY assetid, ts DESC;

and

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions ORDER BY assetid DESC, ts DESC;

use the same query plans and both do sequential scans without using
either the (assetid, ts) or (ts) indexes. Any other ideas on how to
make this query use an index? Thanks,

Why do you want to worsen performance by forcing the use of an index?

You are reading through the entire table, after all, and doing so via
a sequential scan is normally the fastest way to do that. An index
scan would only be more efficient if you don't have enough space in
memory to store all assetid values.

--
Graham Davis
Refractions Research Inc.
gdavis@refractions.net

#6Graham Davis
gdavis@refractions.net
In reply to: Chris Browne (#4)
Re: BUG #2658: Query not using index

Also, the multikey index of (assetid, ts) would already be sorted and
that is why using such an index in this case is
faster than doing a sequential scan that does the sorting afterwards.

Graham.

Chris Browne wrote:

gdavis@refractions.net (Graham Davis) writes:

Adding DESC to both columns in the SORT BY did not make the query use
the multikey index. So both

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions ORDER BY assetid, ts DESC;

and

SELECT DISTINCT ON (assetid) assetid, ts
FROM asset_positions ORDER BY assetid DESC, ts DESC;

use the same query plans and both do sequential scans without using
either the (assetid, ts) or (ts) indexes. Any other ideas on how to
make this query use an index? Thanks,

Why do you want to worsen performance by forcing the use of an index?

You are reading through the entire table, after all, and doing so via
a sequential scan is normally the fastest way to do that. An index
scan would only be more efficient if you don't have enough space in
memory to store all assetid values.

--
Graham Davis
Refractions Research Inc.
gdavis@refractions.net

#7Chris Browne
cbbrowne@acm.org
In reply to: Graham Davis (#1)
Re: BUG #2658: Query not using index

gdavis@refractions.net (Graham Davis) writes:

40 seconds is much too slow for this query to run and I'm assuming
that the use of an index will make it much faster (as seen when I
removed the GROUP BY clause). Any tips?

Assumptions are dangerous things.

An aggregate like this has *got to* scan the entire table, and given
that that is the case, an index scan is NOT optimal; a seq scan is.

An index scan is just going to be slower.
--
let name="cbbrowne" and tld="linuxdatabases.info" in String.concat "@" [name;tld];;
http://cbbrowne.com/info/linux.html
"The computer is the ultimate polluter: its feces are
indistinguishable from the food it produces." -- Alan J. Perlis

#8Graham Davis
gdavis@refractions.net
In reply to: Chris Browne (#7)
Re: BUG #2658: Query not using index

How come an aggreate like that has to use a sequential scan? I know
that PostgreSQL use to have to do a sequential scan for all aggregates,
but there was support added to version 8 so that aggregates would take
advantage of indexes. This is why

SELECT max(ts) AS ts
FROM asset_positions;

Uses an index on the ts column and only takes 50 milliseconds. When I
added the group by it would not use a multikey index or any other
index. Is there just no support for aggregates to use multikey
indexes? Sorry to be so pushy, but I just want to make sure I
understand why the above query can use an index and the following can't:

SELECT assetid, max(ts) AS ts
FROM asset_positions
GROUP BY assetid;

--
Graham Davis
Refractions Research Inc.
gdavis@refractions.net

Chris Browne wrote:

Show quoted text

gdavis@refractions.net (Graham Davis) writes:

40 seconds is much too slow for this query to run and I'm assuming
that the use of an index will make it much faster (as seen when I
removed the GROUP BY clause). Any tips?

Assumptions are dangerous things.

An aggregate like this has *got to* scan the entire table, and given
that that is the case, an index scan is NOT optimal; a seq scan is.

An index scan is just going to be slower.

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Graham Davis (#8)
Re: BUG #2658: Query not using index

Graham Davis <gdavis@refractions.net> writes:

How come an aggreate like that has to use a sequential scan? I know
that PostgreSQL use to have to do a sequential scan for all aggregates,
but there was support added to version 8 so that aggregates would take
advantage of indexes.

Not in a GROUP BY context, only for the simple case. Per the comment in
planagg.c:

* We don't handle GROUP BY, because our current implementations of
* grouping require looking at all the rows anyway, and so there's not
* much point in optimizing MIN/MAX.

The problem is that using an index to obtain the maximum value of ts for
a given value of assetid is not the same thing as finding out what all
the distinct values of assetid are.

This could possibly be improved but it would take a considerable amount
more work. It's definitely not in the category of "bug fix".

regards, tom lane

#10Bruno Wolff III
bruno@wolff.to
In reply to: Graham Davis (#6)
Re: BUG #2658: Query not using index

On Tue, Oct 03, 2006 at 12:13:43 -0700,
Graham Davis <gdavis@refractions.net> wrote:

Also, the multikey index of (assetid, ts) would already be sorted and
that is why using such an index in this case is
faster than doing a sequential scan that does the sorting afterwards.

That isn't necessarily true. The sequentional scan and sort will need a lot
fewer disk seeks and could run faster than using an index scan that has
the disk drives doing seeks for every tuple (in the worst case, where
the on disk order of tuples doesn't match the order in the index).

If your server is caching most of the blocks than the index scan might
give better results. You might try disabling sequentional scans to
try to coerce the other plan and see what results you get. If it is
substantially faster the other way, then you might want to look at lowering
the random page cost factor. However, since this can affect other queries
you need to be careful that you don't speed up one query at the expense
of a lot of other queries.

#11Graham Davis
gdavis@refractions.net
In reply to: Tom Lane (#9)
Re: BUG #2658: Query not using index

Thanks Tom, that explains it and makes sense. I guess I will have to
accept this query taking 40 seconds, unless I can figure out another way
to write it so it can use indexes. If there are any more syntax
suggestions, please pass them on. Thanks for the help everyone.

Graham.

Tom Lane wrote:

Graham Davis <gdavis@refractions.net> writes:

How come an aggreate like that has to use a sequential scan? I know
that PostgreSQL use to have to do a sequential scan for all aggregates,
but there was support added to version 8 so that aggregates would take
advantage of indexes.

Not in a GROUP BY context, only for the simple case. Per the comment in
planagg.c:

* We don't handle GROUP BY, because our current implementations of
* grouping require looking at all the rows anyway, and so there's not
* much point in optimizing MIN/MAX.

The problem is that using an index to obtain the maximum value of ts for
a given value of assetid is not the same thing as finding out what all
the distinct values of assetid are.

This could possibly be improved but it would take a considerable amount
more work. It's definitely not in the category of "bug fix".

regards, tom lane

--
Graham Davis
Refractions Research Inc.
gdavis@refractions.net

#12Mark Lewis
mark.lewis@mir3.com
In reply to: Graham Davis (#11)
Re: BUG #2658: Query not using index

Have you looked into a materialized view sort of approach? You could
create a table which had assetid as a primary key, and max_ts as a
column. Then use triggers to keep that table up to date as rows are
added/updated/removed from the main table.

This approach would only make sense if there were far fewer distinct
assetid values than rows in the main table, and would get slow if you
commonly delete rows from the main table or decrease the value for ts in
the row with the highest ts for a given assetid.

-- Mark Lewis

Show quoted text

On Tue, 2006-10-03 at 13:52 -0700, Graham Davis wrote:

Thanks Tom, that explains it and makes sense. I guess I will have to
accept this query taking 40 seconds, unless I can figure out another way
to write it so it can use indexes. If there are any more syntax
suggestions, please pass them on. Thanks for the help everyone.

Graham.

Tom Lane wrote:

Graham Davis <gdavis@refractions.net> writes:

How come an aggreate like that has to use a sequential scan? I know
that PostgreSQL use to have to do a sequential scan for all aggregates,
but there was support added to version 8 so that aggregates would take
advantage of indexes.

Not in a GROUP BY context, only for the simple case. Per the comment in
planagg.c:

* We don't handle GROUP BY, because our current implementations of
* grouping require looking at all the rows anyway, and so there's not
* much point in optimizing MIN/MAX.

The problem is that using an index to obtain the maximum value of ts for
a given value of assetid is not the same thing as finding out what all
the distinct values of assetid are.

This could possibly be improved but it would take a considerable amount
more work. It's definitely not in the category of "bug fix".

regards, tom lane

#13Graham Davis
gdavis@refractions.net
In reply to: Mark Lewis (#12)
Re: BUG #2658: Query not using index

The "summary table" approach maintained by triggers is something we are
considering, but it becomes a bit more complicated to implement.
Currently we have groups of new positions coming in every few seconds or
less. They are not guaranteed to be in order. So for instance, a group
of positions from today could come in and be inserted, then a group of
positions that got lost from yesterday could come in and be inserted
afterwards.

This means the triggers would have to do some sort of logic to figure
out if the newly inserted position is actually the most recent by
timestamp. If positions are ever deleted or updated, the same sort of
query that is currently running slow will need to be executed in order
to get the new most recent position. So there is the possibility that
new positions can be inserted faster than the triggers can calculate
and maintain the summary table. There are some other complications
with maintaining such a summary table in our system too, but I won't get
into those.

Right now I'm just trying to see if I can get the query itself running
faster, which would be the easiest solution for now.

Graham.

Mark Lewis wrote:

Have you looked into a materialized view sort of approach? You could
create a table which had assetid as a primary key, and max_ts as a
column. Then use triggers to keep that table up to date as rows are
added/updated/removed from the main table.

This approach would only make sense if there were far fewer distinct
assetid values than rows in the main table, and would get slow if you
commonly delete rows from the main table or decrease the value for ts in
the row with the highest ts for a given assetid.

-- Mark Lewis

On Tue, 2006-10-03 at 13:52 -0700, Graham Davis wrote:

Thanks Tom, that explains it and makes sense. I guess I will have to
accept this query taking 40 seconds, unless I can figure out another way
to write it so it can use indexes. If there are any more syntax
suggestions, please pass them on. Thanks for the help everyone.

Graham.

Tom Lane wrote:

Graham Davis <gdavis@refractions.net> writes:

How come an aggreate like that has to use a sequential scan? I know
that PostgreSQL use to have to do a sequential scan for all aggregates,
but there was support added to version 8 so that aggregates would take
advantage of indexes.

Not in a GROUP BY context, only for the simple case. Per the comment in
planagg.c:

* We don't handle GROUP BY, because our current implementations of
* grouping require looking at all the rows anyway, and so there's not
* much point in optimizing MIN/MAX.

The problem is that using an index to obtain the maximum value of ts for
a given value of assetid is not the same thing as finding out what all
the distinct values of assetid are.

This could possibly be improved but it would take a considerable amount
more work. It's definitely not in the category of "bug fix".

regards, tom lane

--
Graham Davis
Refractions Research Inc.
gdavis@refractions.net

#14Mark Lewis
mark.lewis@mir3.com
In reply to: Graham Davis (#13)
Re: BUG #2658: Query not using index

Hmmm. How many distinct assetids are there?
-- Mark Lewis

Show quoted text

On Tue, 2006-10-03 at 14:23 -0700, Graham Davis wrote:

The "summary table" approach maintained by triggers is something we are
considering, but it becomes a bit more complicated to implement.
Currently we have groups of new positions coming in every few seconds or
less. They are not guaranteed to be in order. So for instance, a group
of positions from today could come in and be inserted, then a group of
positions that got lost from yesterday could come in and be inserted
afterwards.

This means the triggers would have to do some sort of logic to figure
out if the newly inserted position is actually the most recent by
timestamp. If positions are ever deleted or updated, the same sort of
query that is currently running slow will need to be executed in order
to get the new most recent position. So there is the possibility that
new positions can be inserted faster than the triggers can calculate
and maintain the summary table. There are some other complications
with maintaining such a summary table in our system too, but I won't get
into those.

Right now I'm just trying to see if I can get the query itself running
faster, which would be the easiest solution for now.

Graham.

Mark Lewis wrote:

Have you looked into a materialized view sort of approach? You could
create a table which had assetid as a primary key, and max_ts as a
column. Then use triggers to keep that table up to date as rows are
added/updated/removed from the main table.

This approach would only make sense if there were far fewer distinct
assetid values than rows in the main table, and would get slow if you
commonly delete rows from the main table or decrease the value for ts in
the row with the highest ts for a given assetid.

-- Mark Lewis

On Tue, 2006-10-03 at 13:52 -0700, Graham Davis wrote:

Thanks Tom, that explains it and makes sense. I guess I will have to
accept this query taking 40 seconds, unless I can figure out another way
to write it so it can use indexes. If there are any more syntax
suggestions, please pass them on. Thanks for the help everyone.

Graham.

Tom Lane wrote:

Graham Davis <gdavis@refractions.net> writes:

How come an aggreate like that has to use a sequential scan? I know
that PostgreSQL use to have to do a sequential scan for all aggregates,
but there was support added to version 8 so that aggregates would take
advantage of indexes.

Not in a GROUP BY context, only for the simple case. Per the comment in
planagg.c:

* We don't handle GROUP BY, because our current implementations of
* grouping require looking at all the rows anyway, and so there's not
* much point in optimizing MIN/MAX.

The problem is that using an index to obtain the maximum value of ts for
a given value of assetid is not the same thing as finding out what all
the distinct values of assetid are.

This could possibly be improved but it would take a considerable amount
more work. It's definitely not in the category of "bug fix".

regards, tom lane

#15Graham Davis
gdavis@refractions.net
In reply to: Mark Lewis (#14)
Re: BUG #2658: Query not using index

Not many. It fluctuates, but there are usually only ever a few hundred
at most. Each assetid has multi-millions of positions though.

Mark Lewis wrote:

Hmmm. How many distinct assetids are there?
-- Mark Lewis

On Tue, 2006-10-03 at 14:23 -0700, Graham Davis wrote:

The "summary table" approach maintained by triggers is something we are
considering, but it becomes a bit more complicated to implement.
Currently we have groups of new positions coming in every few seconds or
less. They are not guaranteed to be in order. So for instance, a group
of positions from today could come in and be inserted, then a group of
positions that got lost from yesterday could come in and be inserted
afterwards.

This means the triggers would have to do some sort of logic to figure
out if the newly inserted position is actually the most recent by
timestamp. If positions are ever deleted or updated, the same sort of
query that is currently running slow will need to be executed in order
to get the new most recent position. So there is the possibility that
new positions can be inserted faster than the triggers can calculate
and maintain the summary table. There are some other complications
with maintaining such a summary table in our system too, but I won't get
into those.

Right now I'm just trying to see if I can get the query itself running
faster, which would be the easiest solution for now.

Graham.

Mark Lewis wrote:

Have you looked into a materialized view sort of approach? You could
create a table which had assetid as a primary key, and max_ts as a
column. Then use triggers to keep that table up to date as rows are
added/updated/removed from the main table.

This approach would only make sense if there were far fewer distinct
assetid values than rows in the main table, and would get slow if you
commonly delete rows from the main table or decrease the value for ts in
the row with the highest ts for a given assetid.

-- Mark Lewis

On Tue, 2006-10-03 at 13:52 -0700, Graham Davis wrote:

Thanks Tom, that explains it and makes sense. I guess I will have to
accept this query taking 40 seconds, unless I can figure out another way
to write it so it can use indexes. If there are any more syntax
suggestions, please pass them on. Thanks for the help everyone.

Graham.

Tom Lane wrote:

Graham Davis <gdavis@refractions.net> writes:

How come an aggreate like that has to use a sequential scan? I know
that PostgreSQL use to have to do a sequential scan for all aggregates,
but there was support added to version 8 so that aggregates would take
advantage of indexes.

Not in a GROUP BY context, only for the simple case. Per the comment in
planagg.c:

* We don't handle GROUP BY, because our current implementations of
* grouping require looking at all the rows anyway, and so there's not
* much point in optimizing MIN/MAX.

The problem is that using an index to obtain the maximum value of ts for
a given value of assetid is not the same thing as finding out what all
the distinct values of assetid are.

This could possibly be improved but it would take a considerable amount
more work. It's definitely not in the category of "bug fix".

regards, tom lane

--
Graham Davis
Refractions Research Inc.
gdavis@refractions.net

#16Adnan DURSUN
a_dursun@hotmail.com
In reply to: Graham Davis (#1)
PostgreSQL Caching

Hi,

I wonder how PostgreSQL caches the SQL query results. For example ;

* does postgres cache query result in memory that done by session A
?
* does session B use these results ?

Best Regards

Adnan DURSUN

#17Mark Lewis
mark.lewis@mir3.com
In reply to: Graham Davis (#15)
Re: BUG #2658: Query not using index

A few hundred is quite a lot for the next proposal and it's kind of an
ugly one, but might as well throw the idea out since you never know.

Have you considered creating one partial index per assetid? Something
along the lines of "CREATE INDEX asset_index_N ON asset_positions(ts)
WHERE assetid=N"? I'd guess that the planner probably wouldn't be smart
enough to use the partial indexes unless you issued a separate query for
each assetid, but each one of those queries should be really fast. Of
course, this is all assuming that PG knows how to use partial indexes to
satisfy MAX queries; I'm not sure if it does.

-- Mark Lewis

Show quoted text

On Tue, 2006-10-03 at 14:35 -0700, Graham Davis wrote:

Not many. It fluctuates, but there are usually only ever a few hundred
at most. Each assetid has multi-millions of positions though.

Mark Lewis wrote:

Hmmm. How many distinct assetids are there?
-- Mark Lewis

On Tue, 2006-10-03 at 14:23 -0700, Graham Davis wrote:

The "summary table" approach maintained by triggers is something we are
considering, but it becomes a bit more complicated to implement.
Currently we have groups of new positions coming in every few seconds or
less. They are not guaranteed to be in order. So for instance, a group
of positions from today could come in and be inserted, then a group of
positions that got lost from yesterday could come in and be inserted
afterwards.

This means the triggers would have to do some sort of logic to figure
out if the newly inserted position is actually the most recent by
timestamp. If positions are ever deleted or updated, the same sort of
query that is currently running slow will need to be executed in order
to get the new most recent position. So there is the possibility that
new positions can be inserted faster than the triggers can calculate
and maintain the summary table. There are some other complications
with maintaining such a summary table in our system too, but I won't get
into those.

Right now I'm just trying to see if I can get the query itself running
faster, which would be the easiest solution for now.

Graham.

Mark Lewis wrote:

Have you looked into a materialized view sort of approach? You could
create a table which had assetid as a primary key, and max_ts as a
column. Then use triggers to keep that table up to date as rows are
added/updated/removed from the main table.

This approach would only make sense if there were far fewer distinct
assetid values than rows in the main table, and would get slow if you
commonly delete rows from the main table or decrease the value for ts in
the row with the highest ts for a given assetid.

-- Mark Lewis

On Tue, 2006-10-03 at 13:52 -0700, Graham Davis wrote:

Thanks Tom, that explains it and makes sense. I guess I will have to
accept this query taking 40 seconds, unless I can figure out another way
to write it so it can use indexes. If there are any more syntax
suggestions, please pass them on. Thanks for the help everyone.

Graham.

Tom Lane wrote:

Graham Davis <gdavis@refractions.net> writes:

How come an aggreate like that has to use a sequential scan? I know
that PostgreSQL use to have to do a sequential scan for all aggregates,
but there was support added to version 8 so that aggregates would take
advantage of indexes.

Not in a GROUP BY context, only for the simple case. Per the comment in
planagg.c:

* We don't handle GROUP BY, because our current implementations of
* grouping require looking at all the rows anyway, and so there's not
* much point in optimizing MIN/MAX.

The problem is that using an index to obtain the maximum value of ts for
a given value of assetid is not the same thing as finding out what all
the distinct values of assetid are.

This could possibly be improved but it would take a considerable amount
more work. It's definitely not in the category of "bug fix".

regards, tom lane

#18Tomeh, Husam
htomeh@firstam.com
In reply to: Adnan DURSUN (#16)
Re: PostgreSQL Caching

Like many descent RDBMS, Postgresql server allocates its own shared
memory area where data is cached in. When receiving a query request,
Postgres engine checks first its shared memory buffers, if not found,
the engine performs disk I/Os to retrieve data from PostgreSQL data
files and place it in the shared buffer area before serving it back to
the client. Blocks in the shared buffers are shared by other sessions
and can therefore be possibly accessed by other sessions. Postgresql
shared buffers can be allocated by setting the postgresql.conf parameter
namely, shared_buffers.

Sincerely,

--
Husam

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Adnan
DURSUN
Sent: Tuesday, October 03, 2006 2:49 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] PostgreSQL Caching

Hi,

I wonder how PostgreSQL caches the SQL query results. For example ;

* does postgres cache query result in memory that done by
session A
?
* does session B use these results ?

Best Regards

Adnan DURSUN

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
**********************************************************************
This message contains confidential information intended only for the use of the addressee(s) named above and may contain information that is legally privileged. If you are not the addressee, or the person responsible for delivering it to the addressee, you are hereby notified that reading, disseminating, distributing or copying this message is strictly prohibited. If you have received this message by mistake, please immediately notify us by replying to the message and delete the original message immediately thereafter.

Thank you.

FADLD Tag
**********************************************************************

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Lewis (#17)
Re: BUG #2658: Query not using index

Mark Lewis <mark.lewis@mir3.com> writes:

Have you considered creating one partial index per assetid? Something
along the lines of "CREATE INDEX asset_index_N ON asset_positions(ts)
WHERE assetid=N"? I'd guess that the planner probably wouldn't be smart
enough to use the partial indexes unless you issued a separate query for
each assetid, but each one of those queries should be really fast.

Actually, a single index on (assetid, ts) is sufficient to handle

select max(ts) from asset_positions where assetid = constant

The problem is to know what values of "constant" to issue the query for,
and this idea doesn't seem to help with that.

If Graham is willing to assume that the set of assetids changes slowly,
perhaps he could keep a summary table that contains all the valid
assetids (or maybe there already is such a table? is assetid a foreign
key?) and do

select pk.assetid,
(select max(ts) from asset_positions where assetid = pk.assetid)
from other_table pk;

I'm pretty sure the subselect would be planned the way he wants.

regards, tom lane

#20Adnan DURSUN
a_dursun@hotmail.com
In reply to: Tomeh, Husam (#18)
Re: PostgreSQL Caching

Thanks,

I wonder these ;

* When any session updates the data that allready in shared buffer,
does Postgres sychronize the data both disk and shared buffers area
immediately ?
* Does postgres cache SQL execution plan analyze results in memory
to use for other sessions ? For example ;
When session A execute "SELECT * FROM tab WHERE col1 = val1 AND col2
= val2", does postgres save the parser/optimizer result in memory in order
to use by other session to prevent duplicate execution of parser
and optimizer so therefore get time ?. Because an execution plan is created
before..

Sincenerly

Adnan DURSUN

----- Original Message -----
From: "Tomeh, Husam" <htomeh@firstam.com>
To: "Adnan DURSUN" <a_dursun@hotmail.com>;
<pgsql-performance@postgresql.org>
Sent: Wednesday, October 04, 2006 1:11 AM
Subject: Re: [PERFORM] PostgreSQL Caching

Like many descent RDBMS, Postgresql server allocates its own shared
memory area where data is cached in. When receiving a query request,
Postgres engine checks first its shared memory buffers, if not found,
the engine performs disk I/Os to retrieve data from PostgreSQL data
files and place it in the shared buffer area before serving it back to
the client. Blocks in the shared buffers are shared by other sessions
and can therefore be possibly accessed by other sessions. Postgresql
shared buffers can be allocated by setting the postgresql.conf parameter
namely, shared_buffers.

Sincerely,

--
Husam

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Adnan
DURSUN
Sent: Tuesday, October 03, 2006 2:49 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] PostgreSQL Caching

Hi,

I wonder how PostgreSQL caches the SQL query results. For example ;

* does postgres cache query result in memory that done by
session A
?
* does session B use these results ?

Best Regards

Adnan DURSUN

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
**********************************************************************
This message contains confidential information intended only for the use of
the addressee(s) named above and may contain information that is legally
privileged. If you are not the addressee, or the person responsible for
delivering it to the addressee, you are hereby notified that reading,
disseminating, distributing or copying this message is strictly prohibited.
If you have received this message by mistake, please immediately notify us
by replying to the message and delete the original message immediately
thereafter.

Thank you.

FADLD Tag
**********************************************************************

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#21Tomeh, Husam
htomeh@firstam.com
In reply to: Adnan DURSUN (#20)
#22Adnan DURSUN
a_dursun@hotmail.com
In reply to: Tomeh, Husam (#21)
#23Dave Dutcher
dave@tridecap.com
In reply to: Adnan DURSUN (#22)
#24Brad Nicholson
bnichols@ca.afilias.info
In reply to: Dave Dutcher (#23)
#25Brad Nicholson
bnichols@ca.afilias.info
In reply to: Tomeh, Husam (#21)