Select Last n Rows Matching an Index Condition (and caches)

Started by Alex Stapletonabout 21 years ago9 messagesgeneral
Jump to latest
#1Alex Stapleton
alexs@advfn.com

We have a ~10million row table but are expecting it to get larger, possibly
by a factor of 10 or more. The rows are quite long and fixed length (just
over 500 bytes.)

We have an index of (symbol, source, date) on this table and doing queries
like this

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC
LIMIT 1000;

To get the latest 1000 rows for that symbol and source.

However this takes quite a while at the best of times, (1-10 seconds.) The
query without the order by and the limit tends to return about 70000 rows
which adds up to about 30MB of data. Once the pages are in the cache they
take around 100ms but this is to be expected. Unfortunately the initial
query required to cache it is unnacceptably long for web application like
ours.

My (not yet implemented) solution to this problem is to add a SEQUENCE and
index it so that by adding a WHERE id > [max_id]-1000 and ordering by time
DESC will reduce the I/O quite a lot. Am I right here? It would be nice if
there was a way to get PostgreSQL to try and precache the tables pages as
well, is there anyway I could achieve something like that? I have toyed with
creating a ramdisk to store a lookup table of sorts on (we only care about a
few columns initially) to speed this up a bit but its a right pain in the
arse to do by the looks of things.

#2Alban Hertroys
alban@magproductions.nl
In reply to: Alex Stapleton (#1)
Re: Select Last n Rows Matching an Index Condition (and

Alex Stapleton wrote:

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC
LIMIT 1000;

To get the latest 1000 rows for that symbol and source.

My (not yet implemented) solution to this problem is to add a SEQUENCE and
index it so that by adding a WHERE id > [max_id]-1000 and ordering by time
DESC will reduce the I/O quite a lot. Am I right here? It would be nice if
there was a way to get PostgreSQL to try and precache the tables pages as
well, is there anyway I could achieve something like that? I have toyed with
creating a ramdisk to store a lookup table of sorts on (we only care about a
few columns initially) to speed this up a bit but its a right pain in the
arse to do by the looks of things.

First question that always gets asked here: What's the output of explain
analyse? Without that, people here can't see where the slowdown is.

I expect though, that the problem is the ordering by time. I imagine
that you could create an index on time, maybe truncated to months or
something similar (You can create indices based on functions). That
index alone should speed up the ordering already.
It could also be used to cluster the table, which should speed up things
some more, I suppose.

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

#3Alex Stapleton
alexs@advfn.com
In reply to: Alban Hertroys (#2)
Re: Select Last n Rows Matching an Index Condition (and caches)

Woops sorry we have indexes on (symbol, source, time) and there is no date
column :/

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC
LIMIT 1000;
QUERY PLAN
----------------------------------------------------------------------------
----------------------------------------
Limit (cost=321163.29..321165.79 rows=1000 width=466) (actual
time=33243.924..33246.021 rows=1000 loops=1)
-> Sort (cost=321163.29..321402.31 rows=95609 width=466) (actual
time=33243.917..33244.626 rows=1000 loops=1)
Sort Key: "time"
-> Seq Scan on article (cost=0.00..301724.00 rows=95609
width=466) (actual time=0.022..32979.685 rows=42959 loops=1)
Filter: ((symbol = 12646) AND (source = 19))
Total runtime: 33258.706 ms
(6 rows)

explain analyze SELECT * FROM article WHERE symbol=12646 AND source = 19;
QUERY PLAN
----------------------------------------------------------------------------
----------------------------------------
Seq Scan on article (cost=0.00..301724.00 rows=95609 width=466) (actual
time=0.021..33275.433 rows=42959 loops=1)
Filter: ((symbol = 12646) AND (source = 19))
Total runtime: 33320.920 ms
(3 rows)

We can't use CLUSTER because we need the DB up all the time.

The analyze suggests that it's the I/O taking most of the time to me.

-----Original Message-----
From: Alban Hertroys [mailto:alban@magproductions.nl]
Sent: 18 March 2005 09:48
To: Alex Stapleton
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Select Last n Rows Matching an Index Condition
(and caches)

Alex Stapleton wrote:

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time

DESC

LIMIT 1000;

To get the latest 1000 rows for that symbol and source.

My (not yet implemented) solution to this problem is to add a SEQUENCE and
index it so that by adding a WHERE id > [max_id]-1000 and ordering by time
DESC will reduce the I/O quite a lot. Am I right here? It would be nice if
there was a way to get PostgreSQL to try and precache the tables pages as
well, is there anyway I could achieve something like that? I have toyed

with

creating a ramdisk to store a lookup table of sorts on (we only care about

a

few columns initially) to speed this up a bit but its a right pain in the
arse to do by the looks of things.

First question that always gets asked here: What's the output of explain
analyse? Without that, people here can't see where the slowdown is.

I expect though, that the problem is the ordering by time. I imagine
that you could create an index on time, maybe truncated to months or
something similar (You can create indices based on functions). That
index alone should speed up the ordering already.
It could also be used to cluster the table, which should speed up things
some more, I suppose.

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

#4Martijn van Oosterhout
kleptog@svana.org
In reply to: Alex Stapleton (#1)
Re: Select Last n Rows Matching an Index Condition (and caches)

On Fri, Mar 18, 2005 at 09:29:06AM -0000, Alex Stapleton wrote:

We have a ~10million row table but are expecting it to get larger, possibly
by a factor of 10 or more. The rows are quite long and fixed length (just
over 500 bytes.)

We have an index of (symbol, source, date) on this table and doing queries
like this

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC
LIMIT 1000;

To get the latest 1000 rows for that symbol and source.

However this takes quite a while at the best of times, (1-10 seconds.) The
query without the order by and the limit tends to return about 70000 rows
which adds up to about 30MB of data. Once the pages are in the cache they
take around 100ms but this is to be expected. Unfortunately the initial
query required to cache it is unnacceptably long for web application like
ours.

I think the normal approach for this is an index on
(symbol,source,time). You may need to change the query to:

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY
symbol desc, source DESC, time DESC LIMIT 1000;

The EXPLAIN ANALYZE output would also be very helpful...

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#5Alex Stapleton
alexs@advfn.com
In reply to: Martijn van Oosterhout (#4)
Re: Select Last n Rows Matching an Index Condition (and caches)

If I SET ENABLE_SEQSCAN TO OFF it uses the Index, but it does an Index scan
backward if I do

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY
symbol desc, source DESC, time DESC LIMIT 1000;

Which is better but still quite slow.

-----Original Message-----
From: Hegyvari Krisztian [mailto:hegyvari.krisztian@ardents.hu]
Sent: 18 March 2005 10:25
To: Alex Stapleton; pgsql-general@postgresql.org
Subject: RE: [GENERAL] Select Last n Rows Matching an Index Condition
(and caches)

Does not it look like the index you are actually using is on article and
then PG has to filter for symbol and source?

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Alex Stapleton
Sent: Friday, March 18, 2005 10:57 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Select Last n Rows Matching an Index Condition
(and caches)

Woops sorry we have indexes on (symbol, source, time) and there is no
date
column :/

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time
DESC
LIMIT 1000;
QUERY PLAN
------------------------------------------------------------------------
----
----------------------------------------
Limit (cost=321163.29..321165.79 rows=1000 width=466) (actual
time=33243.924..33246.021 rows=1000 loops=1)
-> Sort (cost=321163.29..321402.31 rows=95609 width=466) (actual
time=33243.917..33244.626 rows=1000 loops=1)
Sort Key: "time"
-> Seq Scan on article (cost=0.00..301724.00 rows=95609
width=466) (actual time=0.022..32979.685 rows=42959 loops=1)
Filter: ((symbol = 12646) AND (source = 19))
Total runtime: 33258.706 ms
(6 rows)

explain analyze SELECT * FROM article WHERE symbol=12646 AND source =
19;
QUERY PLAN
------------------------------------------------------------------------
----
----------------------------------------
Seq Scan on article (cost=0.00..301724.00 rows=95609 width=466)
(actual
time=0.021..33275.433 rows=42959 loops=1)
Filter: ((symbol = 12646) AND (source = 19))
Total runtime: 33320.920 ms
(3 rows)

We can't use CLUSTER because we need the DB up all the time.

The analyze suggests that it's the I/O taking most of the time to me.

-----Original Message-----
From: Alban Hertroys [mailto:alban@magproductions.nl]
Sent: 18 March 2005 09:48
To: Alex Stapleton
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Select Last n Rows Matching an Index Condition
(and caches)

Alex Stapleton wrote:

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time

DESC

LIMIT 1000;

To get the latest 1000 rows for that symbol and source.

My (not yet implemented) solution to this problem is to add a SEQUENCE

and

index it so that by adding a WHERE id > [max_id]-1000 and ordering by

time

DESC will reduce the I/O quite a lot. Am I right here? It would be

nice if

there was a way to get PostgreSQL to try and precache the tables pages

as

well, is there anyway I could achieve something like that? I have

toyed
with

creating a ramdisk to store a lookup table of sorts on (we only care

about
a

few columns initially) to speed this up a bit but its a right pain in

the

arse to do by the looks of things.

First question that always gets asked here: What's the output of explain
analyse? Without that, people here can't see where the slowdown is.

I expect though, that the problem is the ordering by time. I imagine
that you could create an index on time, maybe truncated to months or
something similar (You can create indices based on functions). That
index alone should speed up the ordering already.
It could also be used to cluster the table, which should speed up things
some more, I suppose.

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#6Hegyvari Krisztian
Hegyvari.Krisztian@ardents.hu
In reply to: Alex Stapleton (#5)
Re: Select Last n Rows Matching an Index Condition (and caches)

Does not it look like the index you are actually using is on article and
then PG has to filter for symbol and source?

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Alex Stapleton
Sent: Friday, March 18, 2005 10:57 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Select Last n Rows Matching an Index Condition
(and caches)

Woops sorry we have indexes on (symbol, source, time) and there is no
date
column :/

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time
DESC
LIMIT 1000;
QUERY PLAN
------------------------------------------------------------------------
----
----------------------------------------
Limit (cost=321163.29..321165.79 rows=1000 width=466) (actual
time=33243.924..33246.021 rows=1000 loops=1)
-> Sort (cost=321163.29..321402.31 rows=95609 width=466) (actual
time=33243.917..33244.626 rows=1000 loops=1)
Sort Key: "time"
-> Seq Scan on article (cost=0.00..301724.00 rows=95609
width=466) (actual time=0.022..32979.685 rows=42959 loops=1)
Filter: ((symbol = 12646) AND (source = 19))
Total runtime: 33258.706 ms
(6 rows)

explain analyze SELECT * FROM article WHERE symbol=12646 AND source =
19;
QUERY PLAN
------------------------------------------------------------------------
----
----------------------------------------
Seq Scan on article (cost=0.00..301724.00 rows=95609 width=466)
(actual
time=0.021..33275.433 rows=42959 loops=1)
Filter: ((symbol = 12646) AND (source = 19))
Total runtime: 33320.920 ms
(3 rows)

We can't use CLUSTER because we need the DB up all the time.

The analyze suggests that it's the I/O taking most of the time to me.

-----Original Message-----
From: Alban Hertroys [mailto:alban@magproductions.nl]
Sent: 18 March 2005 09:48
To: Alex Stapleton
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Select Last n Rows Matching an Index Condition
(and caches)

Alex Stapleton wrote:

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time

DESC

LIMIT 1000;

To get the latest 1000 rows for that symbol and source.

My (not yet implemented) solution to this problem is to add a SEQUENCE

and

index it so that by adding a WHERE id > [max_id]-1000 and ordering by

time

DESC will reduce the I/O quite a lot. Am I right here? It would be

nice if

there was a way to get PostgreSQL to try and precache the tables pages

as

well, is there anyway I could achieve something like that? I have

toyed
with

creating a ramdisk to store a lookup table of sorts on (we only care

about
a

few columns initially) to speed this up a bit but its a right pain in

the

arse to do by the looks of things.

First question that always gets asked here: What's the output of explain
analyse? Without that, people here can't see where the slowdown is.

I expect though, that the problem is the ordering by time. I imagine
that you could create an index on time, maybe truncated to months or
something similar (You can create indices based on functions). That
index alone should speed up the ordering already.
It could also be used to cluster the table, which should speed up things
some more, I suppose.

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#7Alex Stapleton
alexs@advfn.com
In reply to: Alex Stapleton (#1)
Re: Select Last n Rows Matching an Index Condition (and caches)

I should point out that theres no garuntee our data is inserted in anything
like the order we want (time desc) but there is a high correlation. Most of
the time it is almost in order.

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Alex Stapleton
Sent: 18 March 2005 09:29
To: pgsql-general@postgresql.org
Subject: [GENERAL] Select Last n Rows Matching an Index Condition (and
caches)

We have a ~10million row table but are expecting it to get larger, possibly
by a factor of 10 or more. The rows are quite long and fixed length (just
over 500 bytes.)

We have an index of (symbol, source, date) on this table and doing queries
like this

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC
LIMIT 1000;

To get the latest 1000 rows for that symbol and source.

However this takes quite a while at the best of times, (1-10 seconds.) The
query without the order by and the limit tends to return about 70000 rows
which adds up to about 30MB of data. Once the pages are in the cache they
take around 100ms but this is to be expected. Unfortunately the initial
query required to cache it is unnacceptably long for web application like
ours.

My (not yet implemented) solution to this problem is to add a SEQUENCE and
index it so that by adding a WHERE id > [max_id]-1000 and ordering by time
DESC will reduce the I/O quite a lot. Am I right here? It would be nice if
there was a way to get PostgreSQL to try and precache the tables pages as
well, is there anyway I could achieve something like that? I have toyed with
creating a ramdisk to store a lookup table of sorts on (we only care about a
few columns initially) to speed this up a bit but its a right pain in the
arse to do by the looks of things.

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

#8Martijn van Oosterhout
kleptog@svana.org
In reply to: Alex Stapleton (#5)
Re: Select Last n Rows Matching an Index Condition (and caches)

On Fri, Mar 18, 2005 at 10:23:14AM -0000, Alex Stapleton wrote:

If I SET ENABLE_SEQSCAN TO OFF it uses the Index, but it does an Index scan
backward if I do

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY
symbol desc, source DESC, time DESC LIMIT 1000;

Which is better but still quite slow.

What's the EXPLAIN ANALYZE of this query? Have you done VACUUM FULL
and/or ANALYZE recently?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#9Thomas F.O'Connell
tfo@sitening.com
In reply to: Alex Stapleton (#3)
Re: Select Last n Rows Matching an Index Condition (and caches)

You might unknowingly be asking a FAQ. See the end of this section:

http://www.postgresql.org/docs/faqs.FAQ.html#4.6

Depending on the version of postgres you're running and the data types
of the symbol and source columns, you might need to quote (or cast)
your constant data for symbol and source, e.g.:

SELECT *
FROM article
WHERE symbol='12646'
AND source = '19'
ORDER BY time DESC
LIMIT 1000;

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Mar 18, 2005, at 3:56 AM, Alex Stapleton wrote:

Show quoted text

Woops sorry we have indexes on (symbol, source, time) and there is no
date
column :/

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time
DESC
LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------
-----
----------------------------------------
Limit (cost=321163.29..321165.79 rows=1000 width=466) (actual
time=33243.924..33246.021 rows=1000 loops=1)
-> Sort (cost=321163.29..321402.31 rows=95609 width=466) (actual
time=33243.917..33244.626 rows=1000 loops=1)
Sort Key: "time"
-> Seq Scan on article (cost=0.00..301724.00 rows=95609
width=466) (actual time=0.022..32979.685 rows=42959 loops=1)
Filter: ((symbol = 12646) AND (source = 19))
Total runtime: 33258.706 ms
(6 rows)

explain analyze SELECT * FROM article WHERE symbol=12646 AND source =
19;
QUERY PLAN
-----------------------------------------------------------------------
-----
----------------------------------------
Seq Scan on article (cost=0.00..301724.00 rows=95609 width=466)
(actual
time=0.021..33275.433 rows=42959 loops=1)
Filter: ((symbol = 12646) AND (source = 19))
Total runtime: 33320.920 ms
(3 rows)

We can't use CLUSTER because we need the DB up all the time.

The analyze suggests that it's the I/O taking most of the time to me.

-----Original Message-----
From: Alban Hertroys [mailto:alban@magproductions.nl]
Sent: 18 March 2005 09:48
To: Alex Stapleton
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Select Last n Rows Matching an Index Condition
(and caches)

Alex Stapleton wrote:

SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time

DESC

LIMIT 1000;

To get the latest 1000 rows for that symbol and source.

My (not yet implemented) solution to this problem is to add a
SEQUENCE and
index it so that by adding a WHERE id > [max_id]-1000 and ordering by
time
DESC will reduce the I/O quite a lot. Am I right here? It would be
nice if
there was a way to get PostgreSQL to try and precache the tables
pages as
well, is there anyway I could achieve something like that? I have
toyed

with

creating a ramdisk to store a lookup table of sorts on (we only care
about

a

few columns initially) to speed this up a bit but its a right pain in
the
arse to do by the looks of things.

First question that always gets asked here: What's the output of
explain
analyse? Without that, people here can't see where the slowdown is.

I expect though, that the problem is the ordering by time. I imagine
that you could create an index on time, maybe truncated to months or
something similar (You can create indices based on functions). That
index alone should speed up the ordering already.
It could also be used to cluster the table, which should speed up
things
some more, I suppose.

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

---------------------------(end of
broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly