Top-k optimizations?
Folks,
As this came up in a work situation, I was wondering a little bit
about the top-k issue. Right now, top-k is implemented (most easily,
I think) via a SELECT with a LIMIT and no OFFSET. 3 questions arise
from this.
1. Are there currently any optimizations specific to top-k in
PostgreSQL? If so, what are they?
2. What kinds of top-k optimizations *can't* be part of PostgreSQL
(things that would break MVCC, e.g.)?
3. What kinds of top-k optimizations might (eventually) be included
in PostgreSQL?
Hoping this stimulates some friendly & informative discussion... :)
Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100 mobile: +1 415 235 3778
Remember to vote!
On Thu, 13 Jan 2005, David Fetter wrote:
3. What kinds of top-k optimizations might (eventually) be included
in PostgreSQL?
See the TODO item:
Allow ORDER BY ... LIMIT 1 to select high/low value without sort or index
using a sequential scan for highest/lowest values
If only one value is needed, there is no need to sort the entire table.
Instead a sequential scan could get the matching value.
There is some discussion of this on the -general list here:
Kris Jurka
David Fetter wrote:
Folks,
As this came up in a work situation, I was wondering a little bit
about the top-k issue. Right now, top-k is implemented (most easily,
I think) via a SELECT with a LIMIT and no OFFSET. 3 questions arise
from this.
I think the simplest LIMIT query doesn't make it easy to show ties;
but if you don't want the equivalent of MSSQL's "with ties" clause
I think LIMIT works well.
1. Are there currently any optimizations specific to top-k in
PostgreSQL? If so, what are they?
Well, when I do queries like
"select * from customers order by dollarsspent desc limit 3"
it happily uses an index on dollarsspent.
3. What kinds of top-k optimizations might (eventually) be included
in PostgreSQL?
I think a slightly related topic is whether syntactically it'd be
nice if postgresql had the SQL 2003 optional olap features to
specify ways of doing top-k queries as described here:
http://troels.arvin.dk/db/rdbms/#select-top-n
SELECT * FROM (
SELECT
RANK() OVER (ORDER BY age ASC) AS ranking,
person_id, person_name, age
FROM person
) AS foo
WHERE ranking <= 3
Seems IBM and Oracle support that syntax or something very similar.
Yeah, I know it's probably an orthogonal question to the
optimizations one, but might make porting nicer.