MAX/MIN optimization via rewrite (plus query rewrites generally)

Started by Mark Kirkwoodover 21 years ago36 messageshackers
Jump to latest
#1Mark Kirkwood
mark.kirkwood@catalyst.net.nz

I am looking at implementing this TODO item. e.g. (max case):

rewrite
SELECT max(foo) FROM bar
as
SELECT foo FROM bar ORDER BY foo DESC LIMIT 1
if there is an index on bar(foo)

Suggestions about the most suitable point in the parser/planner stage to
perform this sort of rewrite would be most welcome! (as this would be my
first non trivial getting of hands dirty in the code).

My initial thoughts revolved around extending the existing RULE system
to be able to handle more general types of rewrite - like conditionals
in SELECT rules and rewrites that change elements of the query other
than the target relation.

Planning for future note: I would like whatever mechanism that is added
for this MAX/MIN stuff to be amenable to more subtle things like
aggregate navigation (see R.Kimball's article
http://www.dbmsmag.com/9608d54.html).

regards

Mark

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Kirkwood (#1)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

Mark Kirkwood <markir@coretech.co.nz> writes:

I am looking at implementing this TODO item. e.g. (max case):

My initial thoughts revolved around extending the existing RULE system
to be able to handle more general types of rewrite - like conditionals
in SELECT rules and rewrites that change elements of the query other
than the target relation.

The rule rewriter is almost certainly the wrong place, because it has
only the most superficial understanding of a query's semantics.
Doing this processing there would require re-inventing (or at least
duplicating the execution of) a lot of the planner's query analysis
work.

My thoughts would run towards doing this after the prepqual and
prepjointree steps (probably somewhere in grouping_planner). Even there
is a bit early since you'd have to duplicate plancat.c's extraction of
information about related indexes; but possibly it'd be reasonable to
move the add_base_rels_to_query() call out of query_planner and do it in
grouping_planner.

A more radical way of handling it would be to detect the relevance of an
indexscan in indxpath.c and generate a special kind of Path node; this
would not generalize to other sorts of things as you were hoping, but
I'm unconvinced that the mechanism is going to be very general-purpose
anyway. The major advantage is that this would work conveniently for
comparing the cost of a rewritten query to a non-rewritten one.

How are you planning to represent the association between MIN/MAX and
particular index orderings in the system catalogs?

regards, tom lane

#3Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#2)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

On Wed, Nov 10, 2004 at 07:18:59PM -0500, Tom Lane wrote:

A more radical way of handling it would be to detect the relevance of an
indexscan in indxpath.c and generate a special kind of Path node; this
would not generalize to other sorts of things as you were hoping, but
I'm unconvinced that the mechanism is going to be very general-purpose
anyway. The major advantage is that this would work conveniently for
comparing the cost of a rewritten query to a non-rewritten one.

What about having a new column in pg_aggregate which would point to a
function that would try to optimize the aggregate's handling?

There could be multiple calls to that function along the query's way to
executor, each one at a different point (with a parameter specifying
which one it is), that would try to rewrite the query optimizing the
aggregate. So we could optimize some aggregates at one point, and
others at a different point, whichever makes the most sense.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Hay dos momentos en la vida de un hombre en los que no deber�a
especular: cuando puede permit�rselo y cuando no puede" (Mark Twain)

#4Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Tom Lane (#2)
Re: MAX/MIN optimization via rewrite (plus query rewrites

Tom Lane wrote:

A more radical way of handling it would be to detect the relevance of an
indexscan in indxpath.c and generate a special kind of Path node; this
would not generalize to other sorts of things as you were hoping, but
I'm unconvinced that the mechanism is going to be very general-purpose
anyway. The major advantage is that this would work conveniently for
comparing the cost of a rewritten query to a non-rewritten one.

I like this point - it makes sense to check that the rewritten query is
less costly to execute than the original!

How are you planning to represent the association between MIN/MAX and
particular index orderings in the system catalogs?

That is the next item to think on, we could have a rewrite catalog that
holds possible transformations for certain functions (certain aggregates
at this stage I guess). This is a bit like Alvaro's idea - however it
may be better to represent it the way he suggested!

regards

Mark

#5Bruno Wolff III
bruno@wolff.to
In reply to: Alvaro Herrera (#3)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

On Wed, Nov 10, 2004 at 22:21:31 -0300,
Alvaro Herrera <alvherre@dcc.uchile.cl> wrote:

On Wed, Nov 10, 2004 at 07:18:59PM -0500, Tom Lane wrote:

A more radical way of handling it would be to detect the relevance of an
indexscan in indxpath.c and generate a special kind of Path node; this
would not generalize to other sorts of things as you were hoping, but
I'm unconvinced that the mechanism is going to be very general-purpose
anyway. The major advantage is that this would work conveniently for
comparing the cost of a rewritten query to a non-rewritten one.

What about having a new column in pg_aggregate which would point to a
function that would try to optimize the aggregate's handling?

I think you want to store an operator class and a direction. This allows
you to figure out what indexes might be usable. This could be used
on all of the max and min aggregates and the boolean and and or aggregates.

#6Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Mark Kirkwood (#1)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

On Thu, Nov 11, 2004 at 11:48:49AM +1300, Mark Kirkwood wrote:

I am looking at implementing this TODO item. e.g. (max case):

rewrite
SELECT max(foo) FROM bar
as
SELECT foo FROM bar ORDER BY foo DESC LIMIT 1
if there is an index on bar(foo)

Out of curiosity, will you be doing this in such a way that

SELECT min(foo), max(foo) FROM bar

will end up as

SELECT (SELECT foo FROM bar ORDER BY foo ASC LIMIT 1), (SELECT ... DESC
LIMIT 1)

?
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#7Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Jim Nasby (#6)
Re: MAX/MIN optimization via rewrite (plus query rewrites

Your example and ones like :

SELECT max(foo), count(foo) FROM bar
SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b

have made me realize that the scope of "what should be optimized" is
somewhat subtle.

I am inclined to keep it simple (i.e rather limited) for a first cut,
and if that works well, then look at extending to more complex rewrites.

What do you think?

Jim C. Nasby wrote:

Show quoted text

On Thu, Nov 11, 2004 at 11:48:49AM +1300, Mark Kirkwood wrote:

I am looking at implementing this TODO item. e.g. (max case):

rewrite
SELECT max(foo) FROM bar
as
SELECT foo FROM bar ORDER BY foo DESC LIMIT 1
if there is an index on bar(foo)

Out of curiosity, will you be doing this in such a way that

SELECT min(foo), max(foo) FROM bar

will end up as

SELECT (SELECT foo FROM bar ORDER BY foo ASC LIMIT 1), (SELECT ... DESC
LIMIT 1)

?

#8Bruno Wolff III
bruno@wolff.to
In reply to: Mark Kirkwood (#7)
Re: MAX/MIN optimization via rewrite (plus query rewrites

On Thu, Nov 11, 2004 at 17:57:42 +1300,
Mark Kirkwood <markir@coretech.co.nz> wrote:

Your example and ones like :

SELECT max(foo), count(foo) FROM bar
SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b

have made me realize that the scope of "what should be optimized" is
somewhat subtle.

I am inclined to keep it simple (i.e rather limited) for a first cut,
and if that works well, then look at extending to more complex rewrites.

What do you think?

I don't think you should be rewriting queries as much as providing
alternate plans and letting the rest of the optimizer decided which
plan to use. If you just rewrite a query you might lock yourself into
using a poor plan.

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#3)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

What about having a new column in pg_aggregate which would point to a
function that would try to optimize the aggregate's handling?

I can't get very excited about this, because how would you make a
reasonably stable/narrow API for such a thing? The function as you
propose it would have to know everything about not only the planner's
data representations but the N specific places it would be called from.

The existing selectivity functions are bad enough on this score ...

regards, tom lane

#10John Hansen
john@geeknet.com.au
In reply to: Tom Lane (#9)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

Why not just change the function all together to 'select $1 from $2
order by $1 desc limit 1;'

Is there ANY situation where max(col) as it is, would be faster?

... John

#11Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Mark Kirkwood (#7)
Re: MAX/MIN optimization via rewrite (plus query rewrites

Certainly handling only one case is better than none. I just wanted to
bring up the multiple aggregate scenario. Also, consider that

SELECT min(a), max(a), min(b), max(c) FROM table

could be optimized as well (into 4 index scans, assuming a, b, and c all
had indexes).

I don't think any other aggregates are candidates for optimization right
now, though I guess I could be wrong.

On Thu, Nov 11, 2004 at 05:57:42PM +1300, Mark Kirkwood wrote:

Your example and ones like :

SELECT max(foo), count(foo) FROM bar
SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b

have made me realize that the scope of "what should be optimized" is
somewhat subtle.

I am inclined to keep it simple (i.e rather limited) for a first cut,
and if that works well, then look at extending to more complex rewrites.

What do you think?

Jim C. Nasby wrote:

On Thu, Nov 11, 2004 at 11:48:49AM +1300, Mark Kirkwood wrote:

I am looking at implementing this TODO item. e.g. (max case):

rewrite
SELECT max(foo) FROM bar
as
SELECT foo FROM bar ORDER BY foo DESC LIMIT 1
if there is an index on bar(foo)

Out of curiosity, will you be doing this in such a way that

SELECT min(foo), max(foo) FROM bar

will end up as

SELECT (SELECT foo FROM bar ORDER BY foo ASC LIMIT 1), (SELECT ... DESC
LIMIT 1)

?

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#12Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Bruno Wolff III (#8)
Re: MAX/MIN optimization via rewrite (plus query rewrites

There seems to be (as Tom indicated) a choice of approaches:

i) rewrite max/min querys and then plan 'em
ii) provide alternate plans based on presence of certain aggregate types
in the query

when I first examined this TODO item, I was really thinking about i),
but I suspect that ii) is probably the best approach.

regards

Mark

Bruno Wolff III wrote:

Show quoted text

On Thu, Nov 11, 2004 at 17:57:42 +1300,
Mark Kirkwood <markir@coretech.co.nz> wrote:

Your example and ones like :

SELECT max(foo), count(foo) FROM bar
SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b

have made me realize that the scope of "what should be optimized" is
somewhat subtle.

I am inclined to keep it simple (i.e rather limited) for a first cut,
and if that works well, then look at extending to more complex rewrites.

What do you think?

I don't think you should be rewriting queries as much as providing
alternate plans and letting the rest of the optimizer decided which
plan to use. If you just rewrite a query you might lock yourself into
using a poor plan.

#13Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: John Hansen (#10)
Re: MAX/MIN optimization via rewrite (plus query rewrites

Probably for a small table, where the machinery of reading the index,
followed by checking the table for non-visible tuples is more costly
than just scanning the table!

regards

Mark

John Hansen wrote:

Show quoted text

Why not just change the function all together to 'select $1 from $2
order by $1 desc limit 1;'

Is there ANY situation where max(col) as it is, would be faster?

... John

#14Bruce Momjian
bruce@momjian.us
In reply to: Bruno Wolff III (#8)
Re: MAX/MIN optimization via rewrite (plus query rewrites

Bruno Wolff III <bruno@wolff.to> writes:

I don't think you should be rewriting queries as much as providing
alternate plans and letting the rest of the optimizer decided which
plan to use. If you just rewrite a query you might lock yourself into
using a poor plan.

Moreover, none of these rewritten queries would work properly for

select min(foo) from tab group by bar

This should still be aware it can use an index on <bar,foo> and get much
better performance. Well it can't know, but it should be able to. All it
should really need to know is that min() only needs a particular subset of the
dataset -- namely the first record, as long as the records are provided in a
particular order.

Also, the same code ought to be able to handle

select first(foo) from tab group by bar

Which is exactly equivalent to the min() case except that no particular
ordering is required.

--
greg

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Mark Kirkwood (#1)
Re: MAX/MIN optimization via rewrite (plus query

On Wed, 2004-11-10 at 22:48, Mark Kirkwood wrote:

Planning for future note: I would like whatever mechanism that is added
for this MAX/MIN stuff to be amenable to more subtle things like
aggregate navigation (see R.Kimball's article
http://www.dbmsmag.com/9608d54.html).

With you on that one...

--
Best Regards, Simon Riggs

#16Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: John Hansen (#10)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

How are you planning to represent the association between MIN/MAX and
particular index orderings in the system catalogs?

Don't we already have that info to decide whether an index handles
an "ORDER BY" without a sort node ?

Andreas

#17Bruno Wolff III
bruno@wolff.to
In reply to: John Hansen (#10)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

On Thu, Nov 11, 2004 at 17:52:19 +1100,
John Hansen <john@geeknet.com.au> wrote:

Why not just change the function all together to 'select $1 from $2
order by $1 desc limit 1;'

Is there ANY situation where max(col) as it is, would be faster?

Yes. A couple I can think of are:
When count(col) is also being used.
When a GROUP BY is being used and there isn't an index that can both be used
to do the grouping and col order within each group.

#18Bruno Wolff III
bruno@wolff.to
In reply to: Jim Nasby (#11)
Re: MAX/MIN optimization via rewrite (plus query rewrites

On Thu, Nov 11, 2004 at 01:18:05 -0600,
"Jim C. Nasby" <decibel@decibel.org> wrote:

Certainly handling only one case is better than none. I just wanted to
bring up the multiple aggregate scenario. Also, consider that

SELECT min(a), max(a), min(b), max(c) FROM table

could be optimized as well (into 4 index scans, assuming a, b, and c all
had indexes).

I don't think any other aggregates are candidates for optimization right
now, though I guess I could be wrong.

Remember that max and min are a number of aggregates, as each datatype
which have max and min functions have different ones from those used
by other datatypes.
I think someone added boolean aggregates for and and or in version 8.
If so, those can also use indexes in the same way.

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#16)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:

How are you planning to represent the association between MIN/MAX and
particular index orderings in the system catalogs?

Don't we already have that info to decide whether an index handles
an "ORDER BY" without a sort node ?

We know how to determine that an index matches an ORDER BY clause.
But what has an aggregate called MAX() got to do with ORDER BY? Magic
assumptions about operators named "<" are not acceptable answers; there
has to be a traceable connection in the catalogs.

As a real-world example of why I won't hold still for hard-wiring this:
a complex-number data type might have btree opclasses allowing it to be
sorted either by real part or by absolute value. One might then define
max_real() and max_abs() aggregates on the type. It should be possible
to optimize such aggregates the same way as any other max() aggregate.

regards, tom lane

#20Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#9)
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)

On Thu, Nov 11, 2004 at 01:08:39AM -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

What about having a new column in pg_aggregate which would point to a
function that would try to optimize the aggregate's handling?

I can't get very excited about this, because how would you make a
reasonably stable/narrow API for such a thing? The function as you
propose it would have to know everything about not only the planner's
data representations but the N specific places it would be called from.

No, the function would discard all calls except the one it knows how
to optimize. The point in having multiple call places is that some
aggregates will likely be optimized in some place, and others somewhere
else. Most likely, a first patch would include only the call site that
would help in optimizing min() and max().

Of course, an aggregate could specify no optimizing function, which
would be the current situation, where no aggregate knows how to optimize
itself.

Re: knowing internal representation, I think this is required anyway;
else the optimization would only work on a very limited numer of
situations.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"A wizard is never late, Frodo Baggins, nor is he early.
He arrives precisely when he means to." (Gandalf, en LoTR FoTR)

#21Bruno Wolff III
bruno@wolff.to
In reply to: Simon Riggs (#15)
#22Bruno Wolff III
bruno@wolff.to
In reply to: Tom Lane (#19)
#23Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#19)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruno Wolff III (#22)
#25Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#15)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#23)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#20)
#28Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#26)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#28)
#30Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#29)
#31Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruno Wolff III (#17)
#32Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruno Wolff III (#21)
#33Jan Wieck
JanWieck@Yahoo.com
In reply to: Mark Kirkwood (#7)
#34Bruce Momjian
bruce@momjian.us
In reply to: Jan Wieck (#33)
#35Dawid Kuroczko
qnex42@gmail.com
In reply to: Bruce Momjian (#34)
#36Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Mark Kirkwood (#1)