Optimisation deficiency: currval('seq') --> seq scan, constant --> index scan
It seems that the optimiser is unaware that currval('seq') can be treated as
a constant within an expression, and thus produces suboptimal plans for
WHERE clauses that use currval, using a seq scan instead of an index scan.
Is it possible (or planned) to mark functions as returning a constant when
given a constant argument, and to start using it _as a constant_
(pre-evaluated) in queries?
The current behaviour is not very intuitive.
------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
It seems that optimiser is unaware that currval('seq') can be treated
as a constant within an expression and thus produces suboptimal plans
for WHERE clauses that use currval thus using a seq scan instead of
index scan.
currval() does not qualify to be marked cachable, since it does not
always return the same result given the same arguments.
There are a few functions that are not cachable but could be treated
as constants within a single transaction, now() being the most obvious
example. Currently there is no intermediate function type between
"cachable" and "noncachable" but I have toyed with the idea of inventing
one. Getting the semantics right could be tricky however.
However, even if we had a concept of "constant within a transaction/
scan/whatever", currval() would not qualify --- what if there is a
nextval() being invoked somewhere else in the query, possibly inside a
user-defined function where the optimizer has no chance of seeing it?
In short, there is no way of optimizing currval() in the way you want
without risking breakage.
For interactive queries you could fake the behavior you want by creating
a user-defined function that just calls currval(), and then marking this
function cachable. Don't try calling such a function inside a SQL or
plpgsql function however, or you will be burnt by premature
constant-folding. Basically, this technique leaves it in your hands to
determine whether the optimization is safe.
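Tom's warning can be made concrete with a toy model (illustrative Python, not PostgreSQL internals; all names are made up): marking a wrapper "cachable" amounts to memoising it on its arguments, and a memoised wrapper around a volatile value goes stale the moment the underlying sequence advances.

```python
import functools

_seq = {"seq": 0}  # toy stand-in for a PostgreSQL sequence


def nextval(name):
    _seq[name] += 1
    return _seq[name]


def currval(name):
    return _seq[name]


@functools.lru_cache(maxsize=None)
def cached_currval(name):
    # What marking the wrapper "cachable" effectively does: the first
    # result for a given argument is reused forever.
    return currval(name)


nextval("seq")                 # sequence advances to 1
first = cached_currval("seq")  # result 1 is now frozen in the cache
nextval("seq")                 # sequence advances to 2
print(currval("seq"))          # 2 -- the honest answer
print(cached_currval("seq"))   # 1 -- stale: constant-folded too early
```

This is exactly the "premature constant-folding" burn: safe only when the caller knows no nextval intervenes between the freeze and the use.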
regards, tom lane
Hi!
On Sun, 20 Aug 2000, Hannu Krosing wrote:
It seems that optimiser is unaware that currval('seq') can be treated as
a constant within
an expression and thus produces suboptimal plans for WHERE clauses that
use currval
thus using a seq scan instead of index scan.
Is it possible (planned) to mark functions as returning a constant when
given a constant
argument and start using it _as a constant_ (pre-evaluated) in queries
Just one question regarding this:
Suppose you have
select ... where x in (select currval('seq')) and y in (select
nextval('seq'))....
What's the precise semantics of this? Should there be any precise
semantics? What's the order of execution - currval before or after
nextval? It seems to me that the declarative nature of SQL means that no
order whatsoever should be assumed...
In the case of uncorrelated subqueries, there is the option of
materializing the subqueries' results (which I think - after looking at
the code - pg does not use), as there is no need to recompute them. In
this case materializing vs re-executing seems to cause a semantic
difference, because with materialization there is only one execution of
nextval, while with re-execution nextval is executed an unknown number
of times.
If all this was pre-evaluated, this last problem would disappear.
Side-effects, side-effects, ...
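The semantic difference described above can be shown with a toy model (illustrative Python, not PostgreSQL code; the sequence stand-in is made up): materializing a subquery that calls nextval fires the sequence once, while re-executing it per outer row fires it once per row.

```python
# Toy stand-in for a sequence and its nextval().
seq = {"val": 0}


def nextval():
    seq["val"] += 1
    return seq["val"]


outer_rows = [1, 2, 3]

# Materialized: the subquery runs once, so nextval fires exactly once.
seq["val"] = 0
materialized = [nextval()]  # subquery result computed up front
hits_mat = [r for r in outer_rows if r in materialized]

# Re-executed: the subquery runs for every outer row, so nextval fires
# once per row and the comparison value keeps moving.
seq["val"] = 0
hits_reexec = [r for r in outer_rows if r in [nextval()]]

print(hits_mat)     # [1]       -- one evaluation of nextval
print(hits_reexec)  # [1, 2, 3] -- one evaluation per outer row
```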
Best regards,
Tiago
PS - I'm starting the thesis part of an MSc which will be about query
optimization in pg. Here the thesis part of the MSc takes around one
year, so at least for the next year I'll try to work hard on pg.
At 01:25 PM 8/20/00 -0400, Tom Lane wrote:
However, even if we had a concept of "constant within a transaction/
scan/whatever", currval() would not qualify --- what if there is a
nextval() being invoked somewhere else in the query, possibly inside a
user-defined function where the optimizer has no chance of seeing it?
In short, there is no way of optimizing currval() in the way you want
without risking breakage.
Does Postgres guarantee order of execution of functions? Many languages
don't other than to follow precedence rules, which is why functions with
side-effects (such as nextval) can yield "implementation defined" or
whatever results.
My point is that such queries may not yield predictable results. Perhaps
today due to left-right execution order or the like, but do you want to
guarantee a defined order of execution in the future?
- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.
Tiago Antão <tra@fct.unl.pt> writes:
PS - I'm starting the thesis part of an MSc which will be about query
optimization in pg. Here the thesis part of the MSc takes around one
year, so at least for the next year I'll try to work hard on pg.
Cool! Please keep us posted on what you're doing or thinking about
doing, so that there's not duplicate or wasted effort.
regards, tom lane
Don Baccus <dhogaza@pacifier.com> writes:
Does Postgres guarantee order of execution of functions?
No, and I don't recall having seen anything about it in the SQL spec
either. If you were doing something like
select foo, nextval('seq') from tab where bar < currval('seq')
then there's no issue of "order of evaluation" per se: nextval will be
evaluated at just those rows where the WHERE clause has already
succeeded. However, the results would still depend on the order in
which tuples are scanned, an order which is most definitely not
guaranteed by the spec nor by our implementation. (Also, in a
pipelined implementation it's conceivable that the WHERE clause would
get evaluated for additional tuples before nextval has been evaluated
at a matching tuple.)
However, that just shows that some patterns of usage of the function
will yield unpredictable results. I don't think that translates to an
argument that the optimizer is allowed to make semantics-altering
transformations...
regards, tom lane
Tom Lane wrote:
Don Baccus <dhogaza@pacifier.com> writes:
Does Postgres guarantee order of execution of functions?
No, and I don't recall having seen anything about it in the SQL spec
either. If you were doing something like
select foo, nextval('seq') from tab where bar < currval('seq')
then there's no issue of "order of evaluation" per se: nextval will be
evaluated at just those rows where the WHERE clause has already
succeeded. However, the results would still depend on the order in
which tuples are scanned, an order which is most definitely not
guaranteed by the spec nor by our implementation. (Also, in a
pipelined implementation it's conceivable that the WHERE clause would
get evaluated for additional tuples before nextval has been evaluated
at a matching tuple.)
However, that just shows that some patterns of usage of the function
will yield unpredictable results. I don't think that translates to an
argument that the optimizer is allowed to make semantics-altering
transformations...
IMHO, if semantics is undefined then altering it should be OK, no?
What I mean is that there is no safe use of nextval and currval in the
same SQL sentence, even if it is used automatically, as in "DEFAULT
NEXTVAL('S')",
and thus marking it as constant is as correct as not marking it, only
more predictable.
And predictability is GOOD ;)
I would even suggest that PG would warn about or even refuse to run queries
that have both nextval and currval of the same sequence inside them
(and pre-evaluate nextval) as only that case has _any_ predictability.
------------
Hannu
Don Baccus wrote:
At 02:18 PM 8/20/00 -0400, Tom Lane wrote:
However, that just shows that some patterns of usage of the function
will yield unpredictable results. I don't think that translates to an
argument that the optimizer is allowed to make semantics-altering
transformations...
Very much depends on the language spec; if such usage is "implementation
defined" you can do pretty much whatever you want. Aggressive optimizers
in traditional compilers take advantage of this.
In the case given, though, there's no particular reason why an application
can't grab "currval()" and then use the value returned in the subsequent
query.
On the other hand, heuristics like "if there's no nextval() in the
query, then currval() can be treated as a constant" are very common in
the traditional compiler world, too ...
And it seems to me that even if there are both nextval and currval we can
grab "currval()" and use the value returned.
It will fail if we have no preceding nextval inside the session, but so
will any other plan that tries to evaluate currval before nextval.
So I don't see that we would be violating the spec more by marking
currval as const than by not doing so.
And we do get faster queries, even for the weird queries with undefined
behaviour ;)
---------------
Hannu
At 02:18 PM 8/20/00 -0400, Tom Lane wrote:
However, that just shows that some patterns of usage of the function
will yield unpredictable results. I don't think that translates to an
argument that the optimizer is allowed to make semantics-altering
transformations...
Very much depends on the language spec; if such usage is "implementation
defined" you can do pretty much whatever you want. Aggressive optimizers
in traditional compilers take advantage of this.
In the case given, though, there's no particular reason why an application
can't grab "currval()" and then use the value returned in the subsequent
query.
On the other hand, heuristics like "if there's no nextval() in the
query, then currval() can be treated as a constant" are very common in
the traditional compiler world, too ...
- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.
Hi!
On Sun, 20 Aug 2000, Tom Lane wrote:
Cool! Please keep us posted on what you're doing or thinking about
doing, so that there's not duplicate or wasted effort.
I'm starting to look at pg code, so some comments that follow can be
completely dumb or useless :-).
One thing that might be interesting (please tell me if you think
otherwise) would be to improve pg with better statistical information, by
using, for example, histograms. With these, better estimations could
probably be done. I think I'd like to do this (I've not looked at the code
thoroughly at this stage, so it could really be a bad idea...)...
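As a rough sketch of what histogram statistics buy the planner (illustrative Python, not pg code; both function names are made up): an equi-depth histogram of a skewed column gives a much better range-selectivity estimate than a single min/max pair.

```python
def build_histogram(values, nbuckets):
    """Return nbuckets+1 boundaries so each bucket holds ~equal row counts."""
    vals = sorted(values)
    step = len(vals) / nbuckets
    return [vals[min(int(i * step), len(vals) - 1)]
            for i in range(nbuckets)] + [vals[-1]]


def estimate_selectivity(bounds, lo, hi):
    """Estimated fraction of rows falling in [lo, hi), assuming uniform
    distribution within each bucket."""
    nbuckets = len(bounds) - 1
    frac = 0.0
    for i in range(nbuckets):
        b_lo, b_hi = bounds[i], bounds[i + 1]
        if b_hi <= lo or b_lo >= hi or b_hi == b_lo:
            continue
        overlap = min(hi, b_hi) - max(lo, b_lo)
        frac += (overlap / (b_hi - b_lo)) / nbuckets
    return frac


# Skewed column: 90 rows near zero, 10 rows spread out to 1000.
data = list(range(90)) + list(range(100, 1001, 100))
bounds = build_histogram(data, 10)
print(estimate_selectivity(bounds, 0, 90))  # ~0.85 (true fraction is 0.9)
```

A min/max-only estimator would see the range [0, 1000] and guess 9% for the same predicate; the histogram gets within a few percent of the true 90%.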
BTW, I'm open to suggestions, if you think there is something that would
be good to be on the pg optimizer, please tell me.
I'd like also to comment on a matter that is on
http://www.postgresql.org/docs/pgsql/doc/TODO.detail/optimizer, regarding
vacuum and analyze being together: I think it is a good idea for them to
be together, because if there is no vacuum then the optimizer should also
model the "clusterization" of tables - a seqscan on a highly unclustered
table could be very expensive. There is a good article regarding this:
http://www.db2mag.com/summer00/programmer.shtml section "use the index any
way that works best or use it the normal way?"
Best regards,
Tiago
Hi!
On Mon, 21 Aug 2000, Hannu Krosing wrote:
And predictability is GOOD ;)
I would even suggest that PG would warn about or even refuse to run queries
that have both nextval and currval of the same sequence inside them
(and pre-evaluate nextval) as only that case has _any_ predictability.
Isn't the problem more general than just nextval? Any user defined
function with side-effects would be a problematic one... plus a user
defined function might not be constant:
select ... from ... where x in (select side_effects(x) ...)
On correlated subqueries there is no guarantee of being constant.
In Prolog, which is a declarative language with some similarities to
relational algebra, the idea is: "if you use predicates with side effects,
then you're on your own".
Tiago
PS - Apologies for any irrelevant comment, I'm just starting to look to pg
code.
At 01:34 PM 8/21/00 +0100, Tiago Antão wrote:
Hi!
On Mon, 21 Aug 2000, Hannu Krosing wrote:
And predictability is GOOD ;)
I would even suggest that PG would warn about or even refuse to run queries
that have both nextval and currval of the same sequence inside them
(and pre-evaluate nextval) as only that case has _any_ predictability.
Isn't the problem more general than just nextval?
Yes, it is.
- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.
Tiago Antão <tra@fct.unl.pt> writes:
One thing that might be interesting (please tell me if you think
otherwise) would be to improve pg with better statistical information, by
using, for example, histograms.
Yes, that's been on the todo list for a while.
There is a good article regarding this:
http://www.db2mag.com/summer00/programmer.shtml
Interesting article. We do most of what she talks about, but we don't
have anything like the ClusterRatio statistic. We need it --- that was
just being discussed a few days ago in another thread. Do you have any
reference on exactly how DB2 defines that stat?
regards, tom lane
Tiago Antão <tra@fct.unl.pt> writes:
Isn't the problem more general than just nextval?
Yes it is, and that's why I'm not very excited about the idea of
adding special-case logic for nextval/currval into the optimizer.
It's fairly easy to get around this problem in plpgsql, eg
declare x int;
begin
    x := currval('seq');
    select into x f1 from foo where seqfld = x;
    return x;
end;
so I really am going to resist suggestions that the optimizer should
make invalid assumptions about currval by itself ...
regards, tom lane
Tom Lane wrote:
Tiago Antão <tra@fct.unl.pt> writes:
Isn't the problem more general than just nextval?
Yes it is, and that's why I'm not very excited about the idea of
adding special-case logic for nextval/currval into the optimizer.
It's fairly easy to get around this problem in plpgsql,
it is, once you know that psql implements volatile currval ;)
eg
declare x int;
begin
x := currval('seq');
return f1 from foo where seqfld = x;
so I really am going to resist suggestions that the optimizer should
make invalid assumptions about currval by itself ...
Why is assuming a constant currval any more "invalid" than not doing so ?
As the execution order of functions is undefined, can't we safely state
that all currvals are evaluated first, before any other functions that
could change their return value ?
currval is not like random, which changes its value without any external
reason.
Afaik, assuming it to return a constant within a single query is at least
as correct as not doing so, only more predictable.
----------------
Hannu
Tiago Antão wrote:
Hi!
On Mon, 21 Aug 2000, Hannu Krosing wrote:
And predictability is GOOD ;)
I would even suggest that PG would warn about or even refuse to run queries
that have both nextval and currval of the same sequence inside them
(and pre-evaluate nextval) as only that case has _any_ predictability.
Isn't the problem more general than just nextval? Any user defined
function with side-effects would be a problematic one... plus a user
defined function might not be constant:
select ... from ... where x in (select side_effects(x) ...)
On correlated subqueries there is no guarantee of being constant.
In Prolog, which is a declarative language with some similarities to
relational algebra, the idea is: "if you use predicates with side effects,
then you're on your own".
And you are probably even worse off in SQL, where the query plan changes
as tables fill up, so you can't even find out what will happen by testing.
With currval marked as constant I would at least know what currval
will return.
---------------
Hannu
On Mon, 21 Aug 2000, Tom Lane wrote:
One thing that might be interesting (please tell me if you think
otherwise) would be to improve pg with better statistical information, by
using, for example, histograms.
Yes, that's been on the todo list for a while.
If it's ok and nobody is working on that, I'll look into that subject.
I'll start by looking at the analyze portion of vacuum. I'm thinking of
using arrays for the histogram (I've never used the array data type of
postgres).
Should I use 7.0.2 or the cvs version?
Interesting article. We do most of what she talks about, but we don't
have anything like the ClusterRatio statistic. We need it --- that was
just being discussed a few days ago in another thread. Do you have any
reference on exactly how DB2 defines that stat?
I don't remember seeing that information specifically. From what I've
read I can speculate:
1. They have cluster ratios for both indexes and the relation itself.
2. They might use an index even if there is no "order by" if the table
has a low cluster ratio: just to get the RIDs, then sort the RIDs and
fetch.
3. One possible way to calculate this ratio:
a) for tables: seq scan; if a tuple points to a next tuple on the same
page then it is "good";
ratio = # good tuples / # all tuples
b) for indexes (high speculation ratio here): for each RID pointed to by
the index, if the RID is on the same page as the next RID in the index
then mark it as "good".
I suspect that if the tuple size is big (relative to page size) then the
cluster ratio is always low.
A tuple might also be "good" if it points to the next page.
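The speculative recipe in 3a/3b above can be sketched as follows (illustrative Python, not DB2 or pg code; counting "good" adjacent pairs out of n-1 pairs is one of several reasonable readings of the ratio):

```python
def cluster_ratio(pages, allow_next_page=False):
    """pages: heap page number of each tuple, in scan (or index) order.
    A pair is "good" when the next tuple lives on the same page, or,
    optionally, on the immediately following page."""
    if len(pages) < 2:
        return 1.0
    good = 0
    for cur, nxt in zip(pages, pages[1:]):
        if nxt == cur or (allow_next_page and nxt == cur + 1):
            good += 1
    return good / (len(pages) - 1)


# Perfectly clustered: tuples packed page by page.
print(cluster_ratio([0, 0, 0, 1, 1, 1], allow_next_page=True))  # 1.0
# Badly clustered: every fetch jumps to a different page.
print(cluster_ratio([0, 3, 1, 4, 2, 5]))                        # 0.0
```

For an index, `pages` would be the heap pages of the RIDs in index order, which is what makes an index scan over a low-ratio table so expensive: nearly every tuple fetch touches a new page.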
Tiago
Hannu Krosing <hannu@tm.ee> writes:
Why is assuming a constant currval any more "invalid" than not doing so ?
Because it's wrong: it changes the behavior from what happens if the
optimizer does not do anything special with the function.
The fact that some cases involving currval+nextval (but not all) yield
unpredictable results is not an adequate argument for causing the
behavior of other cases to change. Especially not when there's a
perfectly good way for you to make it do what you want...
regards, tom lane
Tom Lane wrote:
Hannu Krosing <hannu@tm.ee> writes:
Why is assuming a constant currval any more "invalid" than not doing so ?
Because it's wrong: it changes the behavior from what happens if the
optimizer does not do anything special with the function.
The optimizer already does "something special" regarding the function - it
decides the order of execution of rows, and when both currval and nextval
are present it changes the end result by doing so. If only currval is
present, currval is constant.
But the case when the "optimiser does not do anything with the function" is
completely unpredictable in the face of the optimiser changing the order in
which things get scanned, columns get scanned and functions get evaluated.
And I'm somewhat suspicious that we have any regression tests that depend
on left-to-right or top-to-bottom execution of functions.
The fact that some cases involving currval+nextval (but not all)
Could you give me a good example of currval+nextval that has a
SQL[92/99]-defined result, or even a predictable result?
yield unpredictable results is not an adequate argument for causing the
behavior of other cases to change.
Are not all the other cases returning "undefined" (by the standard)
results ?
I mean that the fact that a seasoned pg coder has a feel for what will
happen for some combinations of nextval/currval, indexes and table sizes
does not make even his assumptions always right or future-proof.
Especially not when there's a perfectly good way for you to make it do what you want...
You mean marking it const in my personal copy of pgsql ? ;)
I did
update pg_proc set proiscachable='t' where proname = 'currval';
And now it seems to do the right thing -
amphora2=# explain
amphora2-# select * from item where item_id = currval('item_id_seq');
NOTICE: QUERY PLAN:
Index Scan using item_pkey on item (cost=0.00..2.03 rows=1 width=140)
- Thanks.
Do you know of any circumstances where I would get _wrong_ answers by
doing the above ?
By wrong I mean really wrong, not just different from the case where
proiscachable='f'.
Can I now trust the optimiser to always pre-evaluate the currval() or are
there some circumstances where the behaviour is still unpredictable ?
PS. I would not call plpgsql or temporary tables a perfectly good way ?
Plpgsql is not even installed by default (on linux at least).
-------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
Tom Lane wrote:
The fact that some cases involving currval+nextval (but not all)
Could you give me a good example of currval+nextval that has a
SQL[92/99]-defined result, or even a predictable result?
currval & nextval aren't in the SQL standard, so asking for a standard-
defined result is rather pointless. However, it's certainly possible to
imagine cases where the result is predictable. For example,
UPDATE table SET dataval = foo, seqval = nextval('seq')
WHERE seqval = currval('seq')
is predictable if the seqval column is unique. Admittedly in that case
it wouldn't matter whether we pre-evaluated currval or not. But you'd
have to be very careful about what you mean by "pre-evaluation". For
example, the above could be executed many times within one interactive
query --- say, it could be executed inside a trigger function that's
fired multiple times by an interactive SELECT. Then the results will
change depending on just when you pre-evaluate currval. That's why I'd
rather leave it to the user to evaluate currval separately if he wants
pre-evaluation. That way the user can control what happens. If we
hard-wire an overly-optimistic pre-evaluation policy into the optimizer
then that policy will be wrong for some applications.
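The timing objection can be made concrete with a toy model (illustrative Python, not PostgreSQL internals; all names are made up): if the statement is fired several times within one interactive query while nextval advances in between, a value frozen at plan time and per-execution evaluation diverge.

```python
# Toy sequence stand-in.
seq = {"val": 0}


def nextval():
    seq["val"] += 1
    return seq["val"]


def currval():
    return seq["val"]


nextval()            # the session has called nextval once

folded = currval()   # "pre-evaluated" once, at plan time

per_exec = []
for _ in range(3):   # the statement fires three times (e.g. via a trigger)
    per_exec.append(currval())
    nextval()        # something else advances the sequence in between

print(folded)        # 1
print(per_exec)      # [1, 2, 3]
```

An optimizer policy hard-wired to the `folded` behaviour is wrong for any application that wanted the `per_exec` behaviour, which is why leaving the choice to the user is safer.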
Especially not when there's a perfectly good way for you to make it do what you want...
You mean marking it const in my personal copy of pgsql ? ;)
No, I meant putting a pre-evaluation into a plpgsql function, as I
illustrated earlier in this thread.
Do you know of any circumstances where I would get _wrong_ answers by
doing the above ?
I already told you earlier in this thread: it will fail inside sql or
plpgsql functions, because the optimizer will freeze the value of the
allegedly constant function sooner than you want, ie during first
execution of the sql/plpgsql function (assuming the input argument looks
like a constant, of course).
regards, tom lane