Function execution costs 'n all that
So I've been working on the scheme I suggested a few days ago of
representing "equivalence classes" of variables explicitly, and avoiding
the current ad-hocery of generating and then removing redundant clauses
in favor of generating only the ones we want in the first place. Any
clause that looks like an equijoin gets sent to the EquivalenceClass
machinery by distribute_qual_to_rels, and not put into the
restrictlist/joinlist data structure at all. Then we make passes over
the EquivalenceClass lists at appropriate times to generate the clauses
we want. This is turning over well enough now to pass the regression
tests, but I noticed that one query in opr_sanity got a whole lot slower
than before. The query is
SELECT p1.opcname, p1.opcfamily
FROM pg_opclass AS p1
WHERE NOT EXISTS (SELECT 1 FROM pg_amop AS p2
                  WHERE p2.amopfamily = p1.opcfamily
                    AND binary_coercible(p1.opcintype, p2.amoplefttype));
and investigation showed that the plan changed from (8.2 and before)
Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
  Filter: (NOT (subplan))
  SubPlan
    ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
          Filter: ((amopfamily = $0) AND binary_coercible($1, amoplefttype))
to
Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
  Filter: (NOT (subplan))
  SubPlan
    ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
          Filter: (binary_coercible($1, amoplefttype) AND (amopfamily = $0))
thus resulting in many more calls of binary_coercible(), which is a
pretty expensive function. This is not too surprising: the clause
p2.amopfamily = p1.opcfamily got diverted through the EquivalenceClass
code for just long enough to end up behind the other one in the table's
restrictlist :-(
In short, this approach results in a whole lot less stability in the
order in which WHERE clauses are evaluated. That might be a killer
objection to the whole thing, but on the other hand we've never made
any strong promises about WHERE evaluation order.
Instead, I'm thinking it might be time to re-introduce some notion of
function execution cost into the system, and make use of that info to
sort WHERE clauses into a reasonable execution order. This example
would be fixed with even a very stupid rule-of-thumb about SQL functions
being more expensive than C functions, but if we're going to go to the
trouble it seems like it'd be a good idea to provide a way to label
user-defined functions with execution costs.
Would a simple constant value be workable, or do we need some more
complex model (and if so what)?
Comments?
regards, tom lane
Tom Lane wrote:
Instead, I'm thinking it might be time to re-introduce some notion of
function execution cost into the system, and make use of that info to
sort WHERE clauses into a reasonable execution order.
That sounds like a straightforward idea.
This example
would be fixed with even a very stupid rule-of-thumb about SQL functions
being more expensive than C functions, but if we're going to go to the
trouble it seems like it'd be a good idea to provide a way to label
user-defined functions with execution costs.
Agreed.
Would a simple constant value be workable, or do we need some more
complex model (and if so what)?
A simple constant would probably be enough. If we want anything fancier
than that, it should be up to the author of the function to define the
cost model as well. I'm envisioning that you could attach a custom cost
function to a user-defined function which would return the estimated CPU
cost. And # of rows returned for a set-returning function.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
Tom Lane wrote:
Would a simple constant value be workable, or do we need some more
complex model (and if so what)?
A simple constant would probably be enough. If we want anything fancier
than that, it should be up to the author of the function to define the
cost model as well. I'm envisioning that you could attach a custom cost
function to a user-defined function which would return the estimated CPU
cost. And # of rows returned for a set-returning function.
But what will such an estimation function work on? In general the
planner does not have the value(s) that will be passed to the actual
function at runtime. It might have expressions or estimates but
those data structures are certainly not something we could pass to
non-C-coded functions. Are we willing to restrict these functions
to being coded in C, as selectivity estimation functions are?
If we go this route it seems like we'll need four more columns in
pg_proc (cost estimation function OID, rowcount estimation function OID,
fallback cost constant, fallback rowcount constant).
BTW, I'm thinking that a "cost constant" probably ought to be measured
in units of cpu_operator_cost. The default for built-in functions would
thus be 1, at least till such time as someone wants to refine the
estimates. We'd probably want the default for PL and SQL functions to
be 10 or 100 or so.
regards, tom lane
Tom Lane wrote:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
A simple constant would probably be enough. If we want anything fancier
than that, it should be up to the author of the function to define the
cost model as well. I'm envisioning that you could attach a custom cost
function to a user-defined function which would return the estimated CPU
cost. And # of rows returned for a set-returning function.
But what will such an estimation function work on? In general the
planner does not have the value(s) that will be passed to the actual
function at runtime. It might have expressions or estimates but
those data structures are certainly not something we could pass to
non-C-coded functions. Are we willing to restrict these functions
to being coded in C, as selectivity estimation functions are?
Yeah, I don't know. If the planner knows the actual value, that would
certainly be the easiest for the UDF writer to work with. Anything more
than that gets really complicated.
If we go this route it seems like we'll need four more columns in
pg_proc (cost estimation function OID, rowcount estimation function OID,
fallback cost constant, fallback rowcount constant).
What would the fallbacks be for?
BTW, I'm thinking that a "cost constant" probably ought to be measured
in units of cpu_operator_cost. The default for built-in functions would
thus be 1, at least till such time as someone wants to refine the
estimates. We'd probably want the default for PL and SQL functions to
be 10 or 100 or so.
Agreed.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
Tom Lane wrote:
If we go this route it seems like we'll need four more columns in
pg_proc (cost estimation function OID, rowcount estimation function OID,
fallback cost constant, fallback rowcount constant).
What would the fallbacks be for?
By "fallback" I meant "this is what to use if no estimation function is
provided".
I'm envisioning that the CREATE FUNCTION syntax would add optional
clauses
COST function-name-or-numeric-constant
ROWS function-name-or-numeric-constant
that would be used to fill these columns.
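For instance, using the constant forms (a hypothetical function and
numbers; cost in units of cpu_operator_cost as suggested above):

    CREATE FUNCTION matching_ids(text) RETURNS SETOF integer AS
        'SELECT id FROM my_table WHERE name LIKE $1'  -- hypothetical body
    LANGUAGE sql
    COST 500     -- estimated CPU cost per call, in cpu_operator_cost units
    ROWS 1000;   -- estimated result size, since this returns a set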
regards, tom lane
On Mon, 15 Jan 2007, Tom Lane wrote:
So I've been working on the scheme I suggested a few days ago of
representing "equivalence classes" of variables explicitly, and avoiding
the current ad-hocery of generating and then removing redundant clauses
in favor of generating only the ones we want in the first place. Any
clause that looks like an equijoin gets sent to the EquivalenceClass
machinery by distribute_qual_to_rels, and not put into the
restrictlist/joinlist data structure at all. Then we make passes over
the EquivalenceClass lists at appropriate times to generate the clauses
we want. This is turning over well enough now to pass the regression
tests,
That was quick...
In short, this approach results in a whole lot less stability in the
order in which WHERE clauses are evaluated. That might be a killer
objection to the whole thing, but on the other hand we've never made
any strong promises about WHERE evaluation order.
Showing my ignorance here, but I've never been a fan of "syntax-based
optimization," though it is better than no optimization. If people are
counting on order for optimization, then, hmmm... If you can provide a way
to at least _try_ to do better, then don't worry about it. It will improve
with time.
Instead, I'm thinking it might be time to re-introduce some notion of
function execution cost into the system, and make use of that info to
sort WHERE clauses into a reasonable execution order.
Ingres did/does it that way, IIRC. It's a solid strategy.
This example
would be fixed with even a very stupid rule-of-thumb about SQL functions
being more expensive than C functions, but if we're going to go to the
trouble it seems like it'd be a good idea to provide a way to label
user-defined functions with execution costs.
Would a simple constant value be workable, or do we need some more
complex model (and if so what)?
Ingres would, if I'm not mistaken, gain from historical use through
histograms. Short of that, you've got classes of functions, aggregations
for example, and there's sure to be missing information for making a great
decision at planning time. However, I take it that the cost here is
primarily CPU and not I/O. I therefore propose that the engine evaluate -
benchmark, if you will - all functions as they are ingested, or
vacuum-like at some later date (when valid data for testing may exist),
and assign a cost relative to what it already knows - the built-ins, for
example. Doing so could allow this strategy to be functional in short
order and be improved with time so all the work doesn't have to be
implemented on day 1. And, DBA/sys-admin tweaking can always be done by
updating the catalogues.
HTH,
Richard
--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@ScienceTools.com, http://ScienceTools.com/
On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
I therefore propose that the engine evaluate -
benchmark, if you will - all functions as they are ingested, or
vacuum-like at some later date (when valid data for testing may exist),
and assign a cost relative to what it already knows - the built-ins, for
example.
That seems pretty unworkable. It is unsafe, for one: evaluating a
function may have side effects (inside or outside the database), so the
DBMS cannot just invoke user-defined functions at whim. Also, the
relationship between a function's arguments and its performance will
often be highly complex -- it would be very difficult, not to mention
computationally infeasible, to reconstruct that relationship
automatically, especially without any real knowledge about the
function's behavior.
-Neil
Neil Conway wrote:
On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
I therefore propose that the engine evaluate -
benchmark, if you will - all functions as they are ingested, or
vacuum-like at some later date (when valid data for testing may exist),
and assign a cost relative to what it already knows - the built-ins, for
example.
That seems pretty unworkable. It is unsafe, for one: evaluating a
function may have side effects (inside or outside the database), so the
DBMS cannot just invoke user-defined functions at whim. Also, the
relationship between a function's arguments and its performance will
often be highly complex -- it would be very difficult, not to mention
computationally infeasible, to reconstruct that relationship
automatically, especially without any real knowledge about the
function's behavior.
Non-developer here, but we use a lot of plpgsql functions at work. And
the functions we use fall into two broad, ill-defined categories:
"expensive" functions and "cheap" functions. What I'd like as a user is
some way to tell the planner "this function is expensive: prefer plans
which call this function less even if they're otherwise more expensive"
or "this function is cheap: prefer plans that are otherwise less
expensive even if they call this function more often". Precise cost
estimates aren't that important, IMHO.
Brian
On Mon, 15 Jan 2007, Neil Conway wrote:
On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
I therefore propose that the engine evaluate -
benchmark, if you will - all functions as they are ingested, or
vacuum-like at some later date (when valid data for testing may exist),
and assign a cost relative to what it already knows - the built-ins, for
example.
That seems pretty unworkable. It is unsafe, for one: evaluating a
function may have side effects (inside or outside the database), so the
DBMS cannot just invoke user-defined functions at whim. Also, the
relationship between a function's arguments and its performance will
often be highly complex -- it would be very difficult, not to mention
computationally infeasible, to reconstruct that relationship
automatically, especially without any real knowledge about the
function's behavior.
-Neil
Hi Neil,
Tom had already proposed:
I'm envisioning that the CREATE FUNCTION syntax would add optional
clauses
COST function-name-or-numeric-constant
ROWS function-name-or-numeric-constant
that would be used to fill these columns.
I was considering these ideas in the mix: let the user provide either a
numeric constant or a function, the distinction here being that instead
of running that function at planning time, it could be run "off-line",
so to speak.
Richard
--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@ScienceTools.com, http://ScienceTools.com/
Brian Hurt <bhurt@janestcapital.com> writes:
Non-developer here, but we use a lot of plpgsql functions at work. And
the functions we use fall into two broad, ill-defined categories:
"expensive" functions and "cheap" functions. What I'd like as a user is
some way to tell the planner "this function is expensive: prefer plans
which call this function less even if they're otherwise more expensive"
or "this function is cheap: prefer plans that are otherwise less
expensive even if they call this function more often". Precise cost
estimates aren't that important, IMHO.
Right, so a plain constant cost would be plenty for your situation.
I suspect there's an 80/20 rule at work here --- the estimator-function
side of this will take most of the effort to design/implement, but not
get used nearly as much as the plain-constant form ... maybe we should
just do the constant for starters and see how many people really want to
write C-code estimators ...
regards, tom lane
On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
maybe we should just do the constant for starters and see how many
people really want to write C-code estimators ...
+1
BTW, your proposal would still push down all qualifiers, right?
Hellerstein's xfunc work discusses situations in which it makes sense to
pull up expensive qualifiers above joins, for example, in order to reduce
the number of tuples the qualifier is applied to. Unfortunately, this
would probably increase the optimizer's search space by a fairly
significant degree, so it might need to be controlled by a GUC variable,
or only applied when the estimated cost of applying a qualifier is
particularly large relative to the total estimated cost of the plan.
-Neil
Neil Conway <neilc@samurai.com> writes:
BTW, your proposal would still pushdown all qualifiers, right?
Yeah, I have no intention of readopting xfunc in the near future ...
especially seeing that it's possible for the user to force that
sort of thing if he really has to.
SELECT * FROM (SELECT ... OFFSET 0) ss
WHERE expensive_function(...);
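For instance (hypothetical tables and function):

    SELECT *
    FROM (SELECT a.id, a.val
          FROM a JOIN b USING (id)
          OFFSET 0) ss    -- OFFSET 0 keeps the subquery from being flattened
    WHERE expensive_function(ss.val);

Without the OFFSET 0 the planner could flatten the subquery and push the
expensive qualifier down to the scan of a; with it, the function is
evaluated only for rows that survive the join.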
regards, tom lane
On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
Brian Hurt <bhurt@janestcapital.com> writes:
Non-developer here, but we use a lot of plpgsql functions at work. And
the functions we use fall into two broad, ill-defined categories:
"expensive" functions and "cheap" functions. What I'd like as a user is
some way to tell the planner "this function is expensive: prefer plans
which call this function less even if they're otherwise more expensive"
or "this function is cheap: prefer plans that are otherwise less
expensive even if they call this function more often". Precise cost
estimates aren't that important, IMHO.
Right, so a plain constant cost would be plenty for your situation.
I suspect there's an 80/20 rule at work here --- the estimator-function
side of this will take most of the effort to design/implement, but not
get used nearly as much as the plain-constant form ... maybe we should
just do the constant for starters and see how many people really want to
write C-code estimators ...
regards, tom lane
Hi Tom et al,
Having worked with stored procedures on large datasets for reporting, I
would say that it would be useful to have a non-constant estimator for
the number of rows, whereas a single CPU cost constant should be fine.
Where I have struggled with this has been joining onto slightly more
exotic queries when doing large-scale data processing as part of a
custom report or an application upgrade.
Using PL/pgSQL, I would find it useful for a row count estimator to have
access to the constants passed into the function (typically useful when
passing in table/column names), e.g.
SELECT * FROM my_func('my_table1') AS t1, my_table2 AS t2
WHERE t1.id = t2.id;
CREATE FUNCTION my_func(text) RETURNS SETOF record AS $$
...
$$ LANGUAGE plpgsql COST 1.0 ROWS my_func_row_cost;
In my cost function, I could then estimate the number of rows using
something like below, where all constants are passed into the cost
function as parameters, e.g.:
CREATE FUNCTION my_func_row_cost(text) RETURNS bigint AS $$
DECLARE
    foo bigint;
BEGIN
    -- the parameter is a table name, so quote it as an identifier
    EXECUTE 'SELECT count(*) FROM ' || quote_ident($1) INTO foo;
    RETURN foo;
END;
$$ LANGUAGE plpgsql;
In the case where a parameter is not a constant but a column name, it
would be reasonable in my mind to simply replace that parameter with
NULL when passing it to the row cost function, e.g.
SELECT * FROM my_table1
WHERE my_table1.junk = (SELECT my_func(my_table1.name));
In this case, the text parameter passed to my_func_row_cost would be
replaced by NULL to indicate that it was non-constant.
Of course, even with constants passed as input, it still may not be
possible to calculate the number of rows that will be returned - it could
be the case that the only parameter passed to the cost function has been
converted to NULL because it is a column name. Perhaps in this case it
would be useful to specify that returning NULL from my_func_row_cost means
"I can't return anything meaningful, so use the fallback values".
Kind regards,
Mark.
Tom Lane wrote:
Instead, I'm thinking it might be time to re-introduce some notion of
function execution cost into the system, and make use of that info to
sort WHERE clauses into a reasonable execution order.
I imagine you've thought of this already but just in case, the cost of the
function call has to be combined with the selectivity to get this right. If
you can do an expensive but very selective clause first and save 100 cheap
calls that almost always return true it may still be worthwhile.
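To put numbers on it (hypothetical figures, and assuming the clauses are
independent): if clause A costs 100 units and passes 1% of rows while
clause B costs 1 unit and passes 99.9%, then evaluating A first costs
about 100 + 0.01 * 1 = 100.01 per input row, while B first costs
1 + 0.999 * 100 = 100.9. So the expensive-but-selective clause really can
deserve to go first.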
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark <gsstark@mit.edu> writes:
I imagine you've thought of this already but just in case, the cost of the
function call has to be combined with the selectivity to get this right. If
you can do an expensive but very selective clause first and save 100 cheap
calls that almost always return true it may still be worthwhile.
I've thought of it, but I haven't figured out a reasonable algorithm for
ordering the clauses in view of that. Have you?
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
Gregory Stark <gsstark@mit.edu> writes:
I imagine you've thought of this already but just in case, the cost of the
function call has to be combined with the selectivity to get this right. If
you can do an expensive but very selective clause first and save 100 cheap
calls that almost always return true it may still be worthwhile.
I've thought of it, but I haven't figured out a reasonable algorithm for
ordering the clauses in view of that. Have you?
Hum, I hadn't tried. Now that I think about it, it's certainly not obvious.
And picturing the possible failure modes, I would rather it execute cheap
expressions more often than necessary than call some user-defined Perl
function that could be doing I/O or waiting on other resources any
more than absolutely necessary. So I guess what you originally described is
safest.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
BTW, I'm thinking that a "cost constant" probably ought to be measured
in units of cpu_operator_cost. The default for built-in functions would
thus be 1, at least till such time as someone wants to refine the
estimates. We'd probably want the default for PL and SQL functions to
be 10 or 100 or so.
Any chance that costs could eventually change to real-world units?
It's hard for me to guess how many cpu_operator_cost units
something might take, but relatively easy for me to measure
or estimate in fractions of a second how long something takes.
I could imagine having the other planner costs be measured in seconds
too - perhaps with the goal of eventually writing some auto-config
code that tries to measure values like cpu_tuple_cost on a given
piece of hardware.
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
Tom Lane wrote:
BTW, I'm thinking that a "cost constant" probably ought to be measured
in units of cpu_operator_cost.
Any chance that costs could eventually change to real-world units?
Define "real world units".
If you like you can try to adjust things so that cpu_operator_cost bears
some relation to elapsed-msec-on-your-own-machine, and scale everything
else accordingly; indeed that's why we added seq_page_cost in 8.2.
But it's awfully hard to see the numbers being very portable.
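For instance (hypothetical timings): if a sequential page fetch takes
about 1 msec on your machine and a simple operator about 2.5 usec, then

    SET seq_page_cost = 1.0;          -- one cost unit ~= 1 msec
    SET cpu_operator_cost = 0.0025;   -- 2.5 usec / 1 msec

would make a function COST of 400 correspond to roughly 1 msec per call.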
regards, tom lane
On Mon, 2007-01-15 at 13:54 -0500, Neil Conway wrote:
On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
I therefore propose that the engine evaluate -
benchmark, if you will - all functions as they are ingested, or
vacuum-like at some later date (when valid data for testing may exist),
and assign a cost relative to what it already knows - the built-ins, for
example.
That seems pretty unworkable. It is unsafe, for one: evaluating a
function may have side effects (inside or outside the database), so the
Would any form of cost estimate have meaning if the function has side
effects? If it's a volatile function, doesn't that mean that the planner
can't avoid or favor executing it?
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
Would any form of cost estimate have meaning if the function has side
effects? If it's a volatile function, doesn't that mean that the planner
can't avoid or favor executing it?
No, not really. If the function is down inside a sub-select or
something like that, the number of executions could depend on any number
of factors (like whether we put it on the inside or outside of a join)
--- and if it's expensive then we will and should try to arrange the
query to minimize the number of executions. We're certainly not going
to drop back to all-plain-nestloops just because the query contains one
volatile function.
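For instance (hypothetical schema):

    SELECT *
    FROM t1 JOIN t2 ON t1.id = t2.id
    WHERE volatile_check(t1.x, t2.y);

Used as a join qual like that, volatile_check() runs once per tuple pair
the executor examines, and that count depends directly on the join order
and join method chosen.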
(Now mind you, a smart user will probably avoid using a volatile
function like that, but I don't think it's an adequate argument for
saying that we don't need cost information.)
regards, tom lane