An Idea for planner hints
Hi
Since the discussion about how to force a specific plan has
come up, I thought I'd post an idea I had for this a while ago.
It's not really well thought out yet, but anyway.
When the topic of optimizer hints comes up, people often suggest
that there should be a way to force postgres to use a certain
index, or do joins in a certain order. AFAIK, this mimics what
oracle does - you can put comments into your query that specify
what index to use. This approach has two major drawbacks:
.) Plans that seem good now might not seem that good a few months
later - your data might have changed, and other execution plans
might fit better now
.) You have to change all your queries to make use of features
in new postgres versions, like bitmap scans.
My experience with the postgres optimizer is that it usually performs
great - and if it doesn't, it has always boiled down to two problems
(at least for me):
.) The query is autogenerated, and includes complex, and highly inter-
dependent where (or join) conditions. This leads to wrong estimates
of where selectivity, and thus to bad plans.
.) There are correlations between columns and/or tables that postgres
doesn't know about (and has no chance of knowing about). Again, this
leads to vastly wrong estimates of row counts, and to bad plans.
I think that those bad estimates of the selectivity of where-clauses
(or on-clauses for joins) are where postgres could use hints.
Imagine a query like "select ... from t1 join t2 on t1.t2_id = t2.id and
<expr>". Let's say that "<expr>" is true for only 1% of the rows in t2 -
but those are exactly the rows that have matching rows in t1.
Postgres would probably guess that this join will produce about 1/100
of the rows that t1 has - but I _know_ that it will produce 100 (!)
times more rows.
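To put rough numbers on the scenario above (all figures hypothetical, just to illustrate the arithmetic behind the 100x misestimate):

```python
# Hypothetical numbers: <expr> keeps 1% of t2, but exactly those
# rows are the ones t1 references.
t1_rows = 5_000
expr_selectivity = 0.01  # fraction of t2 surviving <expr>

# Planner-style estimate: the filter is assumed independent of the
# join, so the join output is scaled down by the filter's selectivity.
estimated_rows = t1_rows * expr_selectivity

# Actual result: every t1 row still finds its match, so the filter
# removes nothing from the join output.
actual_rows = t1_rows

print(estimated_rows, actual_rows, actual_rows / estimated_rows)
# 50.0 5000 100.0 -- off by exactly the factor described above
```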
Now, I'd like to hand that information to postgres. I wouldn't want
to force any particular access method or join order, but rather I'd
just tell it "hey, this expression has selectivity 1 in this context,
not 0.01 as you might think".
Could that work?
greetings, Florian Pflug
If this feature I'm proposing already exists, sorry for the waste of bandwidth,
and could someone please point me to it? :)
What if there were a mode that told postgres to do an exhaustive search (or if
not exhaustive, then much more extensive search) of all plans (or many plans),
trying each plan, reporting the performance of each, and discarding the query
results, much like "explain analyze" does. Postgres could then dump the best
plan in machine readable (and semi-human readable) form which the planner could
parse and use at some later date in lieu of a SQL query.
This would allow people with reasonably static table statistics (where the best
plan is not likely to change) to spend upfront cycles investigating the best
plan and then embed that plan in their business logic. Since the stored plan is
both written by and read by postgres, it can get quite complicated without
putting a burden on humans to read and write such complicated things. It would
also remove the risk that the planner will occasionally (due to its
nondeterministic workings) choose a really bad plan and stall a production system.
mark
On Tue, Aug 08, 2006 at 01:55:35PM +0200, Florian G. Pflug wrote:
<snip>
Imagine a query like "select ... from t1 join t2 on t1.t2_id = t2.id and
<expr>". Let's say that "<expr>" is true for only 1% of the rows in t2 -
but those are exactly the rows that have matching rows in t1.

Postgres would probably guess that this join will produce about 1/100
of the rows that t1 has - but I _know_ that it will produce 100 (!)
times more rows.
ISTM that the easiest way would be to introduce a sort of predicate
like so:
SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
If you teach the optimiser that pg_selectivity always has the
selectivity of the second argument, you're done. Other than that you
just need to define pg_selectivity as a no-op.
One thing though: when people think of selectivity, they think "number
of rows in foo that have a match in bar", whereas selectivity for
postgres means "chance this expression will be true". They are related
but not the same thing. Converting from one to the other will have its
own pitfalls...
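That conversion can be sketched with made-up numbers (assuming the usual definition where a join selectivity is a fraction of the cross product, i.e. rows_out = selectivity * |foo| * |bar|):

```python
foo_rows = 1_000
bar_rows = 50_000
matches_per_foo_row = 5   # how a user would naturally state the relationship

# Rows the join actually produces.
rows_out = matches_per_foo_row * foo_rows

# The selectivity the planner would need: a fraction of the cross product.
selectivity = rows_out / (foo_rows * bar_rows)

print(rows_out, selectivity)
# 5000 0.0001 -- "5 matches per row" translates to a selectivity of 1e-4
```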
Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
On Tue, Aug 08, 2006 at 01:55:35PM +0200, Florian G. Pflug wrote:
<snip>
ISTM that the easiest way would be to introduce a sort of predicate
like so:

SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
Ideally, though, it needs to be defined on the table(s) in question,
possibly with a WHERE clause, as with indexes:
CREATE STATISTIC <...defn here...>
ON invoices (cli_id), clients (id)
WHERE invoices.paid = false
WITH PRIORITY 100;
(I'm suggesting the priority so you can delete any rules with a low
priority while keeping the ones you think are vital.)
--
Richard Huxton
Archonet Ltd
Martijn van Oosterhout <kleptog@svana.org> writes:
ISTM that the easiest way would be to introduce a sort of predicate
like so:
SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
The one saving grace of Florian's proposal was that you could go hack
the statistics *without* changing your queries. This throws that away
again.
The thing I object to about the "I want to decorate my queries with
planner hints" mindset is that it's coming at it from the wrong
direction. You should never be thinking in terms of "fix this one
query", because that just leads back into the same dead end that your
fix doesn't work tomorrow. What you *should* be thinking about is "why
did the planner get this wrong, and how do I fix the generic problem?".
If you attack it that way then your fix is much more likely to work on
the next slightly-different query.
So some kind of override for statistical guesses doesn't seem completely
silly to me. But it needs to be declarative information that's stored
somewhere out of view of the actual SQL queries. IMHO anyway.
regards, tom lane
On Tue, Aug 08, 2006 at 04:14:45PM -0400, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
ISTM that the easiest way would be to introduce a sort of predicate
like so:

SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
The one saving grace of Florian's proposal was that you could go hack
the statistics *without* changing your queries. This throws that away
again.
Well, that's true. I was thinking of the easy way.
To run with something suggested in this thread, do you think it would
be more reasonable to be able to provide statistics information for
joins, which we currently have no grip on at all? Something like:
CREATE STATISTIC foo
ON table1 a, table2 b
WHERE a.x = b.x
AS SELECTIVITY < 0.1;
The idea being that if the planner sees those tables being joined on
those fields, it will make its guess at the number of rows, but cap
the selectivity to less than 0.1.
My main problem is that selectivity is the wrong measurement. What
users really want to be able to communicate is:
1. If you join tables a and b on x, the number of resulting rows will be
the number of rows selected from b (since b.x is a foreign key
referencing a.x).
2. That on average there is an N:1 ratio of results between a.x and b.x.
So if you take a value of a.x and look it up in b, on average you'll
get N results. This can be a valid measurement for any two columns, not
just ones related by a foreign key.
For either of those, selectivity is the wrong variable, but I'll be
damned if I can think of a better way of expressing it...
The interesting case would be joins across a number of tables and be
able to tell the planner information about that, but that's an even
harder problem.
Have a nice day,
Tom Lane wrote:
The thing I object to about the "I want to decorate my queries with
planner hints" mindset is that it's coming at it from the wrong
direction. You should never be thinking in terms of "fix this one
query", because that just leads back into the same dead end that your
fix doesn't work tomorrow. What you *should* be thinking about is "why
did the planner get this wrong, and how do I fix the generic problem?".
If you attack it that way then your fix is much more likely to work on
the next slightly-different query.

So some kind of override for statistical guesses doesn't seem completely
silly to me. But it needs to be declarative information that's stored
somewhere out of view of the actual SQL queries. IMHO anyway.
Imagine a join between two tables:
select a.x, b.y from a, b where a.x = f(b.y);
I may know that, given the data I've put into the tables, only one value in b
will ever match one value in a. Or perhaps I know that no more than ten rows in
b will match a given value in a. But how can the statistics from ANALYZE ever
see through arbitrary math functions to know this sort of thing?
The current analyze functionality, as I understand it, can store information
about a given table, but not about the relationships between the data in several
tables, which is the information the planner would need to choose the right
plan. Do all the requests from postgres users for giving hints to the planner
involve this type of situation, where the hints are not about a single table,
but rather about the relationship between two or more tables and specific joins
between them?
Do I understand correctly? Is this a reasonable analysis?
mark
Martijn van Oosterhout <kleptog@svana.org> writes:
My main problem is that selectivity is the wrong measurement. What
users really want to be able to communicate is:
1. If you join tables a and b on x, the number of resulting rows will be
the number of rows selected from b (since b.x is a foreign key
referencing a.x).
FWIW, I believe the planner already gets that case right, because a.x
will be unique and it should know that. (Maybe not if the FK is across
a multi-column key, but in principle it should get it right.)
I agree though that meta-knowledge like this is important, and that
standard SQL frequently doesn't provide any adequate way to declare it.
regards, tom lane
On Tue, 2006-08-08 at 22:14, Tom Lane wrote:
So some kind of override for statistical guesses doesn't seem completely
silly to me. But it needs to be declarative information that's stored
somewhere out of view of the actual SQL queries. IMHO anyway.
The real problem is that sometimes there's no way to get a better plan
without some code change in the planner. And given the postgres release
policy, that might be as much as a year away for a normal user... of
course it's open source, you can patch it, but would I trust a patch
that hasn't been tested by the community? So mostly I can't wait for
code changes, and then a generic tool to fix the one bad query that
brings my system down _now_ would be nice. This is why hints would be
nice: to quick-fix immediate problems. Of course they can and would be
abused, like anything else.
On the planner improvements part, would it be possible to save
statistics about join criteria between tables? I'm not sure where that
would belong, but I guess it would be possible to have a special kind of
ANALYZE which analyzes multiple tables and their correlations... this
way the user would not need to hard-code the statistics hints, but the
system could generate them.
Cheers,
Csaba.
Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
ISTM that the easiest way would be to introduce a sort of predicate
like so:

SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
The one saving grace of Florian's proposal was that you could go hack
the statistics *without* changing your queries. This throws that away
again.
I think for this to be really effective, you'd actually need both - a
query-independent way of specifying selectivities, and a way to influence
the estimates for a _single_ query.

Imagine a complex, autogenerated query which looks something like this:
select ....
from t1
join t2 on ...
join t3 on ...
join t4 on ...
...
...
where
<big, complicated expression derived from some user input>.
This big, complicated expression looks different for every query - and
currently, postgres often vastly overestimates the selectivity of this
expression. This leads to weird join orders, and generally very bad
performance. Of course, *I* don't know the selectivity of this
expression myself - but experience tells me that on average it's
something like 50%, and not 1% as postgres believes. So, in that case,
being able to write
select ... join .... where pg_selectivity(<expression>, 0.5)
would be a big win.
The thing I object to about the "I want to decorate my queries with
planner hints" mindset is that it's coming at it from the wrong
direction. You should never be thinking in terms of "fix this one
query", because that just leads back into the same dead end that your
fix doesn't work tomorrow. What you *should* be thinking about is "why
did the planner get this wrong, and how do I fix the generic problem?".
If you attack it that way then your fix is much more likely to work on
the next slightly-different query.
Fixing the generic problem is surely the best _if_ there is a fix for
the generic problem at all. But if your where-conditions involve fields
from 10 different tables, then IMHO there is no way to _ever_ guarantee
that postgres will get correct selectivity estimates. But since (at
least for me) overestimating selectivity hurts far more than
underestimating it, forcing postgres to just assume a certain
selectivity could help.
I'm not in any way saying that there should _only_ be selectivity
annotations inside the query - a query-independent mechanism would
be a very nice thing to have. But a query-independent mechanism
wont be sufficient in all cases IMHO.
greetings, Florian Pflug
On Wed, Aug 09, 2006 at 12:57:39PM +0200, Florian G. Pflug wrote:
Fixing the generic problem is surely the best _if_ there is a fix for
the generic problem at all. But if your where-conditions involve fields
from 10 different tables, then IMHO there is no way to _ever_ guarantee
that postgres will get correct selectivity estimates. But since (at
least for me) overestimating selectivity hurts far more than
underestimating it, forcing postgres to just assume a certain
selectivity could help.
I'm not sure if the problem is totally solvable, but we can certainly
do a lot better than we do now.
ISTM that what's really missing at the moment is some kind of
post-mortem analysis that looks at the EXPLAIN ANALYZE output, pulls it
apart and says: 'look, we went wrong here'. For leaf nodes trying to
estimate the selectivity on a single table it's easy. But working out
the selectivity of join nodes is harder.
Where we really fall down right now is that we do not recognise highly
correlated columns. If we have the expression WHERE a = 1 AND b = 2 we
assume the expressions are independent and multiply the selectivities
together. Often this is the wrong thing to do.
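The effect of that independence assumption is easy to reproduce with a toy dataset (synthetic data with perfectly correlated columns; real tables would be less extreme):

```python
import random

random.seed(42)

# Synthetic table where b is always equal to a: perfect correlation.
table = [(v, v) for v in (random.randint(1, 10) for _ in range(10_000))]

sel_a = sum(1 for a, b in table if a == 1) / len(table)
sel_b = sum(1 for a, b in table if b == 1) / len(table)

# Independence assumption: multiply the per-column selectivities.
estimated = sel_a * sel_b * len(table)

# Reality: the second predicate is redundant, so nothing extra is filtered.
actual = sum(1 for a, b in table if a == 1 and b == 1)

print(round(estimated), actual)  # the estimate is roughly 10x too low
```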
This is also a problem for columns in different tables that get joined
on. Currently we don't do anything special there either.
Perhaps the way to go would be to allow users to declare columns often
used together and have ANALYSE collect information on correlation which
can be used later...
Have a nice day,
In article <14860.1155090146@sss.pgh.pa.us>,
tgl@sss.pgh.pa.us (Tom Lane) wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:

My main problem is that selectivity is the wrong measurement. What
users really want to be able to communicate is:

1. If you join tables a and b on x, the number of resulting rows will be
the number of rows selected from b (since b.x is a foreign key
referencing a.x).

FWIW, I believe the planner already gets that case right, because a.x
will be unique and it should know that. (Maybe not if the FK is across
a multi-column key, but in principle it should get it right.)

I agree though that meta-knowledge like this is important, and that
standard SQL frequently doesn't provide any adequate way to declare it.
Every once in a while people talk about collecting better statistics,
multi-column correlations, etc. But there never seems to be
a way to collect that data/statistics.
Would it be possible to determine the additional statistics the planner
needs, modify the statistics table to have them and document how to
insert data there? We wouldn't have a good automated way to determine
the information but a properly educated DBA could tweak things until
they are satisfied.
At worst, if this new information is unpopulated then things would be as
they are now. But if a human can insert the right information then some
control over the planner would be possible.
Is this a viable idea? Would this satisfy those that need to control
the planner immediately without code changes?
-arturo
"Florian G. Pflug" <fgp@phlo.org> writes:
Imagine a complex, autogenerated query which looks something like this:
select ....
from t1
join t2 on ...
join t3 on ...
join t4 on ...
...
...
where
<big, complicated expression derived from some user input>.
This big, complicated expression looks different for every query - and
currently, postgres often vastly overestimates the selectivity of this
expression.
This is a straw man. There is no way that your application can throw in
a chosen-at-random selectivity value for a join condition that it
doesn't understand and have that be more likely to be right than the
planner's guess.
regards, tom lane
SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
ISTM that you introduced the Oracle silliness again, putting the hint into the
query.
My suggestion would be to tell about it separately. Something like
CREATE HINT FOR JOIN foo, bar ON foo.a=bar.b AS <some hint>;
This way hints can be added and removed without ever touching the existing
queries.
--
Med venlig hilsen
Kaare Rasmussen, Jasonic
Jasonic Telefon: +45 3816 2582
Nordre Fasanvej 12
2000 Frederiksberg Email: kaare@jasonic.dk
Tom Lane wrote:
"Florian G. Pflug" <fgp@phlo.org> writes:
Imagine a complex, autogenerated query which looks something like this:
select ....
from t1
join t2 on ...
join t3 on ...
join t4 on ...
...
...
where
<big, complicated expression derived from some user input>.

This big, complicated expression looks different for every query - and
currently, postgres often vastly overestimates the selectivity of this
expression.

This is a straw man. There is no way that your application can throw in
a chosen-at-random selectivity value for a join condition that it
doesn't understand and have that be more likely to be right than the
planner's guess.
No, my application probably won't get it right, _but_
.) I can at least _choose_ what selectivity to use. My experience is
that a selectivity that is too small (meaning that postgres
underestimates the number of records resulting from a join or where)
is usually much worse than an overly large selectivity (meaning that
postgres expects more records than it actually finds). Forcing a
high selectivity (thus letting postgres expect a lot of records)
should therefore lead to better plans than letting postgres
underestimate the selectivity.
.) Often, my application (or I) *can* guess better than postgres. My
application, for example, executes the same set of about 100 queries
every day to build cache tables. Since I _know_ how many records the
query returned yesterday, I can use that value to get a *very*
good approximation of the selectivity. This is something my app
can do easily, while postgres would have a really hard time figuring
that out.
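That tactic could look roughly like this in the application (hypothetical numbers; assuming a hint mechanism that expects a selectivity defined over the cross product of the joined tables):

```python
# Yesterday's run of the same cache-building query.
rows_observed = 42_000

# Current table sizes (e.g. as reported by the database's statistics).
t1_rows = 50_000
t2_rows = 80_000

# Selectivity to hand to a hypothetical pg_selectivity()-style hint:
# observed output rows as a fraction of the cross product.
hint = rows_observed / (t1_rows * t2_rows)

print(hint)  # 1.05e-05
```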
greetings, Florian Pflug
Been suggested before... the problem is actually doing something useful
with all that data that's collected, as well as how to collect it
without greatly impacting the system.
On Tue, Aug 08, 2006 at 08:23:05AM -0700, Mark Dilger wrote:

What if there were a mode that told postgres to do an exhaustive search (or
if not exhaustive, then much more extensive search) of all plans (or many
plans), trying each plan, reporting the performance of each, and discarding
the query results, much like "explain analyze" does.

<snip>
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, Aug 09, 2006 at 02:02:10PM +0200, Martijn van Oosterhout wrote:
<snip>

Perhaps the way to go would be to allow users to declare columns often
used together and have ANALYSE collect information on correlation which
can be used later...
One thing that would help tremendously would be to collect stats on
multi-column indexes. That would probably hit a good chunk of our
problem areas.
Something related to this is providing estimates for functions (which
has been discussed in the past). There were numerous proposals there,
but the one that stuck in my head was allowing users to define functions
that would provide appropriate stats based on some input. Granted,
that's a pretty low-level construct, but it's more than we have now, and
would allow for better schemes to be built on top of it.
As for query hints, I really wish we'd just bite the bullet and add
them. Yes, they're far from perfect, yes, we should "just fix the
planner", yes, it's ugly that they're per-statement, but ultimately
sometimes you have to just flat-out tell the planner to do things a
certain way. I suspect enough time has been spent debating them since
7.2 that they could have been implemented by now.
Jim C. Nasby wrote:
Been suggested before... the problem is actually doing something useful
with all that data that's collected, as well as how to collect it
without greatly impacting the system.
Identifying the best plan by means of actually running multiple plans and timing
them is useful. That would be the point.
As far as "without greatly impacting the system", I don't think that is a real
concern. The whole idea is to greatly impact the system *once*, sometime when
the DBA doesn't mind impacting the system (like before you go live on a
production network, or between midnight and 3 AM, or whatever), and then store
the best plan for future use.
The planner trades off the desire to find the best plan against the need to
find a plan quickly. It also chooses a plan based on statistics and not based
on actual runtimes (because there is a chicken-and-egg problem: how do you know
which plan has the smallest runtime without running it?), so the chosen plan
that looks best based on statistics might not actually be best.
The idea I'm proposing circumvents the whole trade-off problem by explicitly
choosing to do something that makes the planner run really slowly and take a
really long time. But it doesn't do it "at runtime", in the sense that you
don't do it for each query. You just do it once up front and be done with it.
Of course, this is only useful for people with reasonably static queries and
reasonably static table statistics, so that a good plan found up-front continues
to be a good plan as it is repeatedly used.
My personal motivation is that I have tables whose statistics are quite static.
The data itself changes, but the statistical distribution from which the data
is pulled is unchanging, so the table statistics end up about the same even as
the data itself is added and deleted. On top of that, the planner keeps
choosing the wrong plan, which I know to be true because I can make individual
queries run faster by structuring them in ways that the planner can't see
through and "optimize" away the particular plan that I am effectively giving it.
But this is a PITA for me, especially since I don't always know what the best
plan might be and have to try them all until I find the right one. (With the
added complexity that I can't always figure out how to trick the planner into
choosing a specific plan, and hence can't test it.) It would be *so much
easier* to have an option to tell the planner to try them all.
mark
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Wed, Aug 09, 2006 at 02:02:10PM +0200, Martijn van Oosterhout wrote:
Perhaps the way to go would be to allow users to declare columns often
used together and have ANALYSE collect information on correlation which
can be used later...
One thing that would help tremendously would be to collect stats on
multi-column indexes. That would probably hit a good chunk of our
problem areas.
But it would specifically fail to cover join situations. I kinda like
Martijn's thought of allowing users to specify combinations of columns
to collect correlation stats about.
(Not sure how we'd implement that, seeing that ANALYZE currently works
on one table at a time, but it's probably doable --- and it'd fix the
fundamental problem for correlation statistics, which is how not to try
to collect stats about an exponential number of combinations ...)
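The combinatorial blow-up Tom mentions is easy to quantify (pure counting, no PostgreSQL specifics assumed):

```python
from math import comb

n_columns = 50  # a modest schema

# Just the two-column combinations is already sizeable...
pairs = comb(n_columns, 2)

# ...and every multi-column combination (subsets of size >= 2) is hopeless.
all_subsets = 2 ** n_columns - n_columns - 1

print(pairs, all_subsets)
```

This is why letting users name the handful of column combinations worth tracking is attractive: the DBA prunes the search space that ANALYZE could never cover exhaustively.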
regards, tom lane
(Not sure how we'd implement that, seeing that ANALYZE currently works
on one table at a time, but it's probably doable --- and it'd fix the
fundamental problem for correlation statistics, which is how not to try
to collect stats about an exponential number of combinations ...)
An exponential number of combinations? Is it possible to use FK
relationships to determine what tables & columns are likely to be used
in future joins?
Josh