Limiting the number of parameterized indexpaths created
I looked into the complaint of unreasonable planner runtime in bug #7626,
http://archives.postgresql.org/pgsql-bugs/2012-10/msg00232.php
In the given example, the indexed relation "foo" has join clauses with
30 other relations. The code that I added in commit
3b8968f25232ad09001bf35ab4cc59f5a501193e will try all 2^30 combinations
of those rels as possible outer relations for a parameterized indexscan
:-(. So clearly, the idea that we can just try everything and not have
some kind of heuristic restriction isn't going to work.
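The problem query is roughly of this shape (just an illustrative sketch; the
exact test case is in the bug report, and the names here simply follow the
ones used later in this thread):

  CREATE TABLE foo (id int PRIMARY KEY);
  CREATE TABLE ag1 (aid int);
  CREATE TABLE ag2 (aid int);
  -- ... and so on up to ag30 ...

  SELECT *
    FROM foo
    JOIN ag1 ON ag1.aid = foo.id
    JOIN ag2 ON ag2.aid = foo.id
    -- ... 28 more joins of the same form ...
    JOIN ag30 ON ag30.aid = foo.id;

Each join clause references a different other relation, so every nonempty
subset of {ag1 ... ag30} is in principle a candidate set of outer relations
for a parameterized indexscan on foo.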
This particular example, and probably a lot of similar examples, does
have a principled non-heuristic fix, which comes from the fact that we
recognize all the join clauses as belonging to the same equivalence
class. Therefore, using more than one of them in an indexscan is
useless. (The code already does know that and discards the extra
clauses, but that happens too far downstream to prevent the exponential
search time here.) The extra function eclass_already_used() added in
the attached patch attacks the problem this way, and is sufficient to
resolve the bug for the case presented.
However, we still have an issue for cases where the join clauses aren't
equivalence clauses. (For instance, if you just change all the "="
signs to "<" in Brian's example, it still takes forever.) So we also
need some heuristic rule to limit the number of cases considered.
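Spelled out, that variant is something like this (again only a sketch, not
the literal test case):

  SELECT *
    FROM foo
    JOIN ag1 ON ag1.aid < foo.id
    JOIN ag2 ON ag2.aid < foo.id
    -- ... and so on up to ag30 ...
    JOIN ag30 ON ag30.aid < foo.id;

Here the clauses are not mutually redundant, so the equivalence-class check
can't prune anything.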
I spent a good deal of time thinking about how the indexscan code might
make use of the joinlist data structure (which prevents exponential
planning time growth overall by limiting which joins we'll consider)
to fix this; the idea being to not consider outer-relation sets that
couldn't actually be presented for use because of joinlist restrictions.
But this looked messy, and probably expensive in itself. Moreover it
seemed like practical implementations of the idea would carry the
restriction that we could never generate parameterized paths for
joinlist subproblems, which would be a real shame. And we'd be doing a
lot of work for what is probably a very unusual corner case, given that
I think eclass_already_used() will fix the problem in most real cases.
So what I'm proposing instead, which is implemented in the other half of
the attached patch, is that we simply put an arbitrary limit on how many
outer-relation sets we'll consider while generating parameterized
indexscans. As written, the patch limits the number of relid sets to 10
times the number of join clauses it's working with, which is small
enough to keep the runtime of this test case to something reasonable.
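To put illustrative numbers on that:

  -- With 30 join clauses on the indexed column, the cap is 10 * 30 = 300
  -- relid sets, versus 2^30 - 1 nonempty subsets for an exhaustive search:
  SELECT (2 ^ 30)::bigint - 1 AS exhaustive_subsets,  -- 1073741823
         10 * 30              AS heuristic_cap;       -- 300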
I think that in practice this will still allow useful
combination-outer-rel cases to be found. I wonder though if anyone has
a better idea?
regards, tom lane
On 30 October 2012 21:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So what I'm proposing instead, which is implemented in the other half of
the attached patch, is that we simply put an arbitrary limit on how many
outer-relation sets we'll consider while generating parameterized
indexscans. As written, the patch limits the number of relid sets to 10
times the number of join clauses it's working with, which is small
enough to keep the runtime of this test case to something reasonable.
I think that in practice this will still allow useful
combination-outer-rel cases to be found. I wonder though if anyone has
a better idea?
That seems like a useful limit, but it is arbitrary and not
controllable by the user.
An example like that is actually fairly common when you put back
together a deep class hierarchy with one table per class. I'm a little
surprised the whole thing doesn't collapse anyway when using constants
as shown, since aid = 2 AND aid = 3 will return no rows. But I guess
that's not really important.
We should be able to do a better job of recognising some other
aspect/symmetry of this situation and then simplifying the problem
from that. Calculating all paths treats the problem as a complete
unknown and we must be able to do better than that. If we look at the
problem from the perspective of how we would handle it if one of the
tables was very large, how would that change things? Can we use a
greedy algorithm starting with the largest table?
This is hand-waving, but it is useful to discuss these things. I'd be
happy if you could give a fuller explanation of this issue so we can
all learn.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
We should be able to do a better job of recognising some other
aspect/symmetry of this situation and then simplifying the problem
from that. Calculating all paths treats the problem as a complete
unknown and we must be able to do better than that. If we look at the
problem from the perspective of how we would handle it if one of the
tables was very large, how would that change things? Can we use a
greedy algorithm starting with the largest table?
Doesn't seem particularly relevant. The issue is which parameterized
paths to generate for the current index of the current table. It could
possibly make sense to try harder if the current table is large than
if it's small --- but no matter how large it is, we can't afford to
enumerate all 2^N possibilities if we have join clauses linking to N
other relations and N is more than a handful.
The reason this is a problem is that you might have indexes, or even
individual join clauses, that require multiple other rels to be used
effectively. Suppose we have an index t1ab on t1(a, b) and a query
WHERE t1.a = t2.x AND t1.b = t3.y
If t1 is large while t2 and t3 are both small, the best plan is to join
t2 and t3 first (even if that's a cartesian join) and then use both t2.x
and t3.y as parameters for an indexscan on t1ab. The same observation
can apply even for single index columns and single join clauses:
WHERE t1.a = (t2.x + t3.y)
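Spelled out as SQL (the schema here is only an illustrative sketch):

  CREATE TABLE t1 (a int, b int);
  CREATE TABLE t2 (x int);
  CREATE TABLE t3 (y int);
  CREATE INDEX t1ab ON t1 (a, b);

  SELECT * FROM t1, t2, t3
   WHERE t1.a = t2.x AND t1.b = t3.y;

  -- or, with a single join clause that still needs both outer rels:
  SELECT * FROM t1, t2, t3
   WHERE t1.a = (t2.x + t3.y);

  -- In either form, if t1 is large and t2, t3 are small, the good plan
  -- joins t2 and t3 first (a cartesian join here) and then probes t1ab
  -- with values from both, which requires an indexscan path for t1 that
  -- is parameterized by both t2 and t3.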
So the context where this is an issue is that we're considering the
specific index t1ab and wondering what parameterized indexscan paths
to create using it. To find a good plan in these examples, we have to
create a path that's parameterized by both t2 and t3. It's not so hard
to consider that possibility here, but what if there are 30 different
join clauses (linking to 30 other rels) that could be used with the same
index? We can't afford to try all the combinations exhaustively, and
even if we could, it's just about certain that most of them aren't worth
our time.
Before 9.2, we didn't have this problem in this guise because we didn't
generate parameterized paths bottom-up; instead, given that we had
already decided to join t2 and t3 for some reason, we would look to see
what indexpaths we could make for t1 using values from t2 and/or t3.
However, that approach had big problems of its own: it couldn't find
plans that involved pushing parameters down through an intervening join.
Another point here is that no such path can get chosen unless the
planner decides to consider joining t2 and t3 to begin with --- and if
that's a cartesian join, it will reject the idea out of hand. In the
previous coding there was basically no realistic fix for that; if we let
the thing try useless cartesian joins, planning time on large queries
will go to hell in a handbasket. With the new code, we could in
principle make note of interesting parameterized paths that require
multiple outer relations, and then allow those specific combinations of
relations to get cartesian-joined. I've not had time to pursue this yet
but I think it should happen sometime.
So I think bottom-up creation of parameterized paths is better than what
we were doing pre-9.2, but we do have to stop it from going crazy when
there are too many join clauses for the same index.
Now the $64 question is how hard do we have to try to find the "best"
alternative when there are too many. I think queries in which there are
a lot of join clauses touching the same index but they are not related
by equivalence classes are probably pretty pathological. It's
questionable whether we should spend a lot of skull sweat on such cases,
or for that matter that we'd even recognize an especially good solution
if we saw it.
One idea that occurred to me while typing this up is to restrict the
maximum number of other-rels that we'll consider together, ie, when we
have 30 other rels, don't try to form all 2^30 subsets but only those
containing at most K rels, for some pretty small K (say, the number of
index columns). But I'm not really sure it's worth the effort.
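For scale, illustrative arithmetic only:

  -- With N = 30 other rels and K = 2 index columns, limiting outer-rel
  -- sets to at most K members leaves C(30,1) + C(30,2) candidates instead
  -- of 2^30 - 1:
  SELECT 30 + (30 * 29) / 2   AS sets_of_at_most_two_rels,  -- 465
         (2 ^ 30)::bigint - 1 AS all_nonempty_subsets;      -- 1073741823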
regards, tom lane
On 31 October 2012 19:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Before 9.2, we didn't have this problem in this guise because we didn't
generate parameterized paths bottom-up; instead, given that we had
already decided to join t2 and t3 for some reason, we would look to see
what indexpaths we could make for t1 using values from t2 and/or t3.
However, that approach had big problems of its own: it couldn't find
plans that involved pushing parameters down through an intervening join.
Given that most join patterns are repeated consistently across
multiple SQL statements, is it possible to pre-calculate paths at
ANALYZE time, or perhaps cache them (persistently?) when first
generated? That way we just look up best paths, which would make join
planning much faster.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 30, 2012 at 5:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked into the complaint of unreasonable planner runtime in bug #7626,
http://archives.postgresql.org/pgsql-bugs/2012-10/msg00232.php
In the given example, the indexed relation "foo" has join clauses with
30 other relations. The code that I added in commit
3b8968f25232ad09001bf35ab4cc59f5a501193e will try all 2^30 combinations
of those rels as possible outer relations for a parameterized indexscan
:-(. So clearly, the idea that we can just try everything and not have
some kind of heuristic restriction isn't going to work.
You know, when I read this, my first thought was ... why is this an
exponential relationship instead of a linear one? Even now, I'm not
sure I quite understand that. With a parameterized path, we get an
index scan (or index-only scan) with a.id taking its value from some
outer scan, but it can't take its value from more than one outer scan.
Can it? So what does it mean to parameterize the scan of foo by both
ag1 (aid) and ag2 (aid)?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Oct 30, 2012 at 5:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked into the complaint of unreasonable planner runtime in bug #7626,
http://archives.postgresql.org/pgsql-bugs/2012-10/msg00232.php
You know, when I read this, my first thought was ... why is this an
exponential relationship instead of a linear one?
Because it's considering *combinations* of outer relations for a
parameterized scan. For instance consider an index on t(a,b)
and a query
WHERE t.a = x.c1 AND t.b = y.c2
There are three different parameterized paths we could create: one
relying on x only, one relying on y only, one relying on both.
The one relying on y only is probably going to suck, if this is a
btree index, but at the level we're working at here that's not yet
apparent. The other two are definitely both worthy of consideration,
since it might or might not be worth it to join x and y first in order
to use both conditions in scanning t.
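Concretely (the schema and index name are an illustrative sketch):

  CREATE TABLE t (a int, b int);
  CREATE TABLE x (c1 int);
  CREATE TABLE y (c2 int);
  CREATE INDEX t_a_b ON t (a, b);

  SELECT * FROM t, x, y
   WHERE t.a = x.c1 AND t.b = y.c2;

  -- Candidate parameterized scans of t using the index:
  --   1. parameterized by {x}:    index condition t.a = x.c1
  --   2. parameterized by {y}:    index condition t.b = y.c2 alone
  --                               (probably poor for btree, since it does
  --                               not constrain the leading column)
  --   3. parameterized by {x, y}: both clauses as index conditions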
So in general, given join clauses that reference N different outer
relations, you could have as many as 2^N-1 sets of outer relations that
could possibly generate usefully-different parameterized paths. In
practice, since all these clauses must be usable with the same index,
there's probably not nearly that many useful combinations --- but again,
it's hard to know exactly which ones are winners in advance of doing any
cost calculations.
regards, tom lane
On Mon, Nov 5, 2012 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Oct 30, 2012 at 5:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked into the complaint of unreasonable planner runtime in bug #7626,
http://archives.postgresql.org/pgsql-bugs/2012-10/msg00232.php
You know, when I read this, my first thought was ... why is this an
exponential relationship instead of a linear one?
Because it's considering *combinations* of outer relations for a
parameterized scan. For instance consider an index on t(a,b)
and a query
WHERE t.a = x.c1 AND t.b = y.c2
There are three different parameterized paths we could create: one
relying on x only, one relying on y only, one relying on both.
Sure, but that example is different from the test case provided in the
bug report. I agree that here we need to try paths parameterized by
a, b, or both a and b. Things might blow up multiplicatively, because
we have join clauses referencing both t.a and t.b. But they shouldn't
blow up exponentially, because each of t.a and t.b can only be
parameterized by ONE thing (I think). And in the example in the bug
report, only one column of the table (foo.id) is mentioned. foo.id
can be driven by ag1.aid OR ag2.aid OR ag3.aid OR ..., but not more
than one of those at a time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Nov 5, 2012 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There are three different parameterized paths we could create: one
relying on x only, one relying on y only, one relying on both.
Sure, but that example is different from the test case provided in the
bug report. I agree that here we need to try paths parameterized by
a, b, or both a and b. Things might blow up multiplicatively, because
we have join clauses referencing both t.a and t.b. But they shouldn't
blow up exponentially, because each of t.a and t.b can only be
parameterized by ONE thing (I think).
Um, no. This is a useful counterexample:
WHERE t.a > x.c1 AND t.a < y.c2
With a range constraint like this one, it's possible for the
doubly-parameterized path to be quite useful while either
singly-parameterized path is basically useless. And these examples
aren't even going into cases you might get with non-btree indexes,
where clauses could interact in much more complicated ways.
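As SQL (again an illustrative sketch):

  CREATE TABLE t (a int);
  CREATE INDEX t_a ON t (a);
  CREATE TABLE x (c1 int);
  CREATE TABLE y (c2 int);

  SELECT * FROM t, x, y
   WHERE t.a > x.c1 AND t.a < y.c2;

  -- Parameterized by both x and y, an indexscan on t can use the closed
  -- range x.c1 < t.a < y.c2; parameterized by only one of them, it gets
  -- an open-ended range and may have to read most of the index.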
And in the example in the bug
report, only one column of the table (foo.id) is mentioned. foo.id
can be driven by ag1.aid OR ag2.aid OR ag3.aid OR ..., but not more
than one of those at a time.
In the example, we do figure out that the clauses are redundant, but
only further downstream. The code that's got the problem can't really
assume such a thing. As patched, it will indeed limit what it considers
to at most one additional clause per index column, once it's hit the
heuristic limit --- but it's entirely possible for it to miss useful
combinations because of that.
regards, tom lane
On Mon, Nov 5, 2012 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Nov 5, 2012 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There are three different parameterized paths we could create: one
relying on x only, one relying on y only, one relying on both.
Sure, but that example is different from the test case provided in the
bug report. I agree that here we need to try paths parameterized by
a, b, or both a and b. Things might blow up multiplicatively, because
we have join clauses referencing both t.a and t.b. But they shouldn't
blow up exponentially, because each of t.a and t.b can only be
parameterized by ONE thing (I think).
Um, no. This is a useful counterexample:
WHERE t.a > x.c1 AND t.a < y.c2
With a range constraint like this one, it's possible for the
doubly-parameterized path to be quite useful while either
singly-parameterized path is basically useless. And these examples
aren't even going into cases you might get with non-btree indexes,
where clauses could interact in much more complicated ways.
Well, OK. So maybe you also need the operator to be the same as well.
And in the example in the bug
report, only one column of the table (foo.id) is mentioned. foo.id
can be driven by ag1.aid OR ag2.aid OR ag3.aid OR ..., but not more
than one of those at a time.
In the example, we do figure out that the clauses are redundant, but
only further downstream. The code that's got the problem can't really
assume such a thing. As patched, it will indeed limit what it considers
to at most one additional clause per index column, once it's hit the
heuristic limit --- but it's entirely possible for it to miss useful
combinations because of that.
Seems unfortunate, but I don't understand the code well enough to know
how to do better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Nov 5, 2012 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Um, no. This is a useful counterexample:
WHERE t.a > x.c1 AND t.a < y.c2
Well, OK. So maybe you also need the operator to be the same as well.
Nope. A counterexample to that claim is a GIN index on an array column:
WHERE t.arraycol @> array[1,2,3] AND t.arraycol @> array[4,5,6]
This restriction is equivalent to
WHERE t.arraycol @> array[1,2,3,4,5,6]
which is substantially more selective than either constraint alone.
If the two RHS arrays are not constants, but are coming from different
tables x and y, then we have something isomorphic to the previous
example (at least from the perspective of indxpath.c), but it would
not be good for indxpath.c to assume that these clauses couldn't be
useful together.
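In SQL terms the shape is something like this (the column names for x and y
are made up for the sketch):

  CREATE TABLE t (arraycol int[]);
  CREATE INDEX t_arraycol_gin ON t USING gin (arraycol);
  CREATE TABLE x (arr int[]);
  CREATE TABLE y (arr int[]);

  SELECT * FROM t, x, y
   WHERE t.arraycol @> x.arr
     AND t.arraycol @> y.arr;

  -- The two containment clauses together are substantially more selective
  -- than either one alone, so indxpath.c shouldn't assume the combination
  -- of outer rels {x, y} is useless for scanning this index.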
We *can* make a simplifying assumption of the kind you suggest when
we know that the clauses were all generated from the same equivalence
class, because then we have very strong assumptions about what the
clauses' semantics are. (And indeed the patch does take care of that
case separately.) But for the general case of non-equijoin clauses
we can't assume very much at all about whether clauses are redundant,
at least not without knowledge that indxpath.c hasn't got.
.... As patched, it will indeed limit what it considers
to at most one additional clause per index column, once it's hit the
heuristic limit --- but it's entirely possible for it to miss useful
combinations because of that.
Seems unfortunate, but I don't understand the code well enough to know
how to do better.
Me either. What I will say is that as patched, the code will still
find all useful clause combinations as long as there aren't too many
other relations involved. I've not been able to think of another way
of restricting the search that doesn't reject possibly-useful
combinations even in otherwise-very-simple queries.
regards, tom lane
On Mon, Nov 5, 2012 at 3:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Nov 5, 2012 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Um, no. This is a useful counterexample:
WHERE t.a > x.c1 AND t.a < y.c2
Well, OK. So maybe you also need the operator to be the same as well.
Nope. A counterexample to that claim is a GIN index on an array column:
WHERE t.arraycol @> array[1,2,3] AND t.arraycol @> array[4,5,6]
This restriction is equivalent to
WHERE t.arraycol @> array[1,2,3,4,5,6]
which is substantially more selective than either constraint alone.
If the two RHS arrays are not constants, but are coming from different
tables x and y, then we have something isomorphic to the previous
example (at least from the perspective of indxpath.c), but it would
not be good for indxpath.c to assume that these clauses couldn't be
useful together.
Neat example.
We *can* make a simplifying assumption of the kind you suggest when
we know that the clauses were all generated from the same equivalence
class, because then we have very strong assumptions about what the
clauses' semantics are. (And indeed the patch does take care of that
case separately.) But for the general case of non-equijoin clauses
we can't assume very much at all about whether clauses are redundant,
at least not without knowledge that indxpath.c hasn't got.
OK. Fortunately, I don't think we need to care too much about that
case, since non-equijoins are pretty rare. A reasonable heuristic
restriction seems fine for that case ... at least until the next
problem case shows up.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company