Re: disfavoring unparameterized nested loops

Started by Benjamin Coutu over 3 years ago, 24 messages
#1 Benjamin Coutu
ben.coutu@zeyos.com

I'd like to revamp this important discussion.

As is well described in this fairly recent paper here https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres) "estimation errors quickly grow as the number of joins increases, and that these errors are usually the reason for bad plans" - I think we can all get behind that statement.

While nested loop joins work great when cardinality estimates are correct, they are notoriously bad when the optimizer underestimates and they degrade very fast in such cases - the good vs. bad here is very asymmetric. On the other hand, hash joins degrade much more gracefully - they are considered very robust against underestimation. The above-mentioned paper illustrates that all major DBMSs (including Postgres) tend to underestimate, and usually that underestimation increases drastically with the number of joins (see Figures 3+4 of the paper).

Now, a simple approach to guarding against bad plans that arise from underestimation could be to use what I would call a nested-loop-conviction-multiplier based on the current depth of the join tree, e.g. for a base table that multiplier would obviously be 1, but would then grow (e.g.) quadratically. That conviction-multiplier would *NOT* be used to skew the cardinality estimates themselves, but rather be applied to the overall nested loop join cost at each particular stage of the plan when comparing it to other more robust join strategies like hash or sort-merge joins. That way when we can be sure to have a good estimate at the bottom of the join tree we treat all things equal, but favor nested loops less and less as we move up the join tree for the sake of robustness.
Also, we can expand the multiplier whenever we fall back to using the default cardinality constant, as surely all bets are off at that point - we should definitely treat nested loop joins as out of favor in this instance, and that could easily be incorporated by simply increasing the conviction-multiplier.
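Just to make the shape of the idea concrete, here is a toy, standalone C sketch (the names and the particular quadratic formula are purely illustrative assumptions of mine, not planner code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Scale only the nested loop join cost by a factor that grows with the
 * depth of the join tree, leaving the cardinality estimates and the costs
 * of hash/merge joins untouched.  Depth 0 is a base relation (multiplier 1),
 * and each join level above it grows the factor quadratically.
 */
static double
nestloop_conviction_multiplier(int join_depth, bool used_default_selectivity)
{
    double  mult = 1.0 + (double) join_depth * (double) join_depth;

    /* Be extra suspicious when we fell back to a default selectivity. */
    if (used_default_selectivity)
        mult *= 10.0;

    return mult;
}

int
main(void)
{
    double  raw_nestloop_cost = 1000.0; /* whatever the cost model says */

    for (int depth = 0; depth <= 4; depth++)
        printf("depth %d: compare other join paths against %.0f\n",
               depth,
               raw_nestloop_cost * nestloop_conviction_multiplier(depth, false));
    return 0;
}

So hash and merge join paths would keep their raw costs, and only the nested loop path would be handicapped more and more as we climb the join tree.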

What are your thoughts on this simple idea - is it perhaps too simple?

Cheers, Ben

#2 Peter Geoghegan
pg@bowt.ie
In reply to: Benjamin Coutu (#1)

On Thu, Sep 29, 2022 at 7:32 AM Benjamin Coutu <ben.coutu@zeyos.com> wrote:

Also, we can expand the multiplier whenever we fall back to using the default cardinality constant, as surely all bets are off at that point - we should definitely treat nested loop joins as out of favor in this instance, and that could easily be incorporated by simply increasing the conviction-multiplier.

What are your thoughts on this simple idea - is it perhaps too simple?

Offhand I'd say it's more likely to be too complicated. Without
meaning to sound glib, the first question that occurs to me is "will
we also need a conviction multiplier conviction multiplier?". Anything
like that is going to have unintended consequences that might very
well be much worse than the problem that you set out to solve.

Personally I still like the idea of just avoiding unparameterized
nested loop joins altogether when an "equivalent" hash join plan is
available. I think of it as preferring the hash join plan because it
will have virtually the same performance characteristics when you have
a good cardinality estimate (probably very often), but far better
performance characteristics when you don't. We can perhaps be
approximately 100% sure that something like that will be true in all
cases, no matter the details. That seems like a very different concept
to what you've proposed.
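To be concrete about the kind of rule I mean, here is a standalone sketch (made-up names and a hypothetical path descriptor, not actual joinpath.c code):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified join path descriptor -- not a PostgreSQL struct. */
typedef struct
{
    double  total_cost;
    bool    inner_is_parameterized; /* inner side depends on the outer row? */
} JoinCandidate;

/*
 * If an "equivalent" hash join over the same pair of relations is available,
 * don't even generate the unparameterized nested loop path, regardless of
 * how its estimated cost compares.
 */
static bool
should_generate_nestloop(const JoinCandidate *nestloop, bool hashjoin_possible)
{
    if (!nestloop->inner_is_parameterized && hashjoin_possible)
        return false;           /* prefer the hash join by fiat */
    return true;                /* otherwise, let the cost model decide */
}

int
main(void)
{
    JoinCandidate nl = {500.0, false};

    printf("generate unparameterized nestloop? %s\n",
           should_generate_nestloop(&nl, true) ? "yes" : "no");
    return 0;
}

The point being that nothing about the rule depends on trusting the estimates.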

--
Peter Geoghegan

#3 Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#2)

On Thu, Sep 29, 2022 at 04:12:06PM -0700, Peter Geoghegan wrote:

Offhand I'd say it's more likely to be too complicated. Without
meaning to sound glib, the first question that occurs to me is "will
we also need a conviction multiplier conviction multiplier?". Anything
like that is going to have unintended consequences that might very
well be much worse than the problem that you set out to solve.

Personally I still like the idea of just avoiding unparameterized
nested loop joins altogether when an "equivalent" hash join plan is
available. I think of it as preferring the hash join plan because it
will have virtually the same performance characteristics when you have
a good cardinality estimate (probably very often), but far better
performance characteristics when you don't. We can perhaps be
approximately 100% sure that something like that will be true in all
cases, no matter the details. That seems like a very different concept
to what you've proposed.

I think the point the original poster was making, and I have made in the
past, is that even if two optimizer costs are the same, one might be
more penalized by misestimation than the other, and we don't have a good
way of figuring that into our plan choices.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#4 Peter Geoghegan
pg@bowt.ie
In reply to: Bruce Momjian (#3)

On Thu, Sep 29, 2022 at 4:27 PM Bruce Momjian <bruce@momjian.us> wrote:

I think the point the original poster was making, and I have made in the
past, is that even if two optimizer costs are the same, one might be
more penalized by misestimation than the other, and we don't have a good
way of figuring that into our plan choices.

Right. But that seems fraught with difficulty. I suspect that the
costs that the planner attributes to each plan often aren't very
reliable in any absolute sense, even when everything is working very
well by every available metric. Even a very noisy cost model with
somewhat inaccurate selectivity estimates will often pick the cheapest
plan, or close enough.

Having a cost-based optimizer that determines the cheapest plan quite
reliably is one thing. Taking the same underlying information and
adding the dimension of risk to it and expecting a useful result is
quite another -- that seems far harder.

--
Peter Geoghegan

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)

Bruce Momjian <bruce@momjian.us> writes:

I think the point the original poster was making, and I have made in the
past, is that even if two optimizer costs are the same, one might be
more penalized by misestimation than the other, and we don't have a good
way of figuring that into our plan choices.

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right. "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost. Trying to add such
policy with no data to go on is not going to work well.
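As a toy illustration of what propagating error bars through a single join level could look like (the types, names, and numbers are invented for the example, nothing more):

#include <stdio.h>

/* Toy rowcount-with-error-bars type, invented for this example. */
typedef struct
{
    double  est;    /* point estimate of the row count */
    double  lo;     /* lower bound */
    double  hi;     /* upper bound */
} RowsWithBounds;

/*
 * Propagate bounds through one inner join, assuming the join selectivity
 * itself also carries bounds.  The intervals multiply, so they widen with
 * every join level -- which is what a policy based on the upper-bound cost
 * would then have to work with.
 */
static RowsWithBounds
join_rows(RowsWithBounds outer, RowsWithBounds inner,
          double sel_est, double sel_lo, double sel_hi)
{
    RowsWithBounds result;

    result.est = outer.est * inner.est * sel_est;
    result.lo = outer.lo * inner.lo * sel_lo;
    result.hi = outer.hi * inner.hi * sel_hi;
    return result;
}

int
main(void)
{
    RowsWithBounds t1 = {1000, 900, 1100};
    RowsWithBounds t2 = {5000, 4000, 6000};
    RowsWithBounds joined = join_rows(t1, t2, 0.001, 0.0005, 0.002);

    printf("estimate %.0f rows, bounds [%.0f, %.0f]\n",
           joined.est, joined.lo, joined.hi);
    return 0;
}

Even with fairly tight base-table bounds, the interval spans most of an order of magnitude after a single join, and it only gets wider from there.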

I think Peter's point is that a quick-n-dirty patch is likely to make
as many cases worse as it makes better. That's certainly my opinion
about the topic.

regards, tom lane

#6 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#5)

On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

I think the point the original poster was making, and I have made in the
past, is that even if two optimizer costs are the same, one might be
more penalized by misestimation than the other, and we don't have a good
way of figuring that into our plan choices.

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right. "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost. Trying to add such
policy with no data to go on is not going to work well.

I think Peter's point is that a quick-n-dirty patch is likely to make
as many cases worse as it makes better. That's certainly my opinion
about the topic.

Agreed on all points --- I was thinking error bars too.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#7 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#5)

On Thu, Sep 29, 2022 at 4:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right. "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost. Trying to add such
policy with no data to go on is not going to work well.

In general I suspect that we'd be better off focussing on mitigating
the impact at execution time. There are at least a few things that we
could do there, at least in theory. Mostly very ambitious, long term
things.

I like the idea of just avoiding unparameterized nested loop joins
altogether when an "equivalent" hash join plan is available because
it's akin to an execution-time mitigation, despite the fact that it
happens during planning. While it doesn't actually change anything in
the executor, it is built on the observation that we have virtually
everything to gain and nothing to lose during execution, no matter
what happens.

It seems like a very small oasis of certainty in a desert of
uncertainty -- which seems nice, as far as it goes.

I think Peter's point is that a quick-n-dirty patch is likely to make
as many cases worse as it makes better. That's certainly my opinion
about the topic.

Right. Though I am actually sympathetic to the idea that users might
gladly pay a cost for performance stability -- even a fairly large
cost. That part doesn't seem like the problem.

--
Peter Geoghegan

#8 Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#6)

On Thu, Sep 29, 2022 at 07:51:47PM -0400, Bruce Momjian wrote:

On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote:

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right. "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost. Trying to add such
policy with no data to go on is not going to work well.

I think Peter's point is that a quick-n-dirty patch is likely to make
as many cases worse as it makes better. That's certainly my opinion
about the topic.

Agreed on all points --- I was thinking error bars too.

Actually, if we wanted to improve things in this area, we should have a
set of queries that don't choose optimal plans that we can test with. We used
to see them a lot before we had extended statistics, but I don't
remember seeing many recently, let alone a collection of them. I guess
that is good.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#9 David Rowley
dgrowleyml@gmail.com
In reply to: Peter Geoghegan (#7)

On Fri, 30 Sept 2022 at 13:06, Peter Geoghegan <pg@bowt.ie> wrote:

I like the idea of just avoiding unparameterized nested loop joins
altogether when an "equivalent" hash join plan is available because
it's akin to an execution-time mitigation, despite the fact that it
happens during planning. While it doesn't actually change anything in
the executor, it is built on the observation that we have virtually
everything to gain and nothing to lose during execution, no matter
what happens.

I'm not sure if it's a good idea to assume that performing
non-parameterised Nested Loops when we shouldn't is the only shape of
plan that causes us problems.

We also have the case where we assume early start-up plans are
favourable. For example:

SELECT * FROM t WHERE a = 1 ORDER BY b LIMIT 10;

where we have two indexes, one on t(a) and another on t(b).

Should we use the t(b) index and filter out the rows that don't match
a = 1 and hope we get 10 a=1 rows soon in the t(b) index? Or do we use
t(a) and then perform a sort? The best case for using the t(b) index is
that we find 10 a=1 rows in the first 10 rows of the index scan; the
worst case is that there are no rows with a=1.
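To put some toy numbers on it (entirely made up, and assuming a uniform distribution of a=1 rows, which is exactly the assumption that can burn us):

#include <math.h>
#include <stdio.h>

/*
 * Toy model of the two plan shapes for
 *     SELECT * FROM t WHERE a = 1 ORDER BY b LIMIT 10;
 * The numbers and formulas are illustrative only, not the planner's.
 */
int
main(void)
{
    double  table_rows = 1000000.0;
    double  limit = 10.0;
    double  matching = 100.0;   /* estimated number of a = 1 rows */

    /*
     * Plan A: walk t(b) in order, filter on a = 1, stop after 10 matches.
     * Under a uniformity assumption we expect to visit limit / selectivity
     * rows before stopping ...
     */
    double  selectivity = matching / table_rows;
    double  plan_a_expected = limit / selectivity;

    /* ... but if no rows match, we walk the entire index. */
    double  plan_a_worst = table_rows;

    /* Plan B: fetch the ~100 matching rows via t(a), then sort them. */
    double  plan_b = matching + matching * log2(matching);

    printf("plan A: expected %.0f rows visited, worst case %.0f\n",
           plan_a_expected, plan_a_worst);
    printf("plan B: roughly %.0f row visits plus comparisons\n", plan_b);
    return 0;
}

Even when the estimate of matching rows is spot on, the early start-up plan only pays off when the a=1 rows are plentiful or happen to show up early in the t(b) order; when they don't, or when there are none at all, it loses by orders of magnitude.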

Having something coded into the cost model is a more generic way of
addressing this issue. Providing we design the cost model correctly,
we'd be able to address future issues we discover using whichever
cost model infrastructure we design for this.

I understand that what you propose would be a fast way to fix this
issue. However, if we went and changed the join path creation code to
not add non-parameterised nested loop paths when other paths exist,
then how could we ever dare to put that code back again when we come
up with a better solution?

David

#10 Benjamin Coutu
ben.coutu@zeyos.com
In reply to: Peter Geoghegan (#4)

Right. But that seems fraught with difficulty. I suspect that the
costs that the planner attributes to each plan often aren't very
reliable in any absolute sense, even when everything is working very
well by every available metric. Even a very noisy cost model with
somewhat inaccurate selectivity estimates will often pick the cheapest
plan, or close enough.

Sure, the absolute cost of a complex plan will always be inaccurate at best.
My point is that we can be very confident in the cardinalities of base tables. As the paper states in "3.1. Estimates for Base Tables":

"The median q-error is close to the optimal value of 1 for all systems,
indicating that the majority of all selections are estimated correctly."

Thanks to the statistics, we will practically never be off by an order of magnitude when estimating base table cardinalities.

The paper also clearly shows (and that certainly coincides with my experience) that those cardinality underestimations grow exponentially as they propagate up the join tree.

Given the research, I'd stipulate that at any given level of the join tree, the current depth is a reasonable indicator of underestimation. Taking that into account (even if only to mitigate nested loops on higher levels) is IMV a principled approach, and not necessarily a hack.

Obviously having something like error bars as proposed by Tom would be even better and perhaps more general, but that is on a whole different level in terms of complexity and I certainly have no idea how we would easily get there.

#11 Benjamin Coutu
ben.coutu@zeyos.com
In reply to: Peter Geoghegan (#7)

In general I suspect that we'd be better off focussing on mitigating
the impact at execution time. There are at least a few things that we
could do there, at least in theory. Mostly very ambitious, long term
things.

I think these things are orthogonal. No matter how good the cost model ever gets, we will always have degenerate cases.
Having some smarts about that in the executor is surely a good thing, but it shouldn't distract us from improving on the planner front.

I like the idea of just avoiding unparameterized nested loop joins
altogether when an "equivalent" hash join plan is available because
it's akin to an execution-time mitigation, despite the fact that it
happens during planning. While it doesn't actually change anything in
the executor, it is built on the observation that we have virtually
everything to gain and nothing to lose during execution, no matter
what happens.

I agree with you that those plans are too risky. But let's maybe find a more general way of dealing with this.


Right. Though I am actually sympathetic to the idea that users might
gladly pay a cost for performance stability -- even a fairly large
cost. That part doesn't seem like the problem.

#12 Benjamin Coutu
ben.coutu@zeyos.com
In reply to: Tom Lane (#5)

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right. "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost. Trying to add such
policy with no data to go on is not going to work well.

Error bars would be fantastic, no question. But that would make things very complex.
A lot of judgment calls would be necessary for the policy behind upper-bound pruning, picking up on Peter's comment about "conviction multiplier of conviction multiplier" ;)
Also, the math in deriving those bounds based on the stats and how they propagate up the join tree doesn't seem trivial either.

I think Peter's point is that a quick-n-dirty patch is likely to make
as many cases worse as it makes better. That's certainly my opinion
about the topic.

As in my reply to Peter, I think the join level/depth metric is a simple but principled way of dealing with it, given the referenced research.
In the first step, we'd use this merely to be more risk-averse towards nested loop joins as we climb up the join tree - we are not fiddling with the cost model itself, nor the join ordering, just when it comes to considering that particular join algorithm. Later this could be expanded to be more broadly scoped.

Please don't give up on a simple way to reap most of the fruits just yet.

#13 Benjamin Coutu
ben.coutu@zeyos.com
In reply to: Bruce Momjian (#8)

Actually, if we wanted to improve things in this area, we should have a
set of queries that don't choose optimal plans that we can test with. We used
to see them a lot before we had extended statistics, but I don't
remember seeing many recently, let alone a collection of them. I guess
that is good.

In the VLDB paper they actually created their own "Join Order Benchmark", which is publicly available at https://github.com/gregrahn/join-order-benchmark.
It would probably be better suited for this kind of testing than, e.g., the TPC benchmarks.

If there is interest, I could also compile a set of relevant cases based on the message history of the performance mailing list.

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)

On Thu, Sep 29, 2022 at 7:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right. "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost. Trying to add such
policy with no data to go on is not going to work well.

I think that the point of the paper which started the thread that this
discussion branched from was essentially that trying to add such a
policy with no data to go on worked extremely well in practice, and
other database systems are already doing it, and we're hurting
ourselves by not doing it. And specifically what they did was to
disfavor unparameterized nested loops.

And I think that actually makes a lot of sense. If you think through
parameterized nested loops, unparameterized nested loops, hash joins,
and merge joins, which is basically all the strategies, and you
imagine having many more or fewer rows on one side of the join or the
other than you thought, unparameterized nested loops are the standout
case. It's the only kind of join where the expense grows as the
product of the sizes of the two inputs. The other cases tend to be
more like O(n+m) rather than O(nm), and even if there are some lg n or
lg m terms in there too they don't tend to make a whole lot of
difference in practice.

So I think what you would find if you did all of this analysis is
that, basically, every time you did the cost computation for a
parameterized nested loop, a hash join, or a merge join, the error
bars would be whatever they were, and then when you did a cost
computation for an unparameterized nested loop, the error bars would
be way worse. Like, if we assume that the estimates for each side of a
hash join are off by 100x, then the cost will be off by roughly 100x
if it still fits in work_mem and by several hundred x if it now spills
to disk. But if we assume the same thing for an unparameterized nested
loop, the cost is now off by 10000x. And that is massively, massively
more, so clearly we should avoid the unparameterized nested loop. But
it's unnecessary to do this computation at runtime for every separate
unparameterized nested loop: we can do it right here, in a generic
way, for every such loop.
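Spelling that arithmetic out with crude stand-in cost formulas (hash join ~ n + m, unparameterized nested loop ~ n * m; the numbers are arbitrary):

#include <stdio.h>

int
main(void)
{
    double  n_est = 1000.0;
    double  m_est = 1000.0;
    double  err = 100.0;        /* both inputs underestimated by 100x */
    double  n_real = n_est * err;
    double  m_real = m_est * err;

    /* crude stand-ins: hash join ~ n + m, unparameterized nestloop ~ n * m */
    double  hash_factor = (n_real + m_real) / (n_est + m_est);
    double  nestloop_factor = (n_real * m_real) / (n_est * m_est);

    printf("hash join cost off by %.0fx\n", hash_factor);                /* 100x */
    printf("unparameterized nestloop cost off by %.0fx\n", nestloop_factor); /* 10000x */
    return 0;
}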

Now, it's true that the actual risk depends on how certain we are of
the estimates for the input rels, but that's difficult to quantify and
I'm not convinced it really matters at all. Like, if the input is a
base relation with no filter condition, then the error should be
small, unless the table size changes a lot between planning and
execution. If it's the output of a user-defined aggregate, the error
could be really, really large. But that has no impact on the
*relative* dangers of the unparameterized nested loop vs. some other
join method. If join A is between two input rels whose sizes are
probably known fairly precisely, and join B is between two input rels
whose sizes we might be drastically wrong about, then a hash join is
riskier for join B than it is for join A. But that really does not
matter. What *does* matter is that an unparameterized nested loop is
riskier for join A than a hash join is for join A; and likewise for
join B.

I think we're kind of just making life complicated for ourselves here
by pretending that unparameterized nested loops are part of some
general class of uncertainty problems that we need to worry about. In
some sense, they are, and David Rowley is right to mention the other
one that comes up pretty frequently. But like that list of two is
pretty much the whole list. I think we've talked ourselves into
believing that this problem is much harder than it really is. Maybe a
blanket ban on unparameterized nested loops is too strong (or maybe
it's exactly the right thing) but it can't possibly be wrong to think
about that case in particular as something we need to solve. It's the
only join method that can go quadratic in easy, realistic scenarios --
and it often does.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15 Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#14)

On Fri, Sep 30, 2022 at 10:43 AM Robert Haas <robertmhaas@gmail.com> wrote:

But it's unnecessary to do this computation at runtime for every separate
unparameterized nested loop: we can do it right here, in a generic
way, for every such loop.

It's not just that the risks are ludicrously high, of course. The
potential benefits must *also* be very low. It's both factors,
together.

Now, it's true that the actual risk depends on how certain we are of
the estimates for the input rels, but that's difficult to quantify and
I'm not convinced it really matters at all.

We're talking about a problem that is fairly unlikely to occur in
general, I think -- let's not forget that. These are presumably rare
events that nevertheless cause many real practical problems.

If we're going to add error bars, why wouldn't we also need error bars
for our error bars?

I think we're kind of just making life complicated for ourselves here
by pretending that unparameterized nested loops are part of some
general class of uncertainty problems that we need to worry about. In
some sense, they are, and David Rowley is right to mention the other
one that comes up pretty frequently. But like that list of two is
pretty much the whole list. I think we've talked ourselves into
believing that this problem is much harder than it really is.

+1

--
Peter Geoghegan

#16 Peter Geoghegan
pg@bowt.ie
In reply to: David Rowley (#9)

On Thu, Sep 29, 2022 at 9:00 PM David Rowley <dgrowleyml@gmail.com> wrote:

I understand that what you propose would be a fast way to fix this
issue. However, if we went and changed the join path creation code to
not add non-parameterised nested loop paths when other paths exist,
then how could we ever dare to put that code back again when we come
up with a better solution?

But why would it matter, even then?

I don't deny that something like that could make sense, but I don't
see why it should be in tension with this proposal. We're talking
about a plan shape that is (in practical terms) inherently
unreasonable, given the availability of an alternative plan shape. Why
wouldn't that continue to be true in every such case, forever?

To put it another way, the proposal seems like taking away something
that we don't want to have, ever. It seems like a subtractive thing to
me. The potential upside of allowing unparameterized nestloop joins
seems infinitesimal; zero for all practical purposes. So even with a
far more sophisticated framework for "plan riskiness" in place, it
would still make sense to treat unparameterized nestloop joins as
inherently undesirable. There is perhaps a theoretical sense in which
that isn't quite true, but it's true for all practical purposes, which
should be enough.

--
Peter Geoghegan

#17 Peter Geoghegan
pg@bowt.ie
In reply to: Benjamin Coutu (#11)

On Thu, Sep 29, 2022 at 11:44 PM Benjamin Coutu <ben.coutu@zeyos.com> wrote:

I think these things are orthogonal.

I agree that they're orthogonal. I just meant that execution time
strategies seem underexplored in general.

No matter how good the cost model ever gets, we will always have degenerate cases.

Sure, but the model isn't the problem here, really -- not to me. The
problem is that the planner can in some cases choose a plan that is
inherently unreasonable, at least in practical terms. You're talking
about uncertainties. But I'm actually talking about the opposite thing
-- certainty (albeit a limited kind of certainty that applies only to
one narrow set of conditions).

It's theoretically possible that bogosort will be faster than
quicksort in some individual cases. After all, bogosort is O(n) in the
best case, which is impossible to beat! But there is no practical
sense in which bogosort could ever be better than quicksort. Having
fewer choices by just not offering inherently bad choices seems quite
unrelated to what you're talking about.

For all I know you might be onto something. But it really seems
independent to me.

--
Peter Geoghegan

#18 Benjamin Coutu
ben.coutu@zeyos.com
In reply to: Peter Geoghegan (#17)

Sure, but the model isn't the problem here, really -- not to me. The
problem is that the planner can in some cases choose a plan that is
inherently unreasonable, at least in practical terms. You're talking
about uncertainties. But I'm actually talking about the opposite thing
-- certainty (albeit a limited kind of certainty that applies only to
one narrow set of conditions).

I absolutely agree and support your proposal to simply not generate those paths at all unless necessary.

For all I know you might be onto something. But it really seems
independent to me.

Yeah, I'm sorry if I hijacked this thread for something related but technically different. I just wanted to expand on your proposal by taking into account the join depth and also not just talking about unparameterized nested loop joins. The research is very clear that the uncertainty is proportional to the join level, and that is the point I am trying to focus the discussion on.

I really encourage everyone to read the VLDB paper. BTW, the unnamed proprietary DBMSs in that paper are the 3 big ones from Washington, California and NY, in that order.

#19 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#15)

On Fri, Sep 30, 2022 at 2:24 PM Peter Geoghegan <pg@bowt.ie> wrote:

It's not just that the risks are ludicrously high, of course. The
potential benefits must *also* be very low. It's both factors,
together.

Hmm, maybe. But it also wouldn't surprise me very much if someone can
come up with a test case where a nested loop with a single row (or
maybe no rows) on one side or the other is significantly faster
than any alternative plan. I believe, though, that even if such cases
exist, they are probably relatively few in number compared to the
cases where unparameterized nested loops hurt, and the savings are
probably small compared to the multiple-orders-of-magnitude slowdowns
that you can get when a nested loop goes bad. But they might still be
relatively large -- 2x, 3x? -- in absolute terms.

In the prior discussion, the only person who showed a case in which he
thought that an unparameterized nested loop might be a clear winner
was Tom, but it was just sort of a general "this kind of case might be
a problem" thing rather than a fully worked out example with real
timings. Perhaps someone ought to try to characterize the kinds of
cases he mentioned, to help us get a clearer feeling about what, if
anything, we're gaining from the current scheme.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20 Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#19)

On Sun, Oct 2, 2022 at 3:43 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 30, 2022 at 2:24 PM Peter Geoghegan <pg@bowt.ie> wrote:

It's not just that the risks are ludicrously high, of course. The
potential benefits must *also* be very low. It's both factors,
together.

Hmm, maybe. But it also wouldn't surprise me very much if someone can
come up with a test case where a nested loop with a single row (or
maybe no rows) on one side or the other is significantly faster
than any alternative plan.

That's certainly possible, but wouldn't the difference all come from
fixed startup costs? If we're talking about a single row, with a
minimal test case, then the potential downside of this more
conservative strategy might indeed amount to something like a 2x or 3x
slowdown, if we look at it in isolation. But why measure it that way?
I think that absolute differences like milliseconds of execution time
are much more relevant.

Real production databases have many queries with very diverse
characteristics -- there is a lot going on at any given moment. The
proportion of queries that will be affected either way by avoiding
unparameterized nested loop joins is probably going to be very small.
Nobody is going to notice if only a small subset of all queries are
maybe 1 ms or 2 ms slower. As long as it's well within the margin of
noise in 100% of all cases, it really shouldn't matter.

AFAICT the choice is one of "bounded, low upside versus unbounded,
high downside".

I believe, though, that even if such cases
exist, they are probably relatively few in number compared to the
cases where unparameterized nested loops hurt, and the savings are
probably small compared to the multiple-orders-of-magnitude slowdowns
that you can get when a nested loop goes bad. But they might still be
relatively large -- 2x, 3x? -- in absolute terms.

I suspect it won't even matter if disallowing unparameterized nested
loop joins loses on average.

I am reminded of this:
https://en.wikipedia.org/wiki/St._Petersburg_paradox#Expected_utility_theory

--
Peter Geoghegan

#21 Peter Geoghegan
pg@bowt.ie
In reply to: Benjamin Coutu (#18)

On Fri, Sep 30, 2022 at 3:19 PM Benjamin Coutu <ben.coutu@zeyos.com> wrote:

For all I know you might be onto something. But it really seems
independent to me.

Yeah, I'm sorry if I hijacked this thread for something related but technically different.

I certainly wouldn't say that you hijacked the thread. I'm glad that
you revived the discussion, in fact.

The way that I've framed the problem is probably at least somewhat
controversial. In fact I'm almost certain that at least one or two
people will flat out disagree with me. But even if everybody else
already thought about unparameterized nested loop joins in the same
terms, it might still be useful to make the arguments that you've
made.

What I'm saying is that the probability of "getting it right" is
virtually irrelevant in the case of these unparameterized nested loop
join plans specifically. Any probability that's less than 1.0 is
already unacceptable, more or less. A probability of 1.0 is never
attainable in the real world, no matter what, so why should the true
probability (whatever that means) matter at all? The appropriate
course of action will still be "just don't do that, ever".

To me this dynamic seems qualitatively different to other cases, where
we might want to give some weight to uncertainty. Understanding where
the boundaries lie between those trickier cases and this simpler case
seems important and relevant to me.

--
Peter Geoghegan

#22 Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Benjamin Coutu (#1)

On 29/9/2022 21:32, Benjamin Coutu wrote:

I'd like to revamp this important discussion.

As is well described in this fairly recent paper here https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres) "estimation errors quickly grow as the number of joins increases, and that these errors are usually the reason for bad plans" - I think we can all get behind that statement.

While nested loop joins work great when cardinality estimates are correct, they are notoriously bad when the optimizer underestimates and they degrade very fast in such cases - the good vs. bad here is very asymmetric. On the other hand, hash joins degrade much more gracefully - they are considered very robust against underestimation. The above-mentioned paper illustrates that all major DBMSs (including Postgres) tend to underestimate, and usually that underestimation increases drastically with the number of joins (see Figures 3+4 of the paper).

Now, a simple approach to guarding against bad plans that arise from underestimation could be to use what I would call a nested-loop-conviction-multiplier based on the current depth of the join tree, e.g. for a base table that multiplier would obviously be 1, but would then grow (e.g.) quadratically. That conviction-multiplier would *NOT* be used to skew the cardinality estimates themselves, but rather be applied to the overall nested loop join cost at each particular stage of the plan when comparing it to other more robust join strategies like hash or sort-merge joins. That way when we can be sure to have a good estimate at the bottom of the join tree we treat all things equal, but favor nested loops less and less as we move up the join tree for the sake of robustness.
Also, we can expand the multiplier whenever we fall back to using the default cardinality constant, as surely all bets are off at that point - we should definitely treat nested loop joins as out of favor in this instance, and that could easily be incorporated by simply increasing the conviction-multiplier.

What are your thoughts on this simple idea - is it perhaps too simple?

In my practice, a parameterized nested loop reduces execution time,
sometimes drastically. If your query touches a lot of tables but
extracts only a tiny part of the data, and you have good coverage by
indexes, PNL works great.
Moreover, I have pondered extending parameterization through subqueries
and groupings.

What could you say about a different way: hybrid join? In MS SQL Server,
they have such a feature [1], and, according to their description, it
requires low overhead. They start from HashJoin and switch to NestLoop
if the inner input contains too few tuples. It solves the issue, doesn't it?

[1] https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

--
regards,
Andrey Lepikhov
Postgres Professional

#23 David Rowley
dgrowleyml@gmail.com
In reply to: Andrey Lepikhov (#22)

On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

What could you say about a different way: hybrid join? In MS SQL Server,
they have such a feature [1], and, according to their description, it
requires low overhead. They start from HashJoin and switch to NestLoop
if the inner input contains too few tuples. It solves the issue, doesn't it?

A complexity which you may not be considering here is that Nested Loop
joins always preserve the tuple order from the outer side of the join,
whereas hash joins will not do this when multi-batching.

I've no idea how the SQL Server engineers solved that.

David


[1] https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

#24 Lepikhov Andrei
a.lepikhov@postgrespro.ru
In reply to: David Rowley (#23)

On Wed, Sep 20, 2023, at 4:49 PM, David Rowley wrote:

On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

What could you say about a different way: hybrid join? In MS SQL Server,
they have such a feature [1], and, according to their description, it
requires low overhead. They start from HashJoin and switch to NestLoop
if the inner input contains too few tuples. It solves the issue, doesn't it?

A complexity which you may not be considering here is that Nested Loop
joins always preserve the tuple order from the outer side of the join,
whereas hash joins will not do this when multi-batching.

My idea here is the same as what the MS SQL guys did: prefetch some predefined number of tuples from the HashJoin's inner side and, if the planner has made a mistake and overestimated it, switch to a NestLoop that uses that small inner input as its outer side.
The opposite case, underestimation - starting with a NestLoop and switching to a HashJoin - looks more difficult, but the main question is: is it worthwhile to research?
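A very rough standalone sketch of the switching logic (hypothetical types and names; nothing like this exists in the executor today):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define PREFETCH_THRESHOLD 64   /* arbitrary for this sketch */

typedef struct Tuple Tuple;     /* opaque stand-in for an executor tuple */

typedef struct
{
    Tuple  *buffered[PREFETCH_THRESHOLD];   /* prefetched inner tuples */
    size_t  nbuffered;
    bool    inner_exhausted;
} AdaptiveJoinState;

/*
 * Pull up to PREFETCH_THRESHOLD tuples from the inner side before deciding.
 * If the inner side runs dry first, the planner overestimated it, and a
 * nested loop that uses the small buffered set as its outer side is cheap;
 * otherwise carry on and build the hash table, starting from the buffer.
 */
static bool
switch_to_nestloop(AdaptiveJoinState *state,
                   Tuple *(*next_inner_tuple)(void *), void *inner_plan)
{
    while (state->nbuffered < PREFETCH_THRESHOLD)
    {
        Tuple  *tup = next_inner_tuple(inner_plan);

        if (tup == NULL)
        {
            state->inner_exhausted = true;
            return true;        /* small inner side: use the nested loop */
        }
        state->buffered[state->nbuffered++] = tup;
    }
    return false;               /* big inner side: stick with the hash join */
}

/* Simulate an inner side that turns out to be empty at run time. */
static Tuple *
empty_inner(void *arg)
{
    (void) arg;
    return NULL;
}

int
main(void)
{
    AdaptiveJoinState state = {{NULL}, 0, false};

    printf("switch to nestloop? %s\n",
           switch_to_nestloop(&state, empty_inner, NULL) ? "yes" : "no");
    return 0;
}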

I've no idea how the SQL Server engineers solved that.

[1] https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

--
Regards,
Andrei Lepikhov