Yet another abort-early plan disaster on 9.3

Started by Josh Berkus · over 11 years ago · 80 messages · hackers
#1 Josh Berkus
josh@agliodbs.com

Folks,

Just encountered another case of critical fail for abort-early query
plans. In this case, it will completely prevent a user from upgrading
to 9.3; this is their most common query, and on 9.3 it takes 1000X longer.

Maybe we should think about removing abort-early plans from 9.5?
Clearly we don't understand them well enough for them to work for users.

Query:

SELECT "categories".* FROM "categories" WHERE "categories"."user_id" IN
( SELECT to_user_id FROM "tags" WHERE "tags"."from_user_id" = 53529975 )
ORDER BY recorded_on DESC LIMIT 20;

Here's the plan from 9.1:

Limit  (cost=1613.10..1613.15 rows=20 width=194) (actual time=0.503..0.509 rows=20 loops=1)
  ->  Sort  (cost=1613.10..1736.14 rows=49215 width=194) (actual time=0.502..0.505 rows=20 loops=1)
        Sort Key: categories.recorded_on
        Sort Method: top-N heapsort  Memory: 30kB
        ->  Nested Loop  (cost=248.80..303.51 rows=49215 width=194) (actual time=0.069..0.347 rows=81 loops=1)
              ->  HashAggregate  (cost=248.80..248.81 rows=1 width=4) (actual time=0.050..0.054 rows=8 loops=1)
                    ->  Index Scan using unique_index_tags on tags  (cost=0.00..248.54 rows=103 width=4) (actual time=0.020..0.033 rows=8 loops=1)
                          Index Cond: (from_user_id = 53529975)
              ->  Index Scan using index_categories_on_user_id on categories  (cost=0.00..54.34 rows=29 width=194) (actual time=0.010..0.028 rows=10 loops=8)
                    Index Cond: (user_id = tags.to_user_id)
Total runtime: 0.641 ms

And from 9.3:

Limit  (cost=1.00..2641.10 rows=20 width=202) (actual time=9.933..711.372 rows=20 loops=1)
  ->  Nested Loop Semi Join  (cost=1.00..9641758.39 rows=73041 width=202) (actual time=9.931..711.361 rows=20 loops=1)
        ->  Index Scan Backward using index_categories_on_recorded_on on categories  (cost=0.43..406943.98 rows=4199200 width=202) (actual time=0.018..275.020 rows=170995 loops=1)
        ->  Index Scan using unique_index_tags on tags  (cost=0.57..2.20 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=170995)
              Index Cond: ((from_user_id = 53529975) AND (to_user_id = categories.user_id))
Total runtime: 711.457 ms

So, here's what's happening here:

As usual, PostgreSQL is dramatically undercounting n_distinct: it shows
categories.user_id at 146,000 and the ratio of to_user_id:from_user_id as
being 1:105 (as opposed to 1:6, which is about the real ratio). This
means that PostgreSQL thinks it can find the 20 rows within the first 2%
of the index ... whereas it actually needs to scan 50% of the index to
find them.
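For anyone reproducing this, the estimates in question are visible in pg_stats (a sketch, using the table and column names from the query above):

```sql
-- Inspect the planner's n_distinct estimates behind the bad plan.
-- Positive values are absolute distinct-value counts; negative values
-- mean "this fraction of the row count is distinct".
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE (tablename = 'categories' AND attname = 'user_id')
   OR (tablename = 'tags' AND attname IN ('from_user_id', 'to_user_id'));
```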

Removing LIMIT causes 9.3 to revert to the "good" plan, as expected.

This is the core issue with abort-early plans; they depend on our
statistics being extremely accurate, which we know they are not. And if
they're wrong, the execution time climbs by 1000X or more. Abort-early
plans are inherently riskier than other types of query plans.

What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
this change. The stats are no more than 10% different across the
version change.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

#2 Merlin Moncure
mmoncure@gmail.com
In reply to: Josh Berkus (#1)
Re: Yet another abort-early plan disaster on 9.3

On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh@agliodbs.com> wrote:

This is the core issue with abort-early plans; they depend on our
statistics being extremely accurate, which we know they are not. And if
they're wrong, the execution time climbs by 1000X or more. Abort-early
plans are inherently riskier than other types of query plans.

What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
this change. The stats are no more than 10% different across the
version change.

Amusingly on-topic rant I happened to read immediately after this:

http://wp.sigmod.org/?p=1075

Is there a canonical case where 'abort early' plans help? (I'm new
to that term -- is it a recent planner innovation...got any handy
links?)

merlin


#3 Josh Berkus
josh@agliodbs.com
In reply to: Merlin Moncure (#2)
Re: Yet another abort-early plan disaster on 9.3

On 09/19/2014 10:15 AM, Merlin Moncure wrote:

On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh@agliodbs.com> wrote:

This is the core issue with abort-early plans; they depend on our
statistics being extremely accurate, which we know they are not. And if
they're wrong, the execution time climbs by 1000X or more. Abort-early
plans are inherently riskier than other types of query plans.

What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
this change. The stats are no more than 10% different across the
version change.

Amusingly on-topic rant I happened to read immediately after this by chance:

http://wp.sigmod.org/?p=1075

Is there a canonical case where 'abort early' plans help? (I'm new
to that term -- is it a recent planner innovation...got any handy
links?)

Yeah, here's an example of the canonical case:

Table t1 ( a, b, c )

- "b" is low-cardinality
- "c" is high-cardinality
- There are separate indexes on both b and c.

SELECT a, b, c FROM t1
WHERE b = 2
ORDER BY c LIMIT 1;

In this case, the fastest plan is usually to scan the index on c and
return the first row that matches the filter on b.
This can be an order of magnitude faster than using the index on b and
then resorting by c and taking the first row, if (b=2) happens to match
20% of the table.

This is called an "abort early" plan because we expect to never finish
the scan on the index on c. We expect to scan the index on c, find the
first row that matches b=2 and exit.
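As a concrete sketch of the two competing strategies (index names invented for illustration):

```sql
CREATE TABLE t1 (a int, b int, c int);
CREATE INDEX t1_b_idx ON t1 (b);   -- low-cardinality column
CREATE INDEX t1_c_idx ON t1 (c);   -- high-cardinality column

-- Abort-early strategy: walk t1_c_idx in c order, applying (b = 2) only
-- as a filter, and stop at the first match.  Conventional strategy:
-- fetch all (b = 2) rows via t1_b_idx, sort them by c, take the first.
EXPLAIN SELECT a, b, c FROM t1 WHERE b = 2 ORDER BY c LIMIT 1;
```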

The problem with such plans is that they are "risky". As in, if we are
wrong about our (b=2) stats, then we've just adopted a query plan which
will be 10X to 1000X slower than the more conventional plan.

We can see this in the bad plan I posted:

Limit  (cost=1.00..2641.10 rows=20 width=202) (actual time=9.933..711.372 rows=20 loops=1)
  ->  Nested Loop Semi Join  (cost=1.00..9641758.39 rows=73041 width=202) (actual time=9.931..711.361 rows=20 loops=1)
        ->  Index Scan Backward using index_categories_on_recorded_on on categories  (cost=0.43..406943.98 rows=4199200 width=202) (actual time=0.018..275.020 rows=170995 loops=1)

Notice how the total cost of the plan is a fraction of the cost of the
two steps which preceded it? This is an indication that the planner
expects to be able to abort the index scan and nestloop join before it's
more than 2% through it.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#4 Greg Stark
stark@mit.edu
In reply to: Josh Berkus (#3)
Re: Yet another abort-early plan disaster on 9.3

On 19 Sep 2014 19:40, "Josh Berkus" <josh@agliodbs.com> wrote:

On 09/19/2014 10:15 AM, Merlin Moncure wrote:

On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh@agliodbs.com> wrote:

This is the core issue with abort-early plans; they depend on our
statistics being extremely accurate, which we know they are not. And if
they're wrong, the execution time climbs by 1000X or more. Abort-early
plans are inherently riskier than other types of query plans.

All plans are risky if the stats are wrong. It's one of the perennial
digressions that many postgres newcomers make to track worst case costs and
provide a knob for planner aggressiveness but it always breaks down when
you try to quantify the level of risk because you discover that even such
simple things as index scans versus sequential scans can be equally risky
either way.

What I'm not clear on is why upgrading from 9.1 to 9.3 would bring

about

this change. The stats are no more than 10% different across the
version change.

There's no difference. Postgres has been estimating LIMIT costs this way
since before I came to postgres in 7.3.

Is there a canonical case where 'abort early' plans help? (I'm new
to that term -- is it a recent planner innovation...got any handy
links?)

Yeah, here's an example of the canonical case:

Table t1 ( a, b, c )

- "b" is low-cardinality
- "c" is high-cardinality
- There are separate indexes on both b and c.

SELECT a, b, c FROM t1
WHERE b = 2
ORDER BY c LIMIT 1;

You badly want a partial index on c WHERE b=2 for each value of 2 which
appears in your queries.
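In SQL, for the hypothetical t1 from upthread, that would be something like:

```sql
-- One partial index per interesting value of b; the LIMIT 1 is then
-- satisfied by reading a single index tuple, with no filtering risk.
CREATE INDEX t1_c_b2_idx ON t1 (c) WHERE b = 2;
```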

It would be neat to have an opclass which worked like that. Which would
amount to having prefix compression perhaps.

What plan does 9.1 come up with?

#5 Claudio Freire
klaussfreire@gmail.com
In reply to: Greg Stark (#4)
Re: Yet another abort-early plan disaster on 9.3

On Sat, Sep 20, 2014 at 3:38 AM, Greg Stark <stark@mit.edu> wrote:

You badly want a partial index on c WHERE b=2 for each value of 2 which
appears in your queries.

It would be neat to have an opclass which worked like that. Which would
amount to having prefix compression perhaps.

I've been looking at that exactly.

One complexity of it, is that splitting becomes much harder. As in, recursive.


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#4)
Re: Yet another abort-early plan disaster on 9.3

Greg Stark <stark@mit.edu> writes:

On 19 Sep 2014 19:40, "Josh Berkus" <josh@agliodbs.com> wrote:

Yeah, here's an example of the canonical case:

Table t1 ( a, b, c )

- "b" is low-cardinality
- "c" is high-cardinality
- There are separate indexes on both b and c.

SELECT a, b, c FROM t1
WHERE b = 2
ORDER BY c LIMIT 1;

You badly want a partial index on c WHERE b=2 for each value of 2 which
appears in your queries.

Well, if it's *only* b = 2 that you ever search for, then maybe a partial
index would be a good answer. Personally I'd use a plain btree index on
(b, c). The planner's been able to match this type of query to
multicolumn indexes for a long time.
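That is (a sketch, on the same hypothetical table):

```sql
-- A single btree on (b, c) serves  WHERE b = 2 ORDER BY c LIMIT 1
-- directly: equality on the leading column, order on the second,
-- so no sort step and no abort-early gamble.
CREATE INDEX t1_b_c_idx ON t1 (b, c);
```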

regards, tom lane


#7 Josh Berkus
josh@agliodbs.com
In reply to: Greg Stark (#4)
Re: Yet another abort-early plan disaster on 9.3

On 09/19/2014 11:38 PM, Greg Stark wrote:

On 19 Sep 2014 19:40, "Josh Berkus" <josh@agliodbs.com> wrote:

On 09/19/2014 10:15 AM, Merlin Moncure wrote:

On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh@agliodbs.com> wrote:

This is the core issue with abort-early plans; they depend on our
statistics being extremely accurate, which we know they are not. And if
they're wrong, the execution time climbs by 1000X or more. Abort-early
plans are inherently riskier than other types of query plans.

All plans are risky if the stats are wrong. It's one of the perennial
digressions that many postgres newcomers make to track worst case costs
and provide a knob for planner aggressiveness but it always breaks down
when you try to quantify the level of risk because you discover that
even such simple things as index scans versus sequential scans can be
equally risky either way.

I've had a *wee* bit more experience with query plans than most Postgres
newcomers, Greg.

While all query plan changes can result in regressions if they're bad,
there are certain kinds of plans which depend more on accurate stats
than others. Abort-early plans are the most extreme of these. Most
likely we should adjust the cost model for abort-early plans to take in
our level of uncertainty, especially since we *know* that our n-distinct
estimation is crap. For example, we could increase the estimated cost
for an abort-early index scan by 10X, to reflect our weak confidence in
its correctness.

We could also probably do the same for plans which depend on column
correlation estimates being accurate.

What I'm not clear on is why upgrading from 9.1 to 9.3 would bring

about

this change. The stats are no more than 10% different across the
version change.

There's no difference. Postgres has been estimating LIMIT costs this way
since before I came to postgres in 7.3.

Then why is the plan different in 9.1 and 9.3 with identical stats (I
tested)?

It would be neat to have an opclass which worked like that. Which would
amount to having prefix compression perhaps.

What plan does 9.1 come up with?

That was the "good" plan from my original post.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#8 Merlin Moncure
mmoncure@gmail.com
In reply to: Josh Berkus (#7)
Re: Yet another abort-early plan disaster on 9.3

On Sat, Sep 20, 2014 at 1:33 PM, Josh Berkus <josh@agliodbs.com> wrote:

For example, we could increase the estimated cost
for an abort-early index scan by 10X, to reflect our weak confidence in
its correctness.

Has any progress been made on the performance farm? The problem with
suggestions like this (which seem pretty reasonable to me) is that
we've got no way of quantifying the downside. I think this is one
example of a class of plans that are high risk. Another one off the
top of my head is nestloop joins based on assumed selectivity of
multiple stacked quals. About 90% of the time, my reflective
workaround to these types of problems is to 'disable_nestloop' which
works around 90% of the time and the result are solved with monkeying
around with 'OFFSET 0' etc. In the past, a GUC controlling planner
risk has been much discussed -- maybe it's still worth considering?
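For the archives, the two workarounds look like this (the GUC is enable_nestloop; the query shape is borrowed from Josh's original post, so treat it as a sketch):

```sql
-- Workaround 1: per-session planner override.
SET enable_nestloop = off;

-- Workaround 2: OFFSET 0 acts as an optimization fence, preventing the
-- subquery from being flattened, so the abort-early index scan driven by
-- the outer LIMIT can't be chosen for it.
SELECT * FROM (
    SELECT c.* FROM categories c
    WHERE c.user_id IN (SELECT to_user_id FROM tags
                        WHERE from_user_id = 53529975)
    OFFSET 0
) sub
ORDER BY recorded_on DESC LIMIT 20;
```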

merlin


#9 Josh Berkus
josh@agliodbs.com
In reply to: Merlin Moncure (#8)
Re: Yet another abort-early plan disaster on 9.3

On 09/22/2014 06:55 AM, Merlin Moncure wrote:

Has any progress been made on the performance farm? The problem with
suggestions like this (which seem pretty reasonable to me) is that
we've got no way of quantifying the downside.

Yeah, that's certainly an issue. The problem is that we'd need a
benchmark which actually created complex query plans. I believe that
Mark Wong is working on TPCH-based benchmarks, so maybe we'll get that.

I think this is one
example of a class of plans that are high risk. Another one off the
top of my head is nestloop joins based on assumed selectivity of
multiple stacked quals.

Yeah, that's another good example.

My reflexive workaround to these types of
problems is 'enable_nestloop = off', which works around 90% of the
time; the rest are solved by monkeying around with 'OFFSET 0' etc.
In the past, a GUC controlling planner
risk has been much discussed -- maybe it's still worth considering?

We've hashed that out a bit, but frankly I think it's much more
profitable to pursue fixing the actual problem than providing a
workaround like "risk", such as:

a) fixing n_distinct estimation
b) estimating stacked quals using better math (i.e. not assuming total
randomness)
c) developing some kind of correlation stats

Otherwise we would be just providing users with another knob there's no
rational way to set.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#9)
Re: Yet another abort-early plan disaster on 9.3

On 23 September 2014 00:56, Josh Berkus <josh@agliodbs.com> wrote:

We've hashed that out a bit, but frankly I think it's much more
profitable to pursue fixing the actual problem than providing a
workaround like "risk", such as:

a) fixing n_distinct estimation
b) estimating stacked quals using better math (i.e. not assuming total
randomness)
c) developing some kind of correlation stats

Otherwise we would be just providing users with another knob there's no
rational way to set.

I believe this is a serious issue for PostgreSQL users and one that
needs to be addressed.

n_distinct can be fixed manually, so that is less of an issue.
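The manual fix (available since 9.0) looks like this; the -0.05 is an illustrative value, not a recommendation:

```sql
-- Override the estimate rather than the statistic itself; a negative
-- value means "this fraction of rows is distinct" (here 5%), which
-- survives table growth better than an absolute count.
ALTER TABLE categories ALTER COLUMN user_id SET (n_distinct = -0.05);
ANALYZE categories;   -- the override takes effect at the next ANALYZE
```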

The problem, as I see it, is different. We assume that if there are
100 distinct values and you use LIMIT 1 that you would only need to
scan 1% of rows. We assume that the data is arranged in the table in a
very homogenous layout. When data is not, and it seldom is, we get
problems.

Simply put, assuming that LIMIT will reduce the size of all scans is
just way wrong. I've seen many plans where increasing the LIMIT
dramatically improves the plan.

If we can at least agree it is a problem, we can try to move forwards.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11 Merlin Moncure
mmoncure@gmail.com
In reply to: Simon Riggs (#10)
Re: Yet another abort-early plan disaster on 9.3

On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

The problem, as I see it, is different. We assume that if there are
100 distinct values and you use LIMIT 1 that you would only need to
scan 1% of rows. We assume that the data is arranged in the table in a
very homogenous layout. When data is not, and it seldom is, we get
problems.

Hm, good point -- 'data proximity'. At least in theory, can't this be
measured and quantified? For example, given a number of distinct
values, you could estimate the % of pages read (or maybe non
sequential seeks relative to the number of pages) you'd need to read
all instances of a particular value in the average (or perhaps the
worst) case. One way of trying to calculate that would be to look at
proximity of values in sampled pages (and maybe a penalty assigned for
high update activity relative to table size). Data proximity would
then become a cost coefficient to the benefits of LIMIT.
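Postgres already tracks a single-column cousin of this -- the physical-vs-logical ordering correlation in pg_stats -- though as far as I know nothing like it currently feeds the LIMIT estimate:

```sql
-- correlation runs from -1 (reverse-ordered on disk) through 0 (random
-- placement) to +1 (perfectly clustered); today it only informs index
-- scan I/O costing, not LIMIT early-termination estimates.
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'categories';
```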

merlin


#12 Simon Riggs
simon@2ndQuadrant.com
In reply to: Merlin Moncure (#11)
Re: Yet another abort-early plan disaster on 9.3

On 29 September 2014 16:00, Merlin Moncure <mmoncure@gmail.com> wrote:

On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

The problem, as I see it, is different. We assume that if there are
100 distinct values and you use LIMIT 1 that you would only need to
scan 1% of rows. We assume that the data is arranged in the table in a
very homogenous layout. When data is not, and it seldom is, we get
problems.

Hm, good point -- 'data proximity'. At least in theory, can't this be
measured and quantified? For example, given a number of distinct
values, you could estimate the % of pages read (or maybe non
sequential seeks relative to the number of pages) you'd need to read
all instances of a particular value in the average (or perhaps the
worst) case. One way of trying to calculate that would be to look at
proximity of values in sampled pages (and maybe a penalty assigned for
high update activity relative to table size). Data proximity would
then become a cost coefficient to the benefits of LIMIT.

The necessary first step to this is to realise that we can't simply
apply the LIMIT as a reduction in query cost, in all cases.

The way I'm seeing it, you can't assume the LIMIT will apply to any
IndexScan that doesn't have an index condition. If it has just a
filter, or nothing at all, just an ordering then it could easily scan
the whole index if the stats are wrong.

So plans like this could be wrong, by assuming the scan will end
earlier because of the LIMIT than it actually will.

Limit
  IndexScan (no index cond)

Limit
  NestJoin
    IndexScan (no index cond)
    SomeScan

Limit
  NestJoin
    NestJoin
      IndexScan (no index cond)
      SomeScan
    SomeScan

and deeper...

I'm looking for a way to identify and exclude such plans, assuming
that this captures at least some of the problem plans.

Comments? Test Cases?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#10)
Re: Yet another abort-early plan disaster on 9.3

On 09/26/2014 01:06 AM, Simon Riggs wrote:

On 23 September 2014 00:56, Josh Berkus <josh@agliodbs.com> wrote:

We've hashed that out a bit, but frankly I think it's much more
profitable to pursue fixing the actual problem than providing a
workaround like "risk", such as:

a) fixing n_distinct estimation
b) estimating stacked quals using better math (i.e. not assuming total
randomness)
c) developing some kind of correlation stats

Otherwise we would be just providing users with another knob there's no
rational way to set.

I believe this is a serious issue for PostgreSQL users and one that
needs to be addressed.

n_distinct can be fixed manually, so that is less of an issue.

It's an issue for the 99.8% of our users who don't know what n_distinct
is, let alone how to calculate it. Also, changing it requires an
exclusive lock on the table. Of course, you and I have been over this
issue before.

One thing I'm wondering is why our estimator so seldom creates
n_distinct as a %. Really, any time n_distinct is over 10K we should be
estimating a % instead. Now, estimating that % has its own issues, but
it does seem like a peculiar quirk of our stats model.

Anyway, in the particular case I posted fixing n_distinct to realistic
numbers (%) fixed the query plan.

The problem, as I see it, is different. We assume that if there are
100 distinct values and you use LIMIT 1 that you would only need to
scan 1% of rows. We assume that the data is arranged in the table in a
very homogenous layout. When data is not, and it seldom is, we get
problems.

Simply put, assuming that LIMIT will reduce the size of all scans is
just way wrong. I've seen many plans where increasing the LIMIT
dramatically improves the plan.

If we can at least agree it is a problem, we can try to move forwards.

That is certainly another problem. Does correlation stat figure in the
LIMIT calculation at all, currently? That's what correlation stat is
for, no?

Also, to be fair, physical correlation of rows can also lead to
abort-early plans being extra fast, if everything we want is towards the
beginning of the index. Which means we'd need working multi-column
correlation, which is a known hard problem.

For example, consider the query:

SELECT id, updated_on FROM audit_log
WHERE updated_on < '2010-01-01'
ORDER BY id LIMIT 10;

In an append-only table, that query is liable to be very fast with an
abort-early plan scanning on an ID index (AEP from now on), since the
oldest rows are likely to correspond with the smallest IDs. But then
the user does this:

SELECT id, updated_on FROM audit_log
WHERE updated_on < '2010-01-01'
ORDER BY id DESC LIMIT 10;

... and a completely different plan is called for, because using an AEP
will result in reverse scanning most of the index. However, I'm not
sure the planner knows the difference, since it's only comparing the
estimated selectivity of (updated_on < '2010-01-01') and seeing that
it's 20% of rows. I bet you'd get an AEP in the 2nd case too.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#12)
Re: Yet another abort-early plan disaster on 9.3

Simon Riggs <simon@2ndquadrant.com> writes:

The way I'm seeing it, you can't assume the LIMIT will apply to any
IndexScan that doesn't have an index condition. If it has just a
filter, or nothing at all, just an ordering then it could easily scan
the whole index if the stats are wrong.

That statement applies with equal force to *any* plan with a LIMIT;
it's not just index scans.

The real question is to what extent are the tuples satisfying the extra
filter condition randomly distributed with respect to the index order
(or physical order, if it's a seqscan). The existing cost estimation
code effectively assumes that they're perfectly uniformly distributed;
which is a good average-case assumption but can be horribly wrong in
the worst case.

If we could settle on some other model for the probable distribution
of the matching tuples, we could adjust the cost estimates for LIMIT
accordingly. I have not enough statistics background to know what a
realistic alternative would be.

Another possibility is to still assume a uniform distribution but estimate
for, say, a 90% probability instead of 50% probability that we'll find
enough tuples after scanning X amount of the table. Again, I'm not too
sure what that translates to in terms of the actual math, but it sounds
like something a statistics person could do in their sleep.
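Roughly, under the uniform/independent-matches model (my notation, not the planner's: N = input rows, s = qual selectivity, k = the LIMIT count), the current estimate and a 90th-percentile variant would be:

```latex
% Current model: stop after scanning the expected fraction of the input
f_{\mathrm{expected}} \approx \frac{k}{s\,N}
% 90\%-confidence variant: scan the smallest n rows such that at least
% k matches have appeared with probability 0.9
f_{90} = \frac{\min\bigl\{\, n : \Pr\bigl[\mathrm{Binomial}(n, s) \ge k\bigr] \ge 0.9 \,\bigr\}}{N}
```

This is only a sketch; the planner's actual LIMIT costing works on cost fractions rather than row counts, but the substitution would be of this shape.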

I do not think we should estimate for the worst case though. If we do,
we'll hear cries of anguish from a lot of people, including many of the
same ones complaining now, because the planner stopped picking fast-start
plans even for cases where they are orders of magnitude faster than the
alternatives.

regards, tom lane


#15 Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Tom Lane (#14)
Re: Yet another abort-early plan disaster on 9.3

On 30/09/14 12:00, Tom Lane wrote:

The real question is to what extent are the tuples satisfying the extra
filter condition randomly distributed with respect to the index order
(or physical order, if it's a seqscan). The existing cost estimation
code effectively assumes that they're perfectly uniformly distributed;
which is a good average-case assumption but can be horribly wrong in
the worst case.

If we could settle on some other model for the probable distribution
of the matching tuples, we could adjust the cost estimates for LIMIT
accordingly. I have not enough statistics background to know what a
realistic alternative would be.

Another possibility is to still assume a uniform distribution but estimate
for, say, a 90% probability instead of 50% probability that we'll find
enough tuples after scanning X amount of the table. Again, I'm not too
sure what that translates to in terms of the actual math, but it sounds
like something a statistics person could do in their sleep.

I do not think we should estimate for the worst case though. If we do,
we'll hear cries of anguish from a lot of people, including many of the
same ones complaining now, because the planner stopped picking fast-start
plans even for cases where they are orders of magnitude faster than the
alternatives.

regards, tom lane

If you analyzed the tables in most production databases, you would find that they are almost invariably not uniformly distributed.

Most likely they will be clumped, and if you plotted the frequency of values of a given column in a given range against the number of blocks, you are likely to see one or more distinct peaks. If the table has been CLUSTERed on that column, then they will very likely be in one clump spanning contiguous blocks.

I suspect that there are two distinct populations: one relating to values present before the last VACUUM, and ones added since.

There are so many factors to consider, pattern of CRUD operations, range of values in query, ... that probably prevent using very sophisticated approaches, but I would be happy to be proved wrong!

Though I am fairly confident that if the distribution is not known in advance for a given table, then the percentage of the table that must be processed to satisfy the LIMIT is likely to be a lot larger than a uniform distribution would suggest.

Would it be feasible to get a competent statistician to advise what data to collect, and to analyze it? Maybe it is possible to get a better estimate on how much of a table needs to be scanned, based on some fairly simple statistics. But unless research is done, it is probably impossible to determine what statistics might be useful, and how effective a better estimate could be.
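[Gavin's clumping concern is easy to demonstrate with a toy simulation. Illustrative Python only; the table size, match count, and clump placement are invented for the example.]

```python
import random

def rows_scanned_for_limit(match_positions, k):
    """Rows read, in scan order, before the k-th matching row appears."""
    return sorted(match_positions)[k - 1] + 1

random.seed(42)
N, n_match, k = 100_000, 200, 20

# Uniform: the planner's implicit assumption -- matches scattered evenly.
uniform = random.sample(range(N), n_match)

# Clumped: all matches in one contiguous 1% region near the end of the
# scan order (e.g. a freshly loaded batch, or a CLUSTERed table).
clumped = random.sample(range(95_000, 96_000), n_match)

rows_scanned_for_limit(uniform, k)   # roughly N*k/n_match, around 10,000
rows_scanned_for_limit(clumped, k)   # over 95,000 -- nearly the whole table
```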

I have a nasty feeling that assuming a uniform distribution may still
end up being the best we can do - but I may be being unduly pessimistic!

Cheers,
Gavin

#16Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#10)
Re: Yet another abort-early plan disaster on 9.3

On Fri, Sep 26, 2014 at 9:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

If we can at least agree it is a problem, we can try to move forwards.

Well that's a good question. I don't think we do and I think the
reason why is because we haven't actually pinned down exactly what is
the problem.

The real problem here is that the ideal index for the query does not
exist, and Postgres is trying to choose between two inappropriate
indexes where that decision is very, very hard for it to make. If it
guesses wrong in *either* direction it'll go very poorly. We can try to
improve the frequency of getting the right decision but it'll never be
100% and even if it was it'll still not perform as well as the right
index would have.

I have seen plenty of applications where the slowdown was in the
reverse direction --
where a query like "find the last login for the current user" was
planned just as Josh is asking for by retrieving all the records for
the user and sorting by login time and it caused large problems in
production when some users had a disproportionately large number of
records.

The real solution for users is to create the compound index on both columns (or
a partial index in some cases). Trying to make do with either an ordered scan
or an index scan plus sort is going to cause problems in real-world usage.

In fact I think the real story here is that Postgres is doing a
surprisingly good job at making do without the right index and that's
causing users to get surprisingly far before they run into problems.
That may not be the best thing for users in the long run but that's a
problem that should be solved by better development tools to help
users identify scalability problems early.

--
greg

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#13)
Re: Yet another abort-early plan disaster on 9.3

On 29 September 2014 22:54, Josh Berkus <josh@agliodbs.com> wrote:

On 09/26/2014 01:06 AM, Simon Riggs wrote:

On 23 September 2014 00:56, Josh Berkus <josh@agliodbs.com> wrote:

We've hashed that out a bit, but frankly I think it's much more
profitable to pursue fixing the actual problem than providing a
workaround like "risk", such as:

a) fixing n_distinct estimation
b) estimating stacked quals using better math (i.e. not assuming total
randomness)
c) developing some kind of correlation stats

Otherwise we would be just providing users with another knob there's no
rational way to set.

I believe this is a serious issue for PostgreSQL users and one that
needs to be addressed.

n_distinct can be fixed manually, so that is less of an issue.

It's an issue for the 99.8% of our users who don't know what n_distinct
is, let alone how to calculate it. Also, changing it requires an
exclusive lock on the table. Of course, you and I have been over this
issue before.

In 9.4 you'll be able to set n_distinct using only a Share Update
Exclusive lock.

So that's no longer a problem.

The quality of the n_distinct estimate itself is an issue with no
current solution, but that is why you can set it manually.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#14)
Re: Yet another abort-early plan disaster on 9.3

On 30 September 2014 00:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

The way I'm seeing it, you can't assume the LIMIT will apply to any
IndexScan that doesn't have an index condition. If it has just a
filter, or nothing at all, just an ordering then it could easily scan
the whole index if the stats are wrong.

That statement applies with equal force to *any* plan with a LIMIT;
it's not just index scans.

Agreed

The real question is to what extent are the tuples satisfying the extra
filter condition randomly distributed with respect to the index order
(or physical order, if it's a seqscan).

Agreed

The existing cost estimation
code effectively assumes that they're perfectly uniformly distributed;
which is a good average-case assumption but can be horribly wrong in
the worst case.

Agreed. This is the main observation from which we can work.

If we could settle on some other model for the probable distribution
of the matching tuples, we could adjust the cost estimates for LIMIT
accordingly. I have not enough statistics background to know what a
realistic alternative would be.

I'm not sure that the correlation alone is sufficient to be able to do
that. We'd need to estimate where the values looked for are likely to
be wrt other values, then increase the estimate accordingly. That sounds
like a lot of pushups grovelling through quals and comparing against
stats. So my thinking is actually to rule that out, unless you've some
ideas for how to do that?

Another possibility is to still assume a uniform distribution but estimate
for, say, a 90% probability instead of 50% probability that we'll find
enough tuples after scanning X amount of the table. Again, I'm not too
sure what that translates to in terms of the actual math, but it sounds
like something a statistics person could do in their sleep.

I do not think we should estimate for the worst case though. If we do,
we'll hear cries of anguish from a lot of people, including many of the
same ones complaining now, because the planner stopped picking fast-start
plans even for cases where they are orders of magnitude faster than the
alternatives.

Fast start plans still make sense when performing an IndexScan with no
filter conditions. Those types of plan should not be changed from
current costing - they are accurate, good and very important because
of their frequency in real workloads.

What I think we are seeing is Ordered plans being selected too often
in preference to Sorted plans when we make selectivity or stats
errors, as well as data distributions that aren't correctly described
by the statistics causing much longer execution times.

Here are some plan selection strategies:

* Cost based - attempt to exactly calculate the cost based upon
existing stats - increase the complexity of cost calc to cover other
aspects. Even if we do that, these may not be that helpful in covering
the cases where the stats turn out to be wrong.

* Risk based - A risk adjusted viewpoint would be that we should treat
the cost as mid-way between the best and the worst. The worst is
clearly scanning (100% - N) of the tuples, the best is just N tuples.
So we should be costing scans with excess filter conditions as a (100%
Scan)/2, no matter the conditions, based purely upon risk.

* Simplified heuristic - deselect ordered plans when they are driven
from scans without quals or indexscans with filters, since the risk
adjusted cost is likely to be higher than the sorted cost. Inspecting
the plan tree for this could be quite costly, so would only be done
when the total cost is $high, prior to it being adjusted by LIMIT.
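[The risk-based midpoint can be sketched numerically against today's uniform-assumption costing. Illustrative Python only; the flat per-row cost and all names are invented, and this is not planner code.]

```python
def limit_scan_costs(total_rows, est_matches, limit, cost_per_row=1.0):
    """Contrast the uniform-assumption LIMIT costing with a risk-adjusted
    midpoint between best and worst case."""
    # Uniform assumption: expect to scan total_rows * limit/est_matches rows.
    uniform = total_rows * (limit / max(est_matches, 1)) * cost_per_row
    # Best case: the first `limit` rows scanned all satisfy the filter.
    best = limit * cost_per_row
    # Worst case: the matches sit at the far end; scan nearly everything.
    worst = total_rows * cost_per_row
    # Risk-adjusted: midway between best and worst, regardless of stats.
    risk_adjusted = (best + worst) / 2
    return uniform, risk_adjusted

# Numbers from the 9.3 plan in this thread: ~4.2M categories rows,
# estimated 73041 matches, LIMIT 20.
u, r = limit_scan_costs(4_199_200, 73_041, 20)
# u is about 1150 rows' worth of cost; r is about 2.1M, so the risk
# view makes the abort-early plan far less attractive.
```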

In terms of practical steps... I suggest the following:

* Implement enable_orderedscan = on (default) | off. A switch to allow
the planner to de-select ordered plans, so we can more easily see the
effects of such plans in the wild.

* Code heuristic approach - I can see where to add my heuristic in the
grouping planner. So we just need to do a left? deep search of the
plan tree looking for scans of the appropriate type and bail out if we
find one.

Thoughts?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#19Graeme B. Bell
grb@skogoglandskap.no
In reply to: Simon Riggs (#18)
Re: Yet another abort-early plan disaster on 9.3

The existing cost estimation
code effectively assumes that they're perfectly uniformly distributed;
which is a good average-case assumption but can be horribly wrong in
the worst case.

Sorry, just an outsider jumping in with a quick comment.

Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat approach to scheduling. What problems would it raise?

Graeme.


#20Claudio Freire
klaussfreire@gmail.com
In reply to: Graeme B. Bell (#19)
Re: Yet another abort-early plan disaster on 9.3

On Tue, Sep 30, 2014 at 8:34 AM, Graeme B. Bell <grb@skogoglandskap.no> wrote:

The existing cost estimation
code effectively assumes that they're perfectly uniformly distributed;
which is a good average-case assumption but can be horribly wrong in
the worst case.

Sorry, just an outsider jumping in with a quick comment.

Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat approach to scheduling.

What problems would it raise?

Interleaved I/O, that would kill performance for both plans if it
happens on rotating media.


#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Graeme B. Bell (#19)
#22Jeff Janes
jeff.janes@gmail.com
In reply to: Gavin Flower (#15)
#23Graeme B. Bell
grb@skogoglandskap.no
In reply to: Tom Lane (#21)
#24Jeff Janes
jeff.janes@gmail.com
In reply to: Josh Berkus (#13)
#25Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Jeff Janes (#22)
#26Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#22)
#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#24)
#28Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#1)
#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#28)
In reply to: Simon Riggs (#29)
#31Ryan Johnson
ryan.johnson@cs.utoronto.ca
In reply to: Merlin Moncure (#11)
#32Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#1)
#33Peter Geoghegan
In reply to: Josh Berkus (#32)
#34Jeff Janes
jeff.janes@gmail.com
In reply to: Josh Berkus (#32)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Janes (#34)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#33)
#37Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#32)
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bruce Momjian (#37)
#39Craig James
cjames@emolecules.com
In reply to: Tomas Vondra (#38)
#40Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Craig James (#39)
#41Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#38)
#42Craig James
cjames@emolecules.com
In reply to: Tomas Vondra (#40)
#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Craig James (#42)
#44Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#1)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Josh Berkus (#44)
#46Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#45)
#47Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bruce Momjian (#46)
#48Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#47)
#49Jeff Janes
jeff.janes@gmail.com
In reply to: Tomas Vondra (#41)
#50Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Janes (#49)
#51Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#12)
#52Merlin Moncure
mmoncure@gmail.com
In reply to: Simon Riggs (#51)
#53Simon Riggs
simon@2ndQuadrant.com
In reply to: Merlin Moncure (#52)
#54Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#50)
#55Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#1)
#56Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#55)
#57Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#18)
#58Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#57)
#59Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#58)
#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tomas Vondra (#54)
#61Michael Paquier
michael@paquier.xyz
In reply to: Simon Riggs (#59)
#62Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#60)
#63Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#54)
#64Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#63)
#65Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bruce Momjian (#64)
#66Jeff Janes
jeff.janes@gmail.com
In reply to: Tomas Vondra (#63)
#67Jeff Janes
jeff.janes@gmail.com
In reply to: Jeff Janes (#66)
#68Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#63)
#69Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#68)
#70Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#69)
#71Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#68)
#72Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#70)
#73Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#71)
#74Jeff Janes
jeff.janes@gmail.com
In reply to: Tomas Vondra (#71)
#75Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#74)
#76Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#1)
#77Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#76)
#78Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#1)
#79Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Janes (#74)
#80Evgeniy Shishkin
itparanoia@gmail.com
In reply to: Michael Paquier (#61)