Weird behaviour of the planner

Started by Guillaume Lelargeover 7 years ago3 messagesgeneral

guillaume@lelarge.info

over 7 years ago

Hi,

While working on a slow query of one of our customers, I found this
behaviour which seems quite weird to me. Here it is:

(no changes on any parameter)
EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Unique (cost=12005.97..12049.20 rows=1 width=1430) (actual
time=20055.294..20323.348 rows=1 loops=1)
-> Sort (cost=12005.97..12006.30 rows=132 width=1430) (actual
time=20055.290..20105.738 rows=60000 loops=1)
Sort Key: (... 130 columns ...)
Sort Method: external sort Disk: 84464kB
-> Seq Scan on gleu2 (cost=0.00..12001.32 rows=132 width=1430)
(actual time=0.109..114.142 rows=60000 loops=1)
Planning time: 10.012 ms
Execution time: 20337.854 ms
(7 rows)

That looks quite good. The total cost is 12049, so I expect this plan to
have the smaller cost as it's the choosen plan. Now, I'm disabling Sort,
and here is what I get:

SET enable_sort TO off;
EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=12044.22..12044.23 rows=1 width=1430) (actual
time=508.342..508.343 rows=1 loops=1)
Group Key: (... 130 columns ...)
-> Seq Scan on gleu2 (cost=0.00..12001.32 rows=132 width=1430) (actual
time=0.036..57.088 rows=60000 loops=1)
Planning time: 1.335 ms
Execution time: 508.529 ms
(5 rows)

I now have a Hash Aggregate. Sounds good, but I do have a smaller total
cost than the former plan. How is it possible that the planner chooses a
plan with a higher cost? (and a much much higher duration, but it can't
obviously know that).

Still working on my customer's issue, I found that it worked great before
commit 3fc6e2d7f5b652b417fa6937c34de2438d60fa9f (Make the upper part of the
planner work by generating and comparing Paths.) and not after. That kind
of patch is way out of my league to be honest :) But before that commit,
the Hash Aggregate is choosen right out of the box. And after that commit,
the Sort is choosen even with its higher cost.

Oh, BTW, yeah, I know. The row estimate is quite off... I'm not asking how
to make my query go faster, I'm just wondering why PostgreSQL chooses a
higher-cost plan :)

Thanks.

--
Guillaume.

David Rowley

dgrowleyml@gmail.com

over 7 years ago

In reply to: Guillaume Lelarge (#1)

Re: Weird behaviour of the planner

On 2 August 2018 at 02:48, Guillaume Lelarge <guillaume@lelarge.info> wrote:

EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Unique (cost=12005.97..12049.20 rows=1 width=1430) (actual
time=20055.294..20323.348 rows=1 loops=1)
-> Sort (cost=12005.97..12006.30 rows=132 width=1430) (actual
time=20055.290..20105.738 rows=60000 loops=1)
Sort Key: (... 130 columns ...)
Sort Method: external sort Disk: 84464kB
-> Seq Scan on gleu2 (cost=0.00..12001.32 rows=132 width=1430)
(actual time=0.109..114.142 rows=60000 loops=1)
Planning time: 10.012 ms
Execution time: 20337.854 ms
(7 rows)

That looks quite good. The total cost is 12049, so I expect this plan to
have the smaller cost as it's the choosen plan. Now, I'm disabling Sort, and
here is what I get:

SET enable_sort TO off;
EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=12044.22..12044.23 rows=1 width=1430) (actual
time=508.342..508.343 rows=1 loops=1)
Group Key: (... 130 columns ...)
-> Seq Scan on gleu2 (cost=0.00..12001.32 rows=132 width=1430) (actual
time=0.036..57.088 rows=60000 loops=1)
Planning time: 1.335 ms
Execution time: 508.529 ms
(5 rows)

When the planner adds a new path it compares the cost not exactly, but
with a 'fuzz factor'. It's very likely that the hashagg path did not
make it as it was not fuzzily any cheaper than the unique path. By
default, this fuzz factor is 1%.

It seems in your case the costs don't quite match reality which is
quite likely due to the poor row estimates on "gleu2". Has that table
been analyzed recently? or is there some reason that auto-vacuum is
not getting to it?

There's a bit more reading of what I'm talking about in
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/util/pathnode.c#L141

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Guillaume Lelarge

guillaume@lelarge.info

over 7 years ago

In reply to: David Rowley (#2)

Re: Weird behaviour of the planner

2018-08-01 16:59 GMT+02:00 David Rowley <david.rowley@2ndquadrant.com>:

On 2 August 2018 at 02:48, Guillaume Lelarge <guillaume@lelarge.info>
wrote:

EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;

QUERY PLAN
------------------------------------------------------------

---------------------------------------------------------

Unique (cost=12005.97..12049.20 rows=1 width=1430) (actual
time=20055.294..20323.348 rows=1 loops=1)
-> Sort (cost=12005.97..12006.30 rows=132 width=1430) (actual
time=20055.290..20105.738 rows=60000 loops=1)
Sort Key: (... 130 columns ...)
Sort Method: external sort Disk: 84464kB
-> Seq Scan on gleu2 (cost=0.00..12001.32 rows=132 width=1430)
(actual time=0.109..114.142 rows=60000 loops=1)
Planning time: 10.012 ms
Execution time: 20337.854 ms
(7 rows)

That looks quite good. The total cost is 12049, so I expect this plan to
have the smaller cost as it's the choosen plan. Now, I'm disabling Sort,

and

here is what I get:

SET enable_sort TO off;
EXPLAIN (ANALYZE) SELECT DISTINCT * FROM gleu2;

QUERY PLAN
------------------------------------------------------------

---------------------------------------------------------

HashAggregate (cost=12044.22..12044.23 rows=1 width=1430) (actual
time=508.342..508.343 rows=1 loops=1)
Group Key: (... 130 columns ...)
-> Seq Scan on gleu2 (cost=0.00..12001.32 rows=132 width=1430)

(actual

time=0.036..57.088 rows=60000 loops=1)
Planning time: 1.335 ms
Execution time: 508.529 ms
(5 rows)

When the planner adds a new path it compares the cost not exactly, but
with a 'fuzz factor'. It's very likely that the hashagg path did not
make it as it was not fuzzily any cheaper than the unique path. By
default, this fuzz factor is 1%.

That may definitely be it. There's not much of a difference in the total
costs.

It seems in your case the costs don't quite match reality which is

quite likely due to the poor row estimates on "gleu2". Has that table
been analyzed recently? or is there some reason that auto-vacuum is
not getting to it?

This is a small test case of a much bigger query joining a large number of
tables, materialized views, views (calling functions), etc. The actual plan
contains 84 nodes (32 scans, lots of joins, and a few other nodes). The
mis-estimate is to follow what the big query gives me.

There's a bit more reading of what I'm talking about in

https://github.com/postgres/postgres/blob/master/src/
backend/optimizer/util/pathnode.c#L141

I'm gonna read that. Thank you.

--
Guillaume.