Query plan prefers hash join when nested loop is much faster
Hi,
I am trying to understand why the query planner insists on using a hash
join, and how to make it choose the better option, which in this case would
be a nested loop. I have two tables:
-- about 200 million rows
CREATE TABLE module_result (
    id bigserial PRIMARY KEY,
    name_id bigint NOT NULL REFERENCES result_name (id),
    run_id integer NOT NULL REFERENCES run (id),
    logs text NOT NULL,
    status result_status NOT NULL
);
CREATE INDEX ON module_result (run_id);

-- about 500 million rows
CREATE TABLE test_result (
    id bigserial PRIMARY KEY,
    name_id bigint NOT NULL REFERENCES result_name (id),
    module_result_id bigint NOT NULL REFERENCES module_result (id),
    seconds float NOT NULL,
    failure_msg text,  -- either a <failure>...</failure> or an <error message="..." />
    status result_status NOT NULL
);
CREATE INDEX ON test_result (module_result_id);
I'm trying to select all test cases that belong to a given run_id, which logically means finding all IDs in module_result that belong to a given run, and then selecting the test results for those IDs (each run_id maps to several module_result rows, each of which has several test_result rows).
EXPLAIN ANALYZE SELECT test_result.status, count(test_result.status) AS "Count"
FROM test_result
INNER JOIN module_result ON module_result.id = test_result.module_result_id
WHERE module_result.run_id = 158523
GROUP BY test_result.status;
 Finalize GroupAggregate  (cost=7771702.73..7771804.08 rows=3 width=12) (actual time=32341.993..32341.994 rows=2 loops=1)
   Group Key: test_result.status
   ->  Gather Merge  (cost=7771702.73..7771804.02 rows=6 width=12) (actual time=32341.970..32343.222 rows=6 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=7770702.71..7770803.30 rows=3 width=12) (actual time=32340.278..32340.286 rows=2 loops=3)
               Group Key: test_result.status
               ->  Sort  (cost=7770702.71..7770736.23 rows=13408 width=4) (actual time=32339.698..32339.916 rows=4941 loops=3)
                     Sort Key: test_result.status
                     Sort Method: quicksort  Memory: 431kB
                     Worker 0:  Sort Method: quicksort  Memory: 433kB
                     Worker 1:  Sort Method: quicksort  Memory: 409kB
                     ->  Hash Join  (cost=586.15..7769783.54 rows=13408 width=4) (actual time=18112.078..32339.011 rows=4941 loops=3)
                           Hash Cond: (test_result.module_result_id = module_result.id)
                           ->  Parallel Seq Scan on test_result  (cost=0.00..7145224.72 rows=237703872 width=12) (actual time=0.034..15957.894 rows=190207740 loops=3)
                           ->  Hash  (cost=438.41..438.41 rows=11819 width=8) (actual time=3.905..3.905 rows=14824 loops=3)
                                 Buckets: 16384  Batches: 1  Memory Usage: 708kB
                                 ->  Index Scan using module_result_run_id_idx on module_result  (cost=0.57..438.41 rows=11819 width=8) (actual time=0.017..2.197 rows=14824 loops=3)
                                       Index Cond: (run_id = 158523)
 Planning Time: 0.178 ms
 Execution Time: 32343.330 ms
EXPLAIN
Time: 32.572s (32 seconds), executed in: 32.551s (32 seconds)
This plan takes about 30s to execute. If I turn off seqscan, I get a nested
loop join that takes about 0.02s to execute:
set enable_seqscan = off
SET
Time: 0.305s
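(Side note: I believe the same experiment can also be scoped to a single transaction with SET LOCAL, so nothing else running on the instance is affected; something like:)

```sql
BEGIN;
-- SET LOCAL only affects the current transaction and reverts
-- automatically at COMMIT or ROLLBACK
SET LOCAL enable_seqscan = off;
SELECT test_result.status, count(test_result.status) AS "Count"
FROM test_result
INNER JOIN module_result ON module_result.id = test_result.module_result_id
WHERE module_result.run_id = 158523
GROUP BY test_result.status;
COMMIT;
```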
explain analyze select test_result.status, count(test_result.status) as "Count"
from test_result
inner join module_result ON module_result.id = test_result.module_result_id
where module_result.run_id = 158523
group by test_result.status;
 Finalize GroupAggregate  (cost=34297042.16..34297143.50 rows=3 width=12) (actual time=15.014..15.015 rows=2 loops=1)
   Group Key: test_result.status
   ->  Gather Merge  (cost=34297042.16..34297143.44 rows=6 width=12) (actual time=15.005..15.850 rows=6 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=34296042.13..34296142.72 rows=3 width=12) (actual time=12.937..12.940 rows=2 loops=3)
               Group Key: test_result.status
               ->  Sort  (cost=34296042.13..34296075.65 rows=13408 width=4) (actual time=12.339..12.559 rows=4941 loops=3)
                     Sort Key: test_result.status
                     Sort Method: quicksort  Memory: 461kB
                     Worker 0:  Sort Method: quicksort  Memory: 403kB
                     Worker 1:  Sort Method: quicksort  Memory: 408kB
                     ->  Nested Loop  (cost=232.74..34295122.96 rows=13408 width=4) (actual time=0.232..11.671 rows=4941 loops=3)
                           ->  Parallel Bitmap Heap Scan on module_result  (cost=232.17..44321.35 rows=4925 width=8) (actual time=0.218..0.671 rows=4941 loops=3)
                                 Recheck Cond: (run_id = 158523)
                                 Heap Blocks: exact=50
                                 ->  Bitmap Index Scan on module_result_run_id_idx  (cost=0.00..229.21 rows=11819 width=0) (actual time=0.592..0.592 rows=14824 loops=1)
                                       Index Cond: (run_id = 158523)
                           ->  Index Scan using test_result_module_result_id_idx on test_result  (cost=0.57..6911.17 rows=4331 width=12) (actual time=0.002..0.002 rows=1 loops=14824)
                                 Index Cond: (module_result_id = module_result.id)
 Planning Time: 0.214 ms
 Execution Time: 15.932 ms
EXPLAIN
Time: 0.235s
I don't think it's recommended to turn off seqscan in production, so I'm looking for a way to make the query planner choose the significantly faster plan. How can I do that? It's probably related to statistics, but they should be up to date (I've run ANALYZE several times).
Any pointers would be very helpful,
thank you,
iulian
Your system is preferring sequential scan to
using test_result_module_result_id_idx in this case. What type of storage
do you use, what type of cache hits do you expect, and what do you have
random_page_cost set to? That comes to mind as a significant factor in
choosing index scans based on costs.
Hi Michael,
Thanks for the answer. It's an RDS instance using SSD storage and the
default `random_page_cost` set to 4.0. I don't expect a lot of repetitive
queries here, so I think caching may not be extremely useful. I wonder if
the selectivity of the query is wrongly estimated (out of 500 million rows, only a few thousand are returned).
I tried lowering the `random_page_cost` to 1.2 and it didn't make a
difference in the query plan.
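For reference, the experiment looked roughly like this (1.2 being a commonly suggested value for SSD-backed storage):

```sql
-- Session-local change; the default on this instance was 4.0
SET random_page_cost = 1.2;
EXPLAIN ANALYZE
SELECT test_result.status, count(test_result.status) AS "Count"
FROM test_result
INNER JOIN module_result ON module_result.id = test_result.module_result_id
WHERE module_result.run_id = 158523
GROUP BY test_result.status;
-- back to the default
RESET random_page_cost;
```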
iulian
On Fri, Aug 21, 2020 at 6:30 PM Michael Lewis <mlewis@entrata.com> wrote:
On Mon, Aug 24, 2020 at 4:21 PM iulian dragos <iulian.dragos@databricks.com>
wrote:
> I tried lowering the `random_page_cost` to 1.2 and it didn't make a
> difference in the query plan.
I experimented a bit more with different values for this setting. The only
way I could make it use the index was to use a value strictly less than
`seq_page_cost` (0.8 for instance). That doesn't sound right, though.
effective_cache_size is fairly high as well (32GB) for an instance with 64GB of RAM (db.m5.4xlarge).
iulian
On Sat, 22 Aug 2020 at 00:35, iulian dragos
<iulian.dragos@databricks.com> wrote:
> I am trying to understand why the query planner insists on using a hash join, and how to make it choose the better option, which in this case would be a nested loop.
>
> ->  Index Scan using test_result_module_result_id_idx on test_result  (cost=0.57..6911.17 rows=4331 width=12) (actual time=0.002..0.002 rows=1 loops=14824)
>       Index Cond: (module_result_id = module_result.id)
You might want to check if the pg_stats view reports a realistic
n_distinct value for test_result.module_result_id. If
pg_class.reltuples is correct for that relation, then that would
indicate the n_distinct estimate is about 115000. Going by the number
of rows you've mentioned, a more realistic value would be -0.4, which
is 0 - 1 / (500000000 / 200000000.0). However, that's assuming each
module_result has a test_result. You could run
SELECT COUNT(DISTINCT module_result_id) FROM test_result;
to get a better idea.

If ANALYZE is not getting you a good value for n_distinct, then you
can overwrite it. See [1], search for n_distinct.
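Something along these lines should show both numbers (the distinct count will be slow on a 500M-row table, since it needs a full scan):

```sql
-- What the planner currently believes about the join key
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'test_result'
  AND attname = 'module_result_id';

-- The real distinct count (expensive on ~500M rows)
SELECT count(DISTINCT module_result_id) FROM test_result;
```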
David
[1]: https://www.postgresql.org/docs/current/sql-altertable.html
On Tue, Aug 25, 2020 at 12:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
Thanks for the tip! Indeed, `n_distinct` isn't right. I found it in
pg_stats set at 131736.0, but the actual number is much higher: 210104361.
I tried to set it manually, but the plan is still the same (I tried both the actual number and the ratio, -0.4, as you suggested):
ALTER TABLE test_result ALTER COLUMN module_result_id SET (n_distinct = 210104361);
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
ALTER TABLE
Time: 0.205s
On Tue, 25 Aug 2020 at 22:10, iulian dragos
<iulian.dragos@databricks.com> wrote:
> Thanks for the tip! Indeed, `n_distinct` isn't right. I found it in pg_stats set at 131736.0, but the actual number is much higher: 210104361. I tried to set it manually, but the plan is still the same (both the actual number and a percentage, -0.4, as you suggested):
You'll need to run ANALYZE on the table after doing the ALTER TABLE to
change the n_distinct. ANALYZE writes the value to pg_statistic;
ALTER TABLE only takes it as far as pg_attribute's attoptions.
ANALYZE reads that column to see if the n_distinct estimate should be
overwritten before writing out pg_statistic.
Just remember if you're hardcoding a positive value that it'll stay
fixed until you change it. If the table is likely to grow, then you
might want to reconsider using a positive value and consider using a
negative value as mentioned in the doc link.
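Putting both points together, the full sequence would look roughly like this (using the negative, scale-invariant form):

```sql
-- Store the override as a ratio: distinct values ≈ 40% of the row count,
-- so it stays reasonable as the table grows
ALTER TABLE test_result
    ALTER COLUMN module_result_id SET (n_distinct = -0.4);

-- Required: ANALYZE copies the override from pg_attribute's attoptions
-- into pg_statistic, where the planner actually reads it
ANALYZE test_result;
```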
David
On Tue, Aug 25, 2020 at 12:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
> You'll need to run ANALYZE on the table after doing the ALTER TABLE to
> change the n_distinct. The ANALYZE writes the value to pg_statistic.
> ALTER TABLE only takes it as far as pg_attribute's attoptions.
> ANALYZE reads that column to see if the n_distinct estimate should be
> overwritten before writing out pg_statistic
Ah, rookie mistake. Thanks for clarifying this. Indeed, after I ran ANALYZE
the faster plan was selected! Yay!
> Just remember if you're hardcoding a positive value that it'll stay
> fixed until you change it. If the table is likely to grow, then you
> might want to reconsider using a positive value and consider using a
> negative value as mentioned in the doc link.
Good point, I went for -0.4 and that seems to be doing the trick!
Thanks a lot for helping out!