Query plan prefers hash join when nested loop is much faster
Hi,
I am trying to understand why the query planner insists on using a hash
join, and how to make it choose the better option, which in this case would
be a nested loop. I have two tables:
-- about 200 million rows
CREATE TABLE module_result (
    id bigserial PRIMARY KEY,
    name_id bigint NOT NULL REFERENCES result_name (id),
    run_id integer NOT NULL REFERENCES run (id),
    logs text NOT NULL,
    status result_status NOT NULL
);
CREATE INDEX ON module_result (run_id);

-- about 500 million rows
CREATE TABLE test_result (
    id bigserial PRIMARY KEY,
    name_id bigint NOT NULL REFERENCES result_name (id),
    module_result_id bigint NOT NULL REFERENCES module_result (id),
    seconds float NOT NULL,
    failure_msg text,  -- either a <failure>...</failure> or an <error message="..." />
    status result_status NOT NULL
);
CREATE INDEX ON test_result (module_result_id);
I'm trying to select all test cases that belong to a given run_id, which logically means finding all IDs in module_result that belong to a given run, and then selecting the test results for those IDs (each run_id maps to several module_result rows, each of which has several test_result rows).
EXPLAIN ANALYZE SELECT test_result.status, count(test_result.status) AS "Count"
FROM test_result
INNER JOIN module_result ON module_result.id = test_result.module_result_id
WHERE module_result.run_id = 158523
GROUP BY test_result.status;
 Finalize GroupAggregate  (cost=7771702.73..7771804.08 rows=3 width=12) (actual time=32341.993..32341.994 rows=2 loops=1)
   Group Key: test_result.status
   ->  Gather Merge  (cost=7771702.73..7771804.02 rows=6 width=12) (actual time=32341.970..32343.222 rows=6 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=7770702.71..7770803.30 rows=3 width=12) (actual time=32340.278..32340.286 rows=2 loops=3)
               Group Key: test_result.status
               ->  Sort  (cost=7770702.71..7770736.23 rows=13408 width=4) (actual time=32339.698..32339.916 rows=4941 loops=3)
                     Sort Key: test_result.status
                     Sort Method: quicksort  Memory: 431kB
                     Worker 0:  Sort Method: quicksort  Memory: 433kB
                     Worker 1:  Sort Method: quicksort  Memory: 409kB
                     ->  Hash Join  (cost=586.15..7769783.54 rows=13408 width=4) (actual time=18112.078..32339.011 rows=4941 loops=3)
                           Hash Cond: (test_result.module_result_id = module_result.id)
                           ->  Parallel Seq Scan on test_result  (cost=0.00..7145224.72 rows=237703872 width=12) (actual time=0.034..15957.894 rows=190207740 loops=3)
                           ->  Hash  (cost=438.41..438.41 rows=11819 width=8) (actual time=3.905..3.905 rows=14824 loops=3)
                                 Buckets: 16384  Batches: 1  Memory Usage: 708kB
                                 ->  Index Scan using module_result_run_id_idx on module_result  (cost=0.57..438.41 rows=11819 width=8) (actual time=0.017..2.197 rows=14824 loops=3)
                                       Index Cond: (run_id = 158523)
 Planning Time: 0.178 ms
 Execution Time: 32343.330 ms
EXPLAIN
Time: 32.572s (32 seconds), executed in: 32.551s (32 seconds)
This plan takes about 30s to execute. If I turn off seqscan, I get a nested
loop join that takes about 0.02s to execute:
set enable_seqscan = off
SET
Time: 0.305s
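(Side note: I believe the same experiment can also be scoped to a single transaction with SET LOCAL, so nothing else running on the instance is affected; something like:)

```sql
BEGIN;
-- SET LOCAL only affects the current transaction and reverts
-- automatically at COMMIT or ROLLBACK
SET LOCAL enable_seqscan = off;
SELECT test_result.status, count(test_result.status) AS "Count"
FROM test_result
INNER JOIN module_result ON module_result.id = test_result.module_result_id
WHERE module_result.run_id = 158523
GROUP BY test_result.status;
COMMIT;
```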
explain analyze select test_result.status, count(test_result.status) as "Count"
from test_result
inner join module_result ON module_result.id = test_result.module_result_id
where module_result.run_id = 158523
group by test_result.status;
 Finalize GroupAggregate  (cost=34297042.16..34297143.50 rows=3 width=12) (actual time=15.014..15.015 rows=2 loops=1)
   Group Key: test_result.status
   ->  Gather Merge  (cost=34297042.16..34297143.44 rows=6 width=12) (actual time=15.005..15.850 rows=6 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=34296042.13..34296142.72 rows=3 width=12) (actual time=12.937..12.940 rows=2 loops=3)
               Group Key: test_result.status
               ->  Sort  (cost=34296042.13..34296075.65 rows=13408 width=4) (actual time=12.339..12.559 rows=4941 loops=3)
                     Sort Key: test_result.status
                     Sort Method: quicksort  Memory: 461kB
                     Worker 0:  Sort Method: quicksort  Memory: 403kB
                     Worker 1:  Sort Method: quicksort  Memory: 408kB
                     ->  Nested Loop  (cost=232.74..34295122.96 rows=13408 width=4) (actual time=0.232..11.671 rows=4941 loops=3)
                           ->  Parallel Bitmap Heap Scan on module_result  (cost=232.17..44321.35 rows=4925 width=8) (actual time=0.218..0.671 rows=4941 loops=3)
                                 Recheck Cond: (run_id = 158523)
                                 Heap Blocks: exact=50
                                 ->  Bitmap Index Scan on module_result_run_id_idx  (cost=0.00..229.21 rows=11819 width=0) (actual time=0.592..0.592 rows=14824 loops=1)
                                       Index Cond: (run_id = 158523)
                           ->  Index Scan using test_result_module_result_id_idx on test_result  (cost=0.57..6911.17 rows=4331 width=12) (actual time=0.002..0.002 rows=1 loops=14824)
                                 Index Cond: (module_result_id = module_result.id)
 Planning Time: 0.214 ms
 Execution Time: 15.932 ms
EXPLAIN
Time: 0.235s
I don't think it's recommended to turn off seqscan in production, so I'm looking for a way to make the query planner choose the significantly faster plan. How can I do that? It's probably related to statistics, but they should be up to date (I've run ANALYZE several times).
Any pointers would be very helpful,
thank you,
iulian
Your system is preferring sequential scan to
using test_result_module_result_id_idx in this case. What type of storage
do you use, what type of cache hits do you expect, and what do you have
random_page_cost set to? That comes to mind as a significant factor in
choosing index scans based on costs.
Hi Michael,
Thanks for the answer. It's an RDS instance using SSD storage and the
default `random_page_cost` set to 4.0. I don't expect a lot of repetitive
queries here, so I think caching may not be extremely useful. I wonder if
the selectivity of the query is wrongly estimated (out of 500 million rows, only a few thousand are returned).
I tried lowering the `random_page_cost` to 1.2 and it didn't make a
difference in the query plan.
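For reference, the experiment looked roughly like this (1.2 being a commonly suggested value for SSD-backed storage):

```sql
-- Session-local change; the default on this instance was 4.0
SET random_page_cost = 1.2;
EXPLAIN ANALYZE
SELECT test_result.status, count(test_result.status) AS "Count"
FROM test_result
INNER JOIN module_result ON module_result.id = test_result.module_result_id
WHERE module_result.run_id = 158523
GROUP BY test_result.status;
-- back to the default
RESET random_page_cost;
```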
iulian
On Fri, Aug 21, 2020 at 6:30 PM Michael Lewis <mlewis@entrata.com> wrote:
On Mon, Aug 24, 2020 at 4:21 PM iulian dragos <iulian.dragos@databricks.com>
wrote:
> I tried lowering the `random_page_cost` to 1.2 and it didn't make a
> difference in the query plan.
I experimented a bit more with different values for this setting. The only
way I could make it use the index was to use a value strictly less than
`seq_page_cost` (0.8 for instance). That doesn't sound right, though.
effective_cache_size is fairly high as well (32GB) for an instance with 64GB of RAM (db.m5.4xlarge).
iulian
On Sat, 22 Aug 2020 at 00:35, iulian dragos
<iulian.dragos@databricks.com> wrote:
> I am trying to understand why the query planner insists on using a hash join, and how to make it choose the better option, which in this case would be a nested loop.
>
> ->  Index Scan using test_result_module_result_id_idx on test_result  (cost=0.57..6911.17 rows=4331 width=12) (actual time=0.002..0.002 rows=1 loops=14824)
>       Index Cond: (module_result_id = module_result.id)
You might want to check if the pg_stats view reports a realistic
n_distinct value for test_result.module_result_id. If
pg_class.reltuples is correct for that relation, then that would
indicate the n_distinct estimate is about 115000. Going by the number
of rows you've mentioned, a more realistic value would be -0.4, which
is 0 - 1 / (500000000 / 200000000.0). However, that's assuming each
module_result has a test_result. You could run
SELECT COUNT(DISTINCT module_result_id) FROM test_result;
to get a better idea.

If ANALYZE is not getting you a good value for n_distinct, then you
can overwrite it. See [1], search for n_distinct.
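Something along these lines should show both numbers (the distinct count will be slow on a 500M-row table, since it needs a full scan):

```sql
-- What the planner currently believes about the join key
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'test_result'
  AND attname = 'module_result_id';

-- The real distinct count (expensive on ~500M rows)
SELECT count(DISTINCT module_result_id) FROM test_result;
```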
David
[1]: https://www.postgresql.org/docs/current/sql-altertable.html
On Tue, Aug 25, 2020 at 12:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
Thanks for the tip! Indeed, `n_distinct` isn't right. I found it in
pg_stats set at 131736.0, but the actual number is much higher: 210104361.
I tried to set it manually, but the plan is still the same (I tried both the actual number and the ratio, -0.4, as you suggested):
ALTER TABLE test_result ALTER COLUMN module_result_id SET (n_distinct = 210104361);
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
ALTER TABLE
Time: 0.205s
On Tue, 25 Aug 2020 at 22:10, iulian dragos
<iulian.dragos@databricks.com> wrote:
> Thanks for the tip! Indeed, `n_distinct` isn't right. I found it in pg_stats set at 131736.0, but the actual number is much higher: 210104361. I tried to set it manually, but the plan is still the same (both the actual number and a percentage, -0.4, as you suggested):
You'll need to run ANALYZE on the table after doing the ALTER TABLE to
change the n_distinct. ANALYZE writes the value to pg_statistic;
ALTER TABLE only takes it as far as pg_attribute's attoptions.
ANALYZE reads that column to see if the n_distinct estimate should be
overwritten before writing out pg_statistic.
Just remember if you're hardcoding a positive value that it'll stay
fixed until you change it. If the table is likely to grow, then you
might want to reconsider using a positive value and consider using a
negative value as mentioned in the doc link.
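Putting both points together, the full sequence would look roughly like this (using the negative, scale-invariant form):

```sql
-- Store the override as a ratio: distinct values ≈ 40% of the row count,
-- so it stays reasonable as the table grows
ALTER TABLE test_result
    ALTER COLUMN module_result_id SET (n_distinct = -0.4);

-- Required: ANALYZE copies the override from pg_attribute's attoptions
-- into pg_statistic, where the planner actually reads it
ANALYZE test_result;
```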
David
On Tue, Aug 25, 2020 at 12:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
> You'll need to run ANALYZE on the table after doing the ALTER TABLE to
> change the n_distinct. The ANALYZE writes the value to pg_statistic.
> ALTER TABLE only takes it as far as pg_attribute's attoptions.
> ANALYZE reads that column to see if the n_distinct estimate should be
> overwritten before writing out pg_statistic
Ah, rookie mistake. Thanks for clarifying this. Indeed, after I ran ANALYZE
the faster plan was selected! Yay!
> Just remember if you're hardcoding a positive value that it'll stay
> fixed until you change it. If the table is likely to grow, then you
> might want to reconsider using a positive value and consider using a
> negative value as mentioned in the doc link.
Good point, I went for -0.4 and that seems to be doing the trick!
Thanks a lot for helping out!