BUG #18854: PostgreSQL chooses a suboptimal execution plan when using a specific WHERE filter
The following bug has been logged on the website:
Bug reference: 18854
Logged by: Yoni Sade
Email address: yonisade83@gmail.com
PostgreSQL version: 16.8
Operating system: RDS Linux
Description:
Query has a specific WHERE filter for tiny lookup table "t8" (19 rows) which
generates an execution plan which has two "Nested Loop" steps instead of
"Hash Join" steps which makes the query finish after 90 seconds (and spiking
the server CPU when being ran by dozens of sessions simultaneously) instead
of ~3 seconds (when commenting out that specific WHERE filter clause I
mentioned below):
Query #1 (runtime: ~90 seconds)::
SELECT count(*) AS total FROM ( SELECT 1
FROM v_m
JOIN (SELECT *, generate_series(mv.start_year, mv.end_year) as year
FROM mv) t3
on v_m.m_id = t3.m_id
LEFT JOIN t9 ON t3.mm_id = t9.mm_id and t3.year = t9.year
LEFT JOIN v_mm ON t3.mm_id = v_mm.mm_id
LEFT JOIN t8 ON v_m.id = t8.mm_f_id
LEFT JOIN t7 ON t8.sm_f_id = t7.f_id
WHERE v_m.is_exists = true
AND t3.is_active = true
AND v_m.mk_id IN (7)
AND t3.year >= 2021
AND t3.year <= 2024
AND t8.sm_f_id IN (8) -- this specific clause slows down the query
GROUP BY v_m.m_id, v_m.m_h, t3.make, v_m.img, COALESCE(t9.is_new, false),
t3.year) T;
Execution plan #1:
Node Type Entity Cost Rows Time Condition
Aggregate [NULL] 44501.78 - 44501.79 1 [NULL] [NULL]
Group [NULL] 44501.48 - 44501.68 8 [NULL] [NULL]
Incremental Sort [NULL] 44501.48 - 44501.60 8 [NULL] [NULL]
Nested Loop [NULL] 29224.92 - 44501.36 8 [NULL] [NULL]
Nested Loop [NULL] 29224.92 - 44486.38 4 [NULL] [NULL]
Nested Loop [NULL] 2048.29 - 13628.25 1 [NULL] [NULL]
Nested Loop [NULL] 2047.87 - 13623.80 1 [NULL] [NULL]
Nested Loop [NULL] 1047.87 - 1060.89 1 [NULL] COALESCE(((count(*)) > 0),
FALSE)
Index Scan t1 0.27 - 8.29 1 [NULL] (mk_id = 7)
Aggregate [NULL] 1047.60 - 1049.60 200 [NULL] [NULL]
Seq Scan t2 0.00 - 993.28 10864 [NULL] is_active
Subquery Scan [NULL] 1000.00 - 12562.73 15 [NULL] "((t3.year >= 2021)
AND (t3.year <= 2024))"
Gather [NULL] 1000.00 - 12516.33 3093 [NULL] [NULL]
ProjectSet [NULL] 0.00 - 11207.03 1819000 [NULL] [NULL]
Index Only Scan t9 0.42 - 4.44 1 [NULL] "((mm_id = t3.mm_id)
AND (YEAR = t3.year))"
Hash Join [NULL] 27176.63 - 30848.49 771 [NULL] (t4.f_id = t6.mm_f_id)
Unique [NULL] 27175.38 - 28718.18 154280 [NULL] [NULL]
Sort [NULL] 27175.38 - 27561.08 154280 [NULL] [NULL]
Append [NULL] 16.07 - 3859.65 154280 [NULL] [NULL]
Hash Join [NULL] 16.07 - 773.18 38220 [NULL] (t4.f_id = t7.f_id)
Hash Join [NULL] 15.85 - 2315.07 116060 [NULL] (t5.t_id = t.t_id)
Hash [NULL] 1.24 - 1.24 1 [NULL] [NULL]
Seq Scan t8 0.00 - 1.24 1 [NULL] (sm_f_id = 8)
Materialize [NULL] 0.00 - 14.88 2 [NULL] [NULL]
Seq Scan t7 0.00 - 14.88 2 [NULL] (f_id = 8)
Query #2 (runtime: ~3 seconds):
SELECT count(*) AS total FROM ( SELECT 1
FROM v_m
JOIN (SELECT *, generate_series(mv.start_year, mv.end_year) as year
FROM mv) t3
on v_m.m_id = t3.m_id
LEFT JOIN t9 ON t3.mm_id = t9.mm_id and t3.year = t9.year
LEFT JOIN v_mm ON t3.mm_id = v_mm.mm_id
LEFT JOIN t8 ON v_m.id = t8.mm_f_id
LEFT JOIN t7 ON t8.sm_f_id = t7.f_id
WHERE v_m.is_exists = true
AND t3.is_active = true
AND v_m.mk_id IN (7)
AND t3.year >= 2021
AND t3.year <= 2024
GROUP BY v_m.m_id, v_m.m_h, t3.make, v_m.img, COALESCE(t9.is_new, false),
t3.year) T;
Execution plan #2:
Node Type Entity Cost Rows Time Condition
Aggregate [NULL] 44501.78 - 44501.79 1 [NULL] [NULL]
Group [NULL] 44501.48 - 44501.68 8 [NULL] [NULL]
Incremental Sort [NULL] 44501.48 - 44501.60 8 [NULL] [NULL]
Nested Loop [NULL] 29224.92 - 44501.36 8 [NULL] [NULL]
Nested Loop [NULL] 29224.92 - 44486.38 4 [NULL] [NULL]
Nested Loop [NULL] 2048.29 - 13628.25 1 [NULL] [NULL]
Nested Loop [NULL] 2047.87 - 13623.80 1 [NULL] [NULL]
Nested Loop [NULL] 1047.87 - 1060.89 1 [NULL] COALESCE(((count(*)) > 0),
FALSE)
Index Scan t1 0.27 - 8.29 1 [NULL] (mk_id = 7)
Aggregate [NULL] 1047.60 - 1049.60 200 [NULL] [NULL]
Seq Scan t2 0.00 - 993.28 10864 [NULL] is_active
Subquery Scan [NULL] 1000.00 - 12562.73 15 [NULL] "((t3.year >= 2021)
AND (t3.year <= 2024))"
Gather [NULL] 1000.00 - 12516.33 3093 [NULL] [NULL]
ProjectSet [NULL] 0.00 - 11207.03 1819000 [NULL] [NULL]
Index Only Scan t9 0.42 - 4.44 1 [NULL] "((mm_id = t3.mm_id)
AND (YEAR = t3.year))"
Hash Join [NULL] 27176.63 - 30848.49 771 [NULL] (t4.f_id = t6.mm_f_id)
Unique [NULL] 27175.38 - 28718.18 154280 [NULL] [NULL]
Sort [NULL] 27175.38 - 27561.08 154280 [NULL] [NULL]
Append [NULL] 16.07 - 3859.65 154280 [NULL] [NULL]
Hash Join [NULL] 16.07 - 773.18 38220 [NULL] (t4.f_id = t7.f_id)
Hash Join [NULL] 15.85 - 2315.07 116060 [NULL] (t5.t_id = t.t_id)
Hash [NULL] 1.24 - 1.24 1 [NULL] [NULL]
Seq Scan t8 0.00 - 1.24 1 [NULL] (sm_f_id = 8)
Materialize [NULL] 0.00 - 14.88 2 [NULL] [NULL]
Seq Scan t7 0.00 - 14.88 2 [NULL] (f_id = 8)
On 3/18/25 12:34, PG Bug reporting form wrote:
Query has a specific WHERE filter for tiny lookup table "t8" (19 rows) which
generates an execution plan which has two "Nested Loop" steps instead of
"Hash Join" steps which makes the query finish after 90 seconds (and spiking
the server CPU when being ran by dozens of sessions simultaneously) instead
of ~3 seconds (when commenting out that specific WHERE filter clause I
mentioned below):
May you provide EXPLAIN ANALYZE VERBOSE in standard text output format?
--
regards, Andrei Lepikhov
I would like to add that commenting out that WHERE clause doesn't work for
all of this query variations so I've found another workaround for now
("LIMIT 1000000" in *bold*)
to make the optimizer better estimate the cardinality of the subquery and
change the plan to a better one:
*JOIN (SELECT *, generate_series(mv.start_year, mv.end_year) as year*
*FROM mv LIMIT 1000000) t3*
This is the previous (suboptimal) plan in the format you asked for (with
table/column names obfuscated):
Aggregate (cost=44501.78..44501.79 rows=1 width=8) (actual time=140784.531..
140784.633 rows=1 loops=1)
Output: count(*)
-> Group (cost=44501.48..44501.68 rows=8 width=679) (actual time=4656.869..
140784.563 rows=153 loops=1)
Output: NULL::integer, t1.***********, t1.***********, t3.***********,
t1.***********, (COALESCE(p.is_new, false)), t3.year
Group Key: t1.***********, t3.***********, (COALESCE(p.is_new, false)), t3.
year
-> Incremental Sort (cost=44501.48..44501.60 rows=8 width=675) (actual time
=4656.866..140784.160 rows=1994 loops=1)
Output: t1.***********, t3.***********, (COALESCE(p.is_new, false)),
t3.***********, t1.***********, t1.***********
Sort Key: t1.***********, t3.***********, (COALESCE(p.is_new, false)), t3.
year
Presorted Key: t1.m_id
Full-sort Groups: 24 Sort Method: quicksort Average Memory: 30kB Peak
Memory: 30kB
Pre-sorted Groups: 15 Sort Method: quicksort Average Memory: 34kB Peak
Memory: 34kB
-> Nested Loop Left Join (cost=29224.92..44501.36 rows=8 width=675) (actual
time=239.881..140777.492 rows=1994 loops=1)
Output: t1.***********, t3.***********, COALESCE(p.is_new, false),
t3.***********, t1.***********, t1.***********
-> Nested Loop (cost=29224.92..44486.38 rows=4 width=679) (actual time
=239.855..140771.693 rows=1994 loops=1)
Output: t1.***********, t1.***********, t1.***********, t3.***********,
t3.***********, t9.***********, t8.***********
Join Filter: (t3.mm_id = t4.mm_id)
Rows Removed by Join Filter: 33142274
-> Nested Loop Left Join (cost=2048.29..13628.25 rows=1 width=679) (actual
time=13.032..1942.680 rows=1994 loops=1)
Output: t1.***********, t1.***********, t1.***********, t3.***********,
t3.***********, t3.***********, t9.***********
-> Nested Loop (cost=2047.87..13623.80 rows=1 width=678) (actual time
=13.010..1906.470 rows=1650 loops=1)
Output: t1.***********, t1.***********, t1.***********, t3.***********,
t3.***********, t3.***********
Join Filter: (t1.m_id = t3.m_id)
Rows Removed by Join Filter: 632234
-> Nested Loop Left Join (cost=1047.87..1060.89 rows=1 width=552) (actual
time=11.398..19.784 rows=37 loops=1)
Output: t1.***********, t1.***********, t1.***********
Inner Unique: true
Join Filter: (t1.m_id = t2.m_id)
Rows Removed by Join Filter: 29232
Filter: COALESCE(((count(*)) > 0), false)
Rows Removed by Filter: 12
-> Index Scan using mds_mk_id_m_id_key on mds m (cost=0.27..8.29 rows=1
width=552) (actual time=0.023..0.223 rows=49 loops=1)
Output: t1.***********, t1.***********, t1.***********, t1.***********,
t1.***********, t1.***********, t1.***********, t1.***********,
t1.***********, t1.***********, t1.***********, t1.***********,
t1.***********
Index Cond: (t1.mk_id = 7)
-> HashAggregate (cost=1047.60..1049.60 rows=200 width=12) (actual time
=0.234..0.360 rows=597 loops=49)
Output: count(*), t2.m_id
Group Key: t2.m_id
Batches: 1 Memory Usage: 209kB
-> Seq Scan on t2 (cost=0.00..993.28 rows=10864 width=4) (actual time
=0.040..7.570 rows=20717 loops=1)
Output: t2.***********, t2.***********, t2.***********, t2.***********,
t2.***********, t2.***********, t2.***********, t2.***********,
t2.***********, t2.***********, t2.***********, t2.***********,
t2.***********, t2.***********, t2.***********, t2.***********,
t2.***********, t2.***********, t2.***********, t2.***********,
t2.***********, t2.***********, t2.***********, t2.***********,
t2.***********, t2.***********, t2.***********, t2.***********,
t2.***********
Filter: t2.is_active
Rows Removed by Filter: 11985
-> Subquery Scan on t3 (cost=1000.00..12562.73 rows=15 width=130) (actual
time=1.405..49.852 rows=17132 loops=37)
Output: t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********, t3.***********, t3.***********,
t3.***********, t3.***********
Filter: ((t3.year >= 2021) AND (t3.year <= 2024))
Rows Removed by Filter: 44378
-> Gather (cost=1000.00..12516.33 rows=3093 width=2335) (actual time=1.365..
44.739 rows=61510 loops=37)
Output: NULL::timestamp without time zone, NULL::timestamp without time zone,
mv.***********, NULL::text, NULL::integer, mv.***********, NULL::character
varying(50), mv.***********, NULL::character varying(255), NULL::integer,
NULL::character varying, NULL::integer, NULL::character varying(100), NULL::
integer, NULL::character varying(50), NULL::integer, NULL::character varying
(50), NULL::integer, NULL::character varying(50), NULL::integer, NULL::
character varying(50), NULL::integer, NULL::integer, NULL::integer, NULL::
character varying(50), NULL::integer, NULL::character varying(50), NULL::
integer, NULL::character varying(50), NULL::integer, NULL::character varying
(50), NULL::integer, NULL::text, NULL::text, NULL::integer, NULL::integer,
NULL::integer, NULL::integer, NULL::boolean, NULL::boolean, NULL::boolean,
NULL::boolean, NULL::integer[], NULL::text[], NULL::text[], NULL::text, NULL
::text, NULL::boolean, NULL::integer, NULL::integer, NULL::integer, NULL::
timestamp without time zone, NULL::timestamp without time zone, (
generate_series(mv.s_year, mv.e_year))
Workers Planned: 1
Workers Launched: 1
-> ProjectSet (cost=0.00..11207.03 rows=1819000 width=2335) (actual time
=0.034..25.766 rows=30755 loops=74)
Output: NULL::timestamp without time zone, NULL::timestamp without time zone,
mv.***********, NULL::text, NULL::integer, mv.***********, NULL::character
varying(50), mv.***********, NULL::character varying(255), NULL::integer,
NULL::character varying, NULL::integer, NULL::character varying(100), NULL::
integer, NULL::character varying(50), NULL::integer, NULL::character varying
(50), NULL::integer, NULL::character varying(50), NULL::integer, NULL::
character varying(50), NULL::integer, NULL::integer, NULL::integer, NULL::
character varying(50), NULL::integer, NULL::character varying(50), NULL::
integer, NULL::character varying(50), NULL::integer, NULL::character varying
(50), NULL::integer, NULL::text, NULL::text, NULL::integer, NULL::integer,
NULL::integer, NULL::integer, NULL::boolean, NULL::boolean, NULL::boolean,
NULL::boolean, NULL::integer[], NULL::text[], NULL::text[], NULL::text, NULL
::text, NULL::boolean, NULL::integer, NULL::integer, NULL::integer, NULL::
timestamp without time zone, NULL::timestamp without time zone,
generate_series(mv.s_year, mv.e_year)
Worker 0: actual time=0.019..33.874 rows=40139 loops=37
-> Parallel Seq Scan on mv (cost=0.00..2098.39 rows=1819 width=134) (actual
time=0.029..10.233 rows=10358 loops=74)
Output: mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********, mv.***********, mv.***********, mv.***********,
mv.***********
Filter: mv.is_active
Rows Removed by Filter: 5992
Worker 0: actual time=0.016..13.530 rows=13204 loops=37
-> Index Only Scan using t9_un on t9 (cost=0.42..4.44 rows=1 width=9)
(actual time=0.014..0.018 rows=1 loops=1650)
Output: t9.***********, t9.***********, t9.***********
Index Cond: ((p.mm_id = t3.mm_id) AND (p.year = t3.year))
Heap Fetches: 0
-> Hash Join (cost=27176.63..30848.49 rows=771 width=8) (actual time=0.084..
68.596 rows=16622 loops=1994)
Output: t4.***********, t8.sm_f_id
Hash Cond: (t4.f_id = t8.mm_f_id)
-> Unique (cost=27175.38..28718.18 rows=154280 width=126) (actual time
=0.084..50.368 rows=166526 loops=1994)
Output: t4.***********, t4.***********, t7.***********
-> Sort (cost=27175.38..27561.08 rows=154280 width=126) (actual time=0.083..
19.173 rows=166526 loops=1994)
Output: t4.***********, t4.***********, t7.***********
Sort Key: t4.***********, t4.***********, t7.***********
Sort Method: external sort Disk: 6912kB
-> Append (cost=16.07..3859.65 rows=154280 width=126) (actual time=0.025..
67.960 rows=166526 loops=1)
-> Hash Left Join (cost=16.07..773.18 rows=38220 width=126) (actual time
=0.024..12.886 rows=40136 loops=1)
Output: t4.***********, t4.***********, t7.***********
Inner Unique: true
Hash Cond: (t4.f_id = t7.f_id)
-> Seq Scan on t4 (cost=0.00..655.20 rows=38220 width=8) (actual time
=0.005..3.407 rows=40136 loops=1)
Output: t4.***********, t4.***********, t4.***********, t4.***********,
t4.***********, t4.***********, t4.***********
-> Hash (cost=12.70..12.70 rows=270 width=122) (actual time=0.013..0.014
rows=17 loops=1)
Output: t7.***********, t7.***********
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on t7 (cost=0.00..12.70 rows=270 width=122) (actual time=0.004..
0.008 rows=17 loops=1)
Output: t7.***********, t7.***********
-> Hash Left Join (cost=15.85..2315.07 rows=116060 width=126) (actual time
=0.036..41.281 rows=126390 loops=1)
Output: t10.***********, t10.***********, t5.************
Inner Unique: true
Hash Cond: (t10.t_id = t5.t_id)
-> Seq Scan on t10 (cost=0.00..1989.60 rows=116060 width=8) (actual time
=0.012..10.756 rows=126390 loops=1)
Output: t10.***********, t10.***********, t10.***********, t10.***********,
t10.***********, t10.***********, t10.***********
-> Hash (cost=12.60..12.60 rows=260 width=122) (actual time=0.013..0.014
rows=16 loops=1)
Output: t5.***********, t5.***********
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on t5 (cost=0.00..12.60 rows=260 width=122) (actual time=0.005..
0.008 rows=16 loops=1)
Output: t5.***********, t5.***********
-> Hash (cost=1.24..1.24 rows=1 width=8) (actual time=0.011..0.011 rows=1
loops=1)
Output: t8.***********, t8.***********
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on t8 (cost=0.00..1.24 rows=1 width=8) (actual time=0.008..0.008
rows=1 loops=1)
Output: t8.***********, t8.***********
Filter: (t8.sm_f_id = 8)
Rows Removed by Filter: 18
-> Materialize (cost=0.00..14.88 rows=2 width=4) (actual time=0.001..0.001
rows=1 loops=1994)
Output: t7.***********
-> Seq Scan on t7 (cost=0.00..14.88 rows=2 width=4) (actual time=0.017..
0.019 rows=1 loops=1)
Output: t7.***********
Filter: (t7.f_id = 8)
Rows Removed by Filter: 9
Query Identifier: -4992889262453896709
Planning Time: 1.762 ms
Execution Time: 140786.468 ms
בתאריך יום ג׳, 18 במרץ 2025 ב-16:16 מאת Andrei Lepikhov <
lepihov@gmail.com>:
On 3/18/25 12:34, PG Bug reporting form wrote:
Query has a specific WHERE filter for tiny lookup table "t8" (19 rows)
which
generates an execution plan which has two "Nested Loop" steps instead of
"Hash Join" steps which makes the query finish after 90 seconds (andspiking
the server CPU when being ran by dozens of sessions simultaneously)
instead
of ~3 seconds (when commenting out that specific WHERE filter clause I
mentioned below):May you provide EXPLAIN ANALYZE VERBOSE in standard text output format?
--
regards, Andrei Lepikhov
--
בברכה,
יוני שדה
On 19/3/2025 14:34, Yoni Sade wrote:
I would like to add that commenting out that WHERE clause doesn't work
for all of this query variations so I've found another workaround for
now ("LIMIT 1000000" in *bold*)to make the optimizer better estimate the cardinality of the subquery
and change the plan to a better one:*
I wouldn't say it is the ultimate root of the problem, but using
generate_series, especially with non-constant parameters, seems harmful
for the query planning: we can't predict the number of tuples and don't
have any statistics on that set. So, your LIMIT works like a hint that
helps decide how massive a set of tuples it will generate and seems to
be the only option possible.
--
regards, Andrei Lepikhov