increasing effective_cache_size slows down join queries by a factor of 4000x

Started by A Shaposhnikov · about 4 years ago · 14 messages · general
#1 A Shaposhnikov
artyom@gmail.com

I recently moved a Postgres DB to a more powerful server with 1TB of
RAM instead of the previous 64GB. To my surprise, after running the
tuning on the new server, the performance of joins deteriorated by
4000x compared to the old server. I carefully examined all of the
changes and found the culprit:

If I use effective_cache_size > 25GB, the query plan used is this:

Limit  (cost=124.12..590.33 rows=1000 width=205) (actual time=42326.662..42336.212 rows=1000 loops=1)
  ->  Nested Loop  (cost=124.12..31909018.46 rows=68443040 width=205) (actual time=42326.660..42336.127 rows=1000 loops=1)
        ->  Merge Join  (cost=124.12..30710356.97 rows=68443040 width=169) (actual time=42326.613..42332.764 rows=1000 loops=1)
              Merge Cond: (d.id = dc.data_id)
              ->  Nested Loop  (cost=1.00..31036282.72 rows=58785023 width=165) (actual time=0.042..5.533 rows=854 loops=1)
                    ->  Index Scan using data_pkey on data t  (cost=0.57..4787030.00 rows=58785023 width=131) (actual time=0.023..0.526 rows=854 loops=1)
                          Index Cond: (id > 205284974)
                    ->  Index Scan using user_pkey on data_user u  (cost=0.43..0.45 rows=1 width=42) (actual time=0.005..0.005 rows=1 loops=854)
                          Index Cond: (id = d.user_id)
              ->  Index Only Scan using data_class_pkey on data_class ta  (cost=0.57..4935483.78 rows=216964862 width=8) (actual time=0.018..35022.908 rows=151321889 loops=1)
                    Heap Fetches: 151321889
        ->  Index Scan using class_id_index on class a  (cost=0.00..0.02 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1000)
              Index Cond: (id = dc.class_id)
Planning Time: 4.114 ms
Execution Time: 42336.397 ms

It is 4000x slower than the plan chosen with a lower
effective_cache_size, which uses index lookups instead of the merge join:

Limit  (cost=1.57..4832.30 rows=1000 width=205) (actual time=0.081..10.457 rows=1000 loops=1)
  ->  Nested Loop  (cost=1.57..330629805.46 rows=68443040 width=205) (actual time=0.080..10.378 rows=1000 loops=1)
        ->  Nested Loop  (cost=1.57..267793481.39 rows=68443040 width=169) (actual time=0.065..7.496 rows=1000 loops=1)
              ->  Nested Loop  (cost=1.00..100917823.18 rows=58785023 width=165) (actual time=0.040..5.424 rows=854 loops=1)
                    ->  Index Scan using data_pkey on data t  (cost=0.57..21427806.53 rows=58785023 width=131) (actual time=0.024..0.482 rows=854 loops=1)
                          Index Cond: (id > 205284974)
                    ->  Index Scan using user_pkey on data_user u  (cost=0.43..1.35 rows=1 width=42) (actual time=0.005..0.005 rows=1 loops=854)
                          Index Cond: (id = d.user_id)
              ->  Index Only Scan using data_class_pkey on data_class ta  (cost=0.57..2.80 rows=4 width=8) (actual time=0.002..0.002 rows=1 loops=854)
                    Index Cond: (data_id = d.id)
                    Heap Fetches: 1000
        ->  Index Scan using class_id_index on class a  (cost=0.00..0.92 rows=1 width=44) (actual time=0.002..0.002 rows=1 loops=1000)
              Index Cond: (id = dc.class_id)
Planning Time: 5.074 ms
Execution Time: 10.614 ms

query:

explain analyze
select d.time as time, d.id as id, u.username as username, a.query_symbol as query_symbol
from data as d, data_user as u, class as a, data_class as dc
where dc.class_id = a.id and dc.data_id = d.id and d.user_id = u.id
  and d.id > 205284974
order by d.id
limit 1000;

I found a way to fix it by creating ndistinct statistics on the
largest join table:

create statistics stat_data_class (ndistinct) on data_id, class_id
from data_class;
analyze data_class;
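To confirm the statistics object was built and see what it collected, it can be read back from the catalog; a sketch, assuming PostgreSQL 12+ (the pg_stats_ext view does not exist in older releases):

```sql
-- Inspect the extended-statistics object created above.
-- n_distinct holds the estimated distinct counts per column combination.
SELECT statistics_schemaname, statistics_name, n_distinct
FROM pg_stats_ext
WHERE statistics_name = 'stat_data_class';
```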

Question: are there other ways to hint PostgreSQL to use the indexes
instead of the merge-join plan, without artificially lowering the
memory reported in effective_cache_size or creating the additional
statistics on the table?

Thank you,

-Art

#2 Michael Lewis
mlewis@entrata.com
In reply to: A Shaposhnikov (#1)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

What does the row estimate look like on the scan of data table with that
statistic in place? Anytime the stats give a mis-estimate this far off, I
wouldn't expect that plans would be optimal except by luck.

Index Scan using data_pkey on data t  (cost=0.57..21427806.53 rows=58785023 width=131) (actual time=0.024..0.482 rows=854 loops=1)
    Index Cond: (id > 205284974)

#3 A Shaposhnikov
artyom@gmail.com
In reply to: Michael Lewis (#2)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

The row estimate became ~1000x smaller with the stat in place, so it
looks like the planner grossly miscalculates query plans without the
stats for large tables representing M:M relations.


#4 Imre Samu
pella.samu@gmail.com
In reply to: A Shaposhnikov (#1)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

Question: are there other ways to give PostgreSQL a hint

What is your pg version?

Maybe with pg_hint_plan extension ( https://github.com/ossc-db/pg_hint_plan
)
"pg_hint_plan makes it possible to tweak PostgreSQL execution plans using
so-called "hints" in SQL comments, like /*+ SeqScan(a) */."
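For the query in this thread, such a hint might look like the sketch below. The hint names (NestLoop, IndexOnlyScan) are from the pg_hint_plan documentation; the aliases d/dc and the index name come from the posted plans, and whether these particular hints actually reproduce the fast plan would need testing:

```sql
/*+ NestLoop(d dc) IndexOnlyScan(dc data_class_pkey) */
explain analyze
select d.time, d.id, u.username, a.query_symbol
from data as d, data_user as u, class as a, data_class as dc
where dc.class_id = a.id and dc.data_id = d.id and d.user_id = u.id
  and d.id > 205284974
order by d.id
limit 1000;
```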

regards,
Imre


#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: A Shaposhnikov (#3)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

On 2/2/22 22:10, Artyom Shaposhnikov wrote:

the row estimate became ~1000x smaller with the stat in place, so it
looks like it grossly miscalculates the query plans without the stats
for large tables representing M:M relations.

Well, if the estimates are significantly off (and 3 orders of magnitude
certainly qualifies), then all bets are off. There's no magical option
that'd fix planning in such conditions.

Ultimately, fixing the estimates (e.g. by creating extended statistics)
is the right "fix" as it gives the optimizer the information needed to
pick the right plan.
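Besides the ndistinct kind used earlier in the thread, CREATE STATISTICS can also collect functional-dependency and multivariate MCV statistics on the same columns; a sketch (the object name stat_data_class_full is made up here, and the mcv kind requires PostgreSQL 12+):

```sql
-- Collect all three kinds of extended statistics on the join columns.
CREATE STATISTICS stat_data_class_full (ndistinct, dependencies, mcv)
    ON data_id, class_id FROM data_class;
ANALYZE data_class;  -- the object is only populated by ANALYZE
```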

On Wed, Feb 2, 2022 at 11:47 AM Michael Lewis <mlewis@entrata.com> wrote:

What does the row estimate look like on the scan of data table with that statistic in place? Anytime the stats give a mis-estimate this far off, I wouldn't expect that plans would be optimal except by luck.

Index Scan using data_pkey on data t (cost=0.57..21427806.53 rows=58785023 width=131) (actual time=0.024..0.482 rows=854 loops=1)
Index Cond: (id > 205284974)

It'd be interesting to see the plans without the LIMIT, as that makes
the "actual" values low simply by terminating early.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 A Shaposhnikov
artyom@gmail.com
In reply to: Tomas Vondra (#5)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

I made a mistake yesterday when I claimed that the created statistics
changed the row counts in the estimates; it did not. I looked at the
wrong query. In the correct query plan the row estimate still differs
from the actual by many orders of magnitude:

Nested Loop  (cost=1.01..27584834.53 rows=59608439 width=164) (actual time=0.047..2.723 rows=854 loops=1)

The planner completely ignores the LIMIT in the SQL statement when
doing the row estimates; the estimates do not change when I change it.
The business logic really needs only 1000 rows in the result, I never
need the whole table. The query always runs fast when the limit is
<= 46; when the limit is >= 47, it sometimes chooses to do a merge
join on two tables with hundreds of millions of rows instead of using
the indexes, and the runtime difference is 4000x.

I migrated the data to the latest Postgres 14.1. Both versions run
either the slow plan or the fast plan seemingly at random. The only
reliable way to make it choose the fast plan is to decrease
effective_cache_size to a value 20x lower than the memory available in
the system. Dropping and recreating the statistics on the join table
makes no difference in the estimated row counts and query plans.
PostgreSQL seems to be caching something internally and choosing the
query plans at random: sometimes it is fast, sometimes 4000x slower,
without any changes in configuration or statistics. Such randomness is
clearly unacceptable. I think I will have to try the "hint" extension
suggested by Imre to make it consistently choose the fast plan.
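A standard session-local alternative to the hint extension is to disable merge joins for just the problematic statement, leaving the rest of the workload on default planner settings; a sketch:

```sql
BEGIN;
SET LOCAL enable_mergejoin = off;  -- planner falls back to nested-loop/hash joins
-- ... run the problematic SELECT ... ORDER BY d.id LIMIT 1000 here ...
COMMIT;  -- SET LOCAL is reverted automatically at transaction end
```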


#7 Vijaykumar Jain
vijaykumarjain.github@gmail.com
In reply to: A Shaposhnikov (#6)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x


A similar concern was raised on Slack, where a shared plan also showed
that a lower effective_cache_size resulted in faster execution:

"Hey all, I have been diagnosing a slow running query and found that it
speeds up when the effective_cache_size is reduced. It's set at ~21.5gb,
and lowering to 10gb or under massively improves the query. Plans are
linked below. You can see the difference is 33 - 42, where the order of
execution results in a bunch more rows having to be filtered out later
in the first plan. Everything I've read suggests that a bigger
effective_cache_size should be better (assuming it's within the
recommended boundaries of total memory of course). Can anyone offer any
insight into why this might happen, and what a good workaround would be?"

https://explain.depesz.com/s/VsaY
https://explain.depesz.com/s/nW3d

I am not sure, though, that effective_cache_size is the culprit. If
there are bad estimates, even a simple query like the one below
produces a poor plan. I did try to read costsize.c
(https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/costsize.c;
I am not a developer, but I try to read the English in the code) to
understand how effective_cache_size alters the cost and changes the
chosen path, but I could not simulate it on my laptop with small
tables and limited resources.

I know 1TB is a huge setup, but can you run vacuumdb and analyze all
the tables in question (maybe reindex concurrently if required), then
check whether the row stats are still off in the plan and whether
reducing effective_cache_size still improves the execution time? I am
not an expert; I am just trying to figure out whether this is a red
herring.

I also play with some of the planner GUCs described in "19.7. Query
Planning" (https://www.postgresql.org/docs/13/runtime-config-query.html)
to check whether enabling/disabling some of them makes a big difference
in execution, and then try to zero in on what needs an immediate vacuum,
reindex, or join-order fix, or bumped resources if required to use more
cache.

postgres=# create table t(id int primary key, col1 int);
CREATE TABLE
postgres=# alter table t set (autovacuum_enabled = false,
toast.autovacuum_enabled = false);
ALTER TABLE

postgres=# insert into t select x, 1 from generate_series(1, 1000000) x;
INSERT 0 1000000

postgres=# \x
Expanded display is off.

postgres=# do $$
declare i int;
begin
for i in 1..10 loop
update t set col1 = i::int;
commit;
end loop;
end; $$;
DO
---this is the default plan picked by the optimizer (table still not analyzed)

postgres=# explain analyze select * from t where id < 100;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..1472242.50 rows=28995800 width=8) (actual time=198.740..4849.556 rows=99 loops=1)
   Filter: (id < 100)
   Rows Removed by Filter: 999901
 Planning Time: 1.296 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.194 ms, Inlining 4.383 ms, Optimization 13.051 ms, Emission 6.954 ms, Total 25.581 ms
 Execution Time: 4850.870 ms
(9 rows)

--play smart, disable seqscan
postgres=# set enable_seqscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=1.312..2.313 rows=99 loops=1)
   Recheck Cond: (id < 100)
   Heap Blocks: exact=32
   ->  Bitmap Index Scan on t_pkey  (cost=0.00..228440.94 rows=28995800 width=0) (actual time=1.295..1.296 rows=126 loops=1)
         Index Cond: (id < 100)
 Planning Time: 0.054 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 7.309 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 7.309 ms
 Execution Time: 9.715 ms
(11 rows)

#rule out caching effects: re-enable seqscan
postgres=# set enable_seqscan TO 1;
SET
postgres=# explain analyze select * from t where id < 100;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..1472242.50 rows=28995800 width=8) (actual time=196.485..4778.942 rows=99 loops=1)
   Filter: (id < 100)
   Rows Removed by Filter: 999901
 Planning Time: 0.052 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.286 ms, Inlining 1.389 ms, Optimization 4.462 ms, Emission 2.765 ms, Total 8.902 ms
 Execution Time: 4779.314 ms
(9 rows)

#again disable seqscan
postgres=# set enable_seqscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=0.028..0.101 rows=99 loops=1)
   Recheck Cond: (id < 100)
   Heap Blocks: exact=32
   ->  Bitmap Index Scan on t_pkey  (cost=0.00..228440.94 rows=28995800 width=0) (actual time=0.017..0.017 rows=126 loops=1)
         Index Cond: (id < 100)
 Planning Time: 0.047 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.834 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.834 ms
 Execution Time: 0.978 ms
(11 rows)

#disable jit (even faster)
postgres=# set jit TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=0.030..0.115 rows=99 loops=1)
   Recheck Cond: (id < 100)
   Heap Blocks: exact=32
   ->  Bitmap Index Scan on t_pkey  (cost=0.00..228440.94 rows=28995800 width=0) (actual time=0.018..0.018 rows=126 loops=1)
         Index Cond: (id < 100)
 Planning Time: 0.053 ms
 Execution Time: 0.137 ms
(7 rows)

#disable bitmap scan (faster still)
postgres=# set enable_bitmapscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.44..2057998.94 rows=28995800 width=8) (actual time=0.018..0.147 rows=99 loops=1)
   Index Cond: (id < 100)
 Planning Time: 0.053 ms
 Execution Time: 0.166 ms
(4 rows)

#now clean up dead tuples, update stats, and reset all gucs
postgres=# vacuum full t; -- note: full not required, 'vacuum analyze t' would suffice too
VACUUM
postgres=# set enable_bitmapscan TO default;
SET
postgres=# set enable_seqscan TO default;
SET
postgres=# set jit to default;
SET
postgres=# explain analyze select * from t where id < 100;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=6243.76..14835.42 rows=333333 width=8) (actual time=0.028..0.042 rows=99 loops=1)
   Recheck Cond: (id < 100)
   Heap Blocks: exact=3
   ->  Bitmap Index Scan on t_pkey  (cost=0.00..6160.42 rows=333333 width=0) (actual time=0.017..0.018 rows=99 loops=1)
         Index Cond: (id < 100)
 Planning Time: 0.819 ms
 Execution Time: 0.061 ms
(7 rows)

postgres=# analyze t;
ANALYZE

#the default plan after vacuum analyze is the fastest
postgres=# explain analyze select * from t where id < 100;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.42..10.62 rows=92 width=8) (actual time=0.005..0.029 rows=99 loops=1)
   Index Cond: (id < 100)
 Planning Time: 0.159 ms
 Execution Time: 0.045 ms
(4 rows)

#8 A Shaposhnikov
artyom@gmail.com
In reply to: Vijaykumar Jain (#7)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

The random_page_cost = 1.2 and seq_page_cost = 1 in my setup are the
same as before. I also ran vacuum analyze on all tables. I have no
idea why the optimizer thinks that a merge join of two 200M-row tables
is faster than 50 index lookups, other than that it "thinks" in-memory
merges are very low cost regardless of size. Another guess is that it
almost ignores the LIMIT parameter or uses it in some incorrect way.
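To rule out a stale or unexpected setting, the planner cost parameters in effect can be read back from pg_settings; a small sketch:

```sql
-- Values the planner is actually using in this session
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('effective_cache_size', 'random_page_cost', 'seq_page_cost');
```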

On Thu, Feb 3, 2022 at 12:31 PM Vijaykumar Jain
<vijaykumarjain.github@gmail.com> wrote:

Show quoted text

On Fri, 4 Feb 2022 at 01:03, A Shaposhnikov <artyom@gmail.com> wrote:

I made a mistake yesterday claiming that the created statistics
changed the row counts in the estimates - it did not - I looked at
the wrong query yesterday. In the correct query plan the row estimate
still differs from the actual by many orders of magnitude:

Nested Loop (cost=1.01..27584834.53 rows=59608439 width=164) (actual
time=0.047..2.723 rows=854 loops=1)

It completely ignores the "limit" in SQL statement when doing the row
estimates - the estimates do not change when I change it. The business
logic really needs only 1000 rows in the result, I never need the
whole table in the result. The query always runs fast if the limit is
<= 46, when the limit >=47, it sometimes chooses to do a merge join on
2 tables with hundreds of millions of rows instead of using the
indexes. The runtime difference is 4000x.

I migrated the data to the latest postgres 14.1. Both versions run
either the slow plan or the fast plan seemingly at random. The only
reliable way to make it choose the fast plan is to decrease the
effective_cache_size to a value 20x lower than the memory available in
the system. Dropping and creating the statistics on the join table
makes no difference in the estimated row counts and query plans.
PostgreSql seems to be caching something internally and choosing the
query plans at random - sometimes it is fast, sometimes 4000x slower
without any changes in configuration or statistics. The runtime
difference is 4000x. Such randomness is clearly unacceptable. I think
I would have to try to use the "hint" extension suggested by Imre to
make it consistently choose the fast plan.

there was a similar concern raised on slack, where on shared plan (where less effective_cache_size resulted in faster execution)

"
Hey all, I have been diagnosing a slow running query and found that it speeds up when the effective_cache_size is reduced. It's set at ~21.5gb, and lowering to 10gb or under massively improves the query. Plans are here and here. You can see the difference is 33 - 42, where the order of execution results in a bunch more rows having to be filtered out later in the first plan. Everything I've read suggests that a bigger effective_cache_size should be better (assuming it's within the recommended boundaries of total memory of course). Can anyone offer any insight into why this might happen, and what a good workaround would be?
"

https://explain.depesz.com/s/VsaY
https://explain.depesz.com/s/nW3d

I am not sure, though effective_cache_size is the culprit. if there are bad estimates, even a simple query like below produces a poor plan.
i did try to check the code postgres/costsize.c at master · postgres/postgres (github.com) (I am not a developer but try to read the english from code) and tried to understand how effective_cache_sizealters the cost and changes the path, but could not simulate on my laptop with small tables and less resources.
but yeah, i know 1TB is a huge setup, but can run vaccumdb and analyze all the tables in question (maybe reindex concurrently if required) and check if the row stats are still off in the plan and reducing effective_cache_size improves the execution time? I am not an expert, but just try to figure out if this is a red herring.

i also play with some gucs as in PostgreSQL: Documentation: 13: 19.7. Query Planning to check if enabling/disabling some of gucs results in a big difference in execution and then try to zero in what needs immediate vaccum or reindex or join count etc or bumping resources if required to use more cache.

postgres=# create table t(id int primary key, col1 int);
CREATE TABLE
postgres=# alter table t set (autovacuum_enabled = false, toast.autovacuum_enabled = false);
ALTER TABLE

postgres=# insert into t select x, 1 from generate_series(1, 1000000) x;
INSERT 0 1000000

postgres=# \x
Expanded display is off.

postgres=# do $$
declare i int;
begin
for i in 1..10 loop
update t set col1 = i::int;
commit;
end loop;
end; $$;
DO
---this is the default plan picked up by optimizer (tables still not analyzed)

postgres=# explain analyze select * from t where id < 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..1472242.50 rows=28995800 width=8) (actual time=198.740..4849.556 rows=99 loops=1)
Filter: (id < 100)
Rows Removed by Filter: 999901
Planning Time: 1.296 ms
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.194 ms, Inlining 4.383 ms, Optimization 13.051 ms, Emission 6.954 ms, Total 25.581 ms
Execution Time: 4850.870 ms
(9 rows)

--play smart, disable seqscan
postgres=# set enable_seqscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=1.312..2.313 rows=99 loops=1)
Recheck Cond: (id < 100)
Heap Blocks: exact=32
-> Bitmap Index Scan on t_pkey (cost=0.00..228440.94 rows=28995800 width=0) (actual time=1.295..1.296 rows=126 loops=1)
Index Cond: (id < 100)
Planning Time: 0.054 ms
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 7.309 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 7.309 ms
Execution Time: 9.715 ms
(11 rows)

# rule out cache
postgres=# set enable_seqscan TO 1;
SET
postgres=# explain analyze select * from t where id < 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..1472242.50 rows=28995800 width=8) (actual time=196.485..4778.942 rows=99 loops=1)
Filter: (id < 100)
Rows Removed by Filter: 999901
Planning Time: 0.052 ms
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.286 ms, Inlining 1.389 ms, Optimization 4.462 ms, Emission 2.765 ms, Total 8.902 ms
Execution Time: 4779.314 ms
(9 rows)

#again disable seqscan
postgres=# set enable_seqscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=0.028..0.101 rows=99 loops=1)
Recheck Cond: (id < 100)
Heap Blocks: exact=32
-> Bitmap Index Scan on t_pkey (cost=0.00..228440.94 rows=28995800 width=0) (actual time=0.017..0.017 rows=126 loops=1)
Index Cond: (id < 100)
Planning Time: 0.047 ms
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.834 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.834 ms
Execution Time: 0.978 ms
(11 rows)

#disable jit (even faster)
postgres=# set jit TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=0.030..0.115 rows=99 loops=1)
Recheck Cond: (id < 100)
Heap Blocks: exact=32
-> Bitmap Index Scan on t_pkey (cost=0.00..228440.94 rows=28995800 width=0) (actual time=0.018..0.018 rows=126 loops=1)
Index Cond: (id < 100)
Planning Time: 0.053 ms
Execution Time: 0.137 ms
(7 rows)

#disable bitmap scan (more faster)
postgres=# set enable_bitmapscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Index Scan using t_pkey on t (cost=0.44..2057998.94 rows=28995800 width=8) (actual time=0.018..0.147 rows=99 loops=1)
Index Cond: (id < 100)
Planning Time: 0.053 ms
Execution Time: 0.166 ms
(4 rows)

#now clean up dead tuples, analyze to update stats, and reset all GUCs
postgres=# vacuum full t; -- note: FULL is not required, 'vacuum analyze t' would suffice too.
VACUUM
postgres=# set enable_bitmapscan TO default;
SET
postgres=# set enable_seqscan TO default;
SET
postgres=# set jit to default;
SET
postgres=# explain analyze select * from t where id < 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=6243.76..14835.42 rows=333333 width=8) (actual time=0.028..0.042 rows=99 loops=1)
Recheck Cond: (id < 100)
Heap Blocks: exact=3
-> Bitmap Index Scan on t_pkey (cost=0.00..6160.42 rows=333333 width=0) (actual time=0.017..0.018 rows=99 loops=1)
Index Cond: (id < 100)
Planning Time: 0.819 ms
Execution Time: 0.061 ms
(7 rows)

postgres=# analyze t;
ANALYZE

#the default plan after vacuum analyze is the fastest.
postgres=# explain analyze select * from t where id < 100;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Index Scan using t_pkey on t (cost=0.42..10.62 rows=92 width=8) (actual time=0.005..0.029 rows=99 loops=1)
Index Cond: (id < 100)
Planning Time: 0.159 ms
Execution Time: 0.045 ms
(4 rows)

#9Michael Lewis
mlewis@entrata.com
In reply to: A Shaposhnikov (#8)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

I would guess that autovacuum is analyzing the table and causing the stats
to change which is resulting in different estimates that result in
different plans. Unless you can get the estimate much more accurate, you
won't get far with expecting a stable plan that performs well.

How large is data_class_pkey? If you run a query like this, how far off are the
estimates?

explain analyze
select d.time as time,d.id as id, a.query_symbol as query_symbol

from
data as d
join data_class as dc ON dc.data_id = d.id
join class as a ON dc.class_id = a.id
where
d.id > 205284974
order by d.id
limit 1000;

If you run 'analyze data(id);' and then run the query again, do you get a
better estimate? Have you tried adjusting default_statistics_target? Are you
running the default value for random_page_cost with SSDs?

I'm seeing Index Only Scan nodes with a high number of heap fetches, so it
seems like you would benefit from a vacuum to update the
pg_class.relallvisible value.
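
As a concrete sketch of the checks suggested above (standard PostgreSQL
commands; the table and column names are taken from the thread, and the
statistics-target value is only illustrative):

```sql
-- Raise the sampling target on the join key, re-gather stats, and
-- compare the planner's row estimate against the actual row count.
ALTER TABLE data ALTER COLUMN id SET STATISTICS 1000;
ANALYZE data (id);
EXPLAIN ANALYZE SELECT * FROM data WHERE id > 205284974;

-- Check the visibility map: relallvisible far below relpages explains
-- a high Heap Fetches count in an Index Only Scan.
SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname IN ('data', 'data_class');

-- On SSDs, random_page_cost is commonly lowered from its default of 4.0,
-- e.g. SET random_page_cost = 1.1; (try at session level first).
```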

#10A Shaposhnikov
artyom@gmail.com
In reply to: A Shaposhnikov (#1)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

Tomas,

thank you! The query:

select 1 from data as d, data_class as dc
where dc.data_id = d.id and d.id > 205284974
and dc.data_id > 205284974 -- new condition
order by d.id
limit 1000;

totally solved it - it is now fast under all conditions! I thought
that the optimizer would be able to infer it itself.

Thank you

On Thu, Feb 3, 2022 at 4:24 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 2/3/22 20:32, A Shaposhnikov wrote:

I made a mistake yesterday claiming that the created statistics
changed the row counts in the estimates - it did not - I looked at
the wrong query yesterday. In the correct query plan the row estimate
still differs from the actual by many orders of magnitude:

Nested Loop (cost=1.01..27584834.53 rows=59608439 width=164) (actual
time=0.047..2.723 rows=854 loops=1)

It completely ignores the "limit" in SQL statement when doing the row
estimates - the estimates do not change when I change it. The business
logic really needs only 1000 rows in the result, I never need the
whole table in the result. The query always runs fast if the limit is
<= 46, when the limit >=47, it sometimes chooses to do a merge join on
2 tables with hundreds of millions of rows instead of using the
indexes. The runtime difference is 4000x.

Yes, it ignores LIMIT while calculating the estimates - the estimates
below LIMIT are as if the query executed to completion. But the "actual"
values are affected by LIMIT, because the execution terminates early. So
you can't compare these two values and conclude the estimates are off.

Let me illustrate this on an example:

create table t (a int, b int);
insert into t select i, i from generate_series(1,1000000) s(i);
create index on t (a,b);
vacuum analyze t;

explain analyze select b from t order by a limit 10;
QUERY PLAN
-----------------------------------------------------------------------
Limit (cost=0.42..0.79 rows=10 width=8)
(actual time=0.018..0.033 rows=10 loops=1)
-> Index Only Scan using t_a_b_idx on t
(cost=0.42..36214.93 rows=1000000 width=8)
(actual time=0.016..0.021 rows=10 loops=1)
Heap Fetches: 10
Planning Time: 0.047 ms
Execution Time: 0.049 ms

The database has *perfect* stats in this case and the estimates are spot
on too. But comparing rows=1000000 and rows=10 would lead to the bogus
conclusion that the estimates are wrong.

Which is why I suggested getting an explain for the query without the
limit clause.

I migrated the data to the latest postgres 14.1. Both versions run
either the slow plan or the fast plan seemingly at random. The only
reliable way to make it choose the fast plan is to decrease the
effective_cache_size to a value 20x lower than the memory available in
the system. Dropping and creating the statistics on the join table
makes no difference in the estimated row counts and query plans.
PostgreSql seems to be caching something internally and choosing the
query plans at random - sometimes it is fast, sometimes 4000x slower
without any changes in configuration or statistics. The runtime
difference is 4000x. Such randomness is clearly unacceptable. I think
I would have to try to use the "hint" extension suggested by Imre to
make it consistently choose the fast plan.

Does the plan change depending on the parameters in the query? Or does it
change even if you run the same query (including the same parameters in
all the conditions)?

One possible explanation is that there's some sort of correlation
between parameters. Limit assumes the matching rows are distributed
uniformly in the input, and if that's not true (due to some sort of
correlation), it may pick the wrong plan.

In this case there are joins so the correlation may be more complicated
- e.g. through a join condition, between (dc.class_id, dc.data_id) or
something like that.

This part of the plan hints that this might be the case, because it forces
scanning a huge part of the data_class table just to get to the
interesting data_id values.

-> Index Only Scan using data_class_pkey on data_class ta
(cost=0.57..4935483.78 rows=216964862 width=8)
(actual time=0.018..35022.908 rows=151321889 loops=1)
Heap Fetches: 151321889

That's hard to fix, because there's no way to tell the database about
such dependencies. I'd try two things:

(1) Try putting the same condition on ID on data_class too. I mean, if
you do

select 1 from data as d, data_class as dc
where dc.data_id = d.id and d.id > 205284974
order by d.id
limit 1000;

then clearly dc.data_id > 205284974, but the database does not realize
it and will scan the index from start for mergejoin. Nestedloop does
direct lookup for individual ID values, so it doesn't have this issue.

So try this:

select 1 from data as d, data_class as dc
where dc.data_id = d.id and d.id > 205284974
and dc.data_id > 205284974 -- new condition
order by d.id
limit 1000;

This is a special case of correlation between the tables, possibly
confusing the LIMIT estimation. We are smarter about similar cases (with
conditions on join keys), but only for equalities.

The other thing I'd try is vacuuming the data_class. The high number of
heap fetches suggests the visibility map is in pretty poor shape, which
is not great for index-only scans.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: A Shaposhnikov (#10)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

On 2/4/22 05:21, A Shaposhnikov wrote:

Tomas,

thank you! The query:

select 1 from data as d, data_class as dc
where dc.data_id = d.id and d.id > 205284974
and dc.data_id > 205284974 -- new condition
order by d.id
limit 1000;

totally solved it - it is now fast under all conditions! I thought
that the optimizer would be able to infer it itself.

Unfortunately, the optimizer is not that smart - we can do that for
equality conditions, but not for other operators. There was actually a
thread [1] exploring the possibility of extending this to inequalities,
but it went nowhere so far. That thread also explains why it's done only
for equality operators: in short, it's fairly expensive, makes costing of
joins more difficult, and most queries can't benefit from it (because
conditions on join keys are not that common).

BTW what does the final query plan look like? Is it using the merge join
or the nested loop? I wonder if this might be formulated as a costing
issue, pushing the planner to use the nested loop.

[1] /messages/by-id/CAFQUnFhqkWuPCwQ1NmHYrisHJhYx4DoJak-dV+FcjyY6scooYA@mail.gmail.com

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12A Shaposhnikov
artyom@gmail.com
In reply to: Tomas Vondra (#11)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

I started using the latest postgres 14.2 and the query plans there for a
simplified query joining just 2 tables look like:

explain analyze select t.*, ta.* from team as t, team_aliases as ta where
ta.team_id = t.id and t.id > 200000000 order by t.id limit 1000;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=81.33..331.82 rows=1000 width=183) (actual time=31328.561..31329.565 rows=1000 loops=1)
-> Merge Join (cost=81.33..17727020.90 rows=70768783 width=183) (actual time=31328.560..31329.498 rows=1000 loops=1)
Merge Cond: (t.id = ta.team_id)
-> Index Scan using team_pkey on team t (cost=0.57..11382381.88 rows=78693167 width=175) (actual time=0.016..0.466 rows=854 loops=1)
Index Cond: (id > 205284974)
-> Index Scan using fki_team_fk on team_aliases ta (cost=0.57..4893726.20 rows=218604096 width=8) (actual time=0.010..22172.405 rows=151321830 loops=1)
Planning Time: 0.472 ms
Execution Time: 31329.654 ms
(8 rows)

Now if I add the second condition as Tomas suggested, it speeds the query
up by a factor of 15,000:

explain analyze select t.*, ta.* from team as t, team_aliases as ta where
ta.team_id = t.id and t.id > 200000000 and ta.team_id > 200000000 order by
t.id limit 1000;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=81.33..720.48 rows=1000 width=183) (actual time=0.073..1.909 rows=1000 loops=1)
-> Merge Join (cost=81.33..13859983.71 rows=21685030 width=183) (actual time=0.072..1.756 rows=1000 loops=1)
Merge Cond: (t.id = ta.team_id)
-> Index Scan using team_pkey on team t (cost=0.57..11382381.88 rows=78693167 width=175) (actual time=0.016..0.695 rows=854 loops=1)
Index Cond: (id > 205284974)
-> Index Scan using fki_team_fk on team_aliases ta (cost=0.57..1896563.38 rows=66984851 width=8) (actual time=0.052..0.356 rows=1000 loops=1)
Index Cond: (team_id > 205284974)
Planning Time: 0.503 ms
Execution Time: 2.056 ms
(9 rows)

Interestingly I have a second PG 14.2 database, with identical table
definitions, but about 10% smaller row counts, and the exact same query
works fast there without the 2nd condition:

explain analyze select t.*, ta.* from team as t, team_aliases as ta where
ta.team_id = t.id and t.id > 200000000 order by t.id limit 1000;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.14..1700.75 rows=1000 width=168) (actual time=0.062..4.841 rows=1000 loops=1)
-> Nested Loop (cost=1.14..5685654.91 rows=3345265 width=168) (actual time=0.060..4.700 rows=1000 loops=1)
-> Index Scan using team_pkey on team t (cost=0.57..2366113.83 rows=2807531 width=160) (actual time=0.031..0.801 rows=888 loops=1)
Index Cond: (id > 205284974)
-> Index Scan using fki_team_fk on team_aliases ta (cost=0.57..1.14 rows=4 width=8) (actual time=0.003..0.004 rows=1 loops=888)
Index Cond: (team_id = t.id)
Planning Time: 0.559 ms
Execution Time: 4.953 ms

In both databases I ran "vacuum full analyze" on all tables. I am not
sure why it decides to apply the join condition so late in the plan when
the query runs slow. Basically, it seems to randomly decide to produce two
huge multimillion-row inputs and then merge them on the join condition,
instead of looping 1000 times using the indexes. Obviously, the optimizer
grossly miscalculates the costs - how can we help it?
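
One way to probe the costing directly (a standard diagnostic, not
something suggested in the thread) is to penalize merge joins for a
single session and compare the plans and estimated costs the planner
produces:

```sql
-- enable_* GUCs don't remove plan types; they add a large cost penalty,
-- so this is a diagnostic tool, not a production fix.
SET enable_mergejoin = off;

EXPLAIN ANALYZE
SELECT t.*, ta.*
FROM team AS t
JOIN team_aliases AS ta ON ta.team_id = t.id
WHERE t.id > 200000000
ORDER BY t.id
LIMIT 1000;

RESET enable_mergejoin;
```

If the nested-loop plan's estimated cost comes out close to the merge-join
plan's, the choice between them is essentially a coin flip that small
stats changes can tip either way.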

On Fri, Feb 4, 2022 at 9:01 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

On 2/4/22 05:21, A Shaposhnikov wrote:

Tomas,

thank you! The query:

select 1 from data as d, data_class as dc
where dc.data_id = d.id and d.id > 205284974
and dc.data_id > 205284974 -- new condition
order by d.id
limit 1000;

totally solved it - it is now fast under all conditions! I thought
that the optimizer would be able to infer it itself.

Unfortunately, the optimizer is not that smart - we can do that for
equality conditions, but not for other operators. There was actually a
thread [1] exploring a possibility to extend this to inequalities, but
it went nowhere so far. It also explains why it's done only for equality
operators. In short, it's fairly expensive, makes costing of joins more
difficult, and most queries can't benefit from it (because conditions on
join keys are not that common).

BTW what does the final query plan look like? Is it using the merge join
or the nested loop? I wonder if this might be formulated as a costing
issue, pushing the planner to use the nested loop.

[1] /messages/by-id/CAFQUnFhqkWuPCwQ1NmHYrisHJhYx4DoJak-dV+FcjyY6scooYA@mail.gmail.com

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Michael Lewis
mlewis@entrata.com
In reply to: A Shaposhnikov (#12)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

I think that, theoretically, if the planner sees a condition like
a.column1 > constant_value and it also has a join condition like
a.column1 = b.column2, then it could auto-generate the condition
b.column2 > constant_value. And of course > could also be <, <=, >=, or <>.

But I could be wrong, particularly with respect to NULLs and how = is
mostly the same as 'is not distinct from', except for 'select null = null'
being null and 'select null is not distinct from null' being true. Perhaps
if either or both columns have a not-null constraint, this could still be
done. Whether or not it should be is a separate question.
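
The NULL behavior described above can be checked directly in psql (both
statements are plain standard SQL):

```sql
-- '=' yields NULL (unknown) when either operand is NULL ...
SELECT NULL = NULL;
-- ... while IS NOT DISTINCT FROM treats two NULLs as equal:
SELECT NULL IS NOT DISTINCT FROM NULL;  -- returns true
```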

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: A Shaposhnikov (#12)
Re: increasing effective_cache_size slows down join queries by a factor of 4000x

On 2/15/22 01:06, A Shaposhnikov wrote:

Interestingly I have a second PG 14.2 database, with identical table
definitions, but about 10% smaller row counts, and the exact same query
works fast there without the 2nd condition:

Are you sure about the 10%? Because in the plans from the first machine
I see this:

-> Index Scan using team_pkey on team t (cost=0.57..11382381.88
rows=78693167 width=175) (actual time=0.016..0.695 rows=854 loops=1)

while the second machine does this:

-> Index Scan using team_pkey on team t (cost=0.57..2366113.83
rows=2807531 width=160) (actual time=0.031..0.801 rows=888 loops=1)

That's 2.8M vs. 78M, quite far from a "10% difference". Not sure about
the team_aliases table; that's impossible to say from the plans.

This may matter a lot, because we use effective_cache_size to calculate
the cache hit ratio for the query, with relation sizes as an input. So
smaller relations (or a larger effective_cache_size) mean cheaper random
I/O, hence a preference for a nested loop join.

The other thing is data distribution - that may matter too.
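
The planner-side effect of effective_cache_size is easy to observe per
session; it allocates no memory and changes no caching behavior, only the
cost estimates (the values below are illustrative, the query is the one
from the thread):

```sql
-- effective_cache_size only feeds the planner's estimate of how many
-- index pages are already cached.
SET effective_cache_size = '16GB';
EXPLAIN SELECT t.*, ta.* FROM team AS t
JOIN team_aliases AS ta ON ta.team_id = t.id
WHERE t.id > 200000000 ORDER BY t.id LIMIT 1000;

-- A much larger value changes the estimated cost of repeated index
-- probes, which can flip the merge join / nested loop choice.
SET effective_cache_size = '512GB';
EXPLAIN SELECT t.*, ta.* FROM team AS t
JOIN team_aliases AS ta ON ta.team_id = t.id
WHERE t.id > 200000000 ORDER BY t.id LIMIT 1000;

RESET effective_cache_size;
```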

IMO it's pointless to investigate this further - we know what's causing
the issue. The optimizer is oblivious to the fact that the merge join
will have to skip a large part of the second input, due to the implicit
condition. Notice that adding the condition changes the cost from:

Limit (cost=81.33..331.82 rows=1000 width=183) ...

to

Limit (cost=81.33..720.48 rows=1000 width=183) ...

So it seems *more* expensive than the first plan. Taken to the extreme,
the planner could theoretically have chosen the first plan (and delayed
the condition until after the join).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company