single table - fighting a seq scan

Started by Radoslav Nedyalkovalmost 6 years ago7 messagesgeneral
Jump to latest
#1Radoslav Nedyalkov
rnedyalkov@gmail.com

Hi Forum,
I'm scratching my head around the following case:

*te* is a 80M rows, 100GB table. It is a bare simple select over indexed
attribute of it.

EXPLAIN SELECT te.id FROM te WHERE te.current_pid IN (240900026,
240900027,
240900028,
-- 200 entries ...

Gather (cost=1000.00..61517367.85 rows=3870 width=8)
Workers Planned: 2
-> Parallel Seq Scan on te (cost=0.00..61515980.85 rows=1612 width=8)
Filter: (current_pid = ANY
('{240900026,240900027,...240901129}'::bigint[]))
Execution time is about 5 minutes

Reducing number of current_pids to 100 changes the plan and it does index
scan. (101 still does seq scan)

Index Scan using te_current_pid_idx on te (cost=0.57..731.26 rows=3832
width=8) (actual time=0.566..1.667 rows=600 loops=1)
Index Cond: (current_pid = ANY
('{240900026,240900027,...240900194}'::bigint[]))
Planning Time: 3.152 ms
Execution Time: 1.732 ms

Selecting 200 pids rewritten with CTE goes for index too.

EXPLAIN ANALYZE
WITH cte as (
select * from unnest(ARRAY[
240900026,
240900027,
240900028,
...
240901129
]))
SELECT te.id FROM te join cte on te.current_pid = cte.unnest;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.58..1097.83 rows=3847 width=8) (actual
time=0.882..14.927 rows=1468 loops=1)
CTE cte
-> Function Scan on unnest (cost=0.00..1.00 rows=100 width=4)
(actual time=0.025..0.043 rows=205 loops=1)
-> CTE Scan on cte (cost=0.00..2.00 rows=100 width=4) (actual
time=0.027..0.083 rows=205 loops=1)
-> Index Scan using te_current_pid_idx on te (cost=0.57..10.57 rows=38
width=16) (actual time=0.011..0.071 rows=7 loops=205)
Index Cond: (current_pid = cte.unnest)
Planning Time: 2.022 ms
Execution Time: 15.044 ms

I tried random_page_cost=1, a couple of combinations with very low
cpu_index_tuple_cost and cpu_operator_cost. Only managed to get an index
scan for a few more IN entries.
Did analyze. Bumped stats target for current_pid to 5000. Did not help.

I'm out of ideas. What is the right approach to solve this ?
Thank You!

Rado

#2Michael Lewis
mlewis@entrata.com
In reply to: Radoslav Nedyalkov (#1)
Re: single table - fighting a seq scan

rows=3832
rows=3870

Your estimate changed very little when you included 100 values vs 200
values. That is interesting to me.

What does the below query give you? How many of those 200 values are found
in the MCVs list? If n_distinct is low, and most of the values are NOT in
the most common value list, and the fraction of the table covered by the
MCVs is also low, then the planner will expect that each of the 200 values
represents some deceivingly high portion of the table.

You said there are 80 million rows, yes? That seems likely that ndistinct
and the MCVs list are not giving info very correlated with reality. You may
want to increase the minimum table size for sequential to kick in. I cannot
recall the name of that setting at the moment. You may also want to
increase stats target on that column at least, analyze, and explain the
query again.

SELECT
( SELECT SUM (x) FROM UNNEST (most_common_freqs) x ) frac_MCV,
tablename,
attname,
inherited,
null_frac,
n_distinct,
array_length(most_common_vals,1) n_mcv,
array_length(histogram_bounds,1) n_hist,
correlation,
*
FROM pg_stats
WHERE
schemaname = 'public'
AND tablename='te'
AND attname='current_pid';

Show quoted text
#3Radoslav Nedyalkov
rnedyalkov@gmail.com
In reply to: Michael Lewis (#2)
Re: single table - fighting a seq scan

Hi Michael,
full output from the query is attached.
here is the truncated lists version.

frac_mcv | 0.00306267
tablename | te
attname | current_pid
inherited | f
null_frac | 0.261823
n_distinct | 1.59236e+07
n_mcv | 560
n_hist | 5001
correlation | 0.995502
schemaname | public
tablename | te
attname | current_pid
inherited | f
null_frac | 0.261823
avg_width | 8
n_distinct | 1.59236e+07
most_common_vals |
{15026003,24951186,220707698,223344198,224236736,224355865,224359830,224371584,224380154,224382722,224388639,224390209,224394943,224396835,228259607,232148477,232173137,232379194,232729185,232913953,236304699,236797618,237355501,238860629,239082658}
most_common_freqs |
{4.46667e-05,1.93333e-05,4.66667e-06,4.66667e-06,...,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06}
histogram_bounds |
{13426761,13467316,13510844,..239215632,239302648,239371125,239466529,239532095,239571468,239611801,239634487}
correlation | 0.995502
most_common_elems | (null)
most_common_elem_freqs | (null)
elem_count_histogram | (null)

Thank You very much for the response.
I'll try with settings as you propose.

Rado

On Tue, Jul 14, 2020 at 9:46 PM Michael Lewis <mlewis@entrata.com> wrote:

Show quoted text

rows=3832
rows=3870

Your estimate changed very little when you included 100 values vs 200
values. That is interesting to me.

What does the below query give you? How many of those 200 values are found
in the MCVs list? If n_distinct is low, and most of the values are NOT in
the most common value list, and the fraction of the table covered by the
MCVs is also low, then the planner will expect that each of the 200 values
represents some deceivingly high portion of the table.

You said there are 80 million rows, yes? That seems likely that ndistinct
and the MCVs list are not giving info very correlated with reality. You may
want to increase the minimum table size for sequential to kick in. I cannot
recall the name of that setting at the moment. You may also want to
increase stats target on that column at least, analyze, and explain the
query again.

SELECT
( SELECT SUM (x) FROM UNNEST (most_common_freqs) x ) frac_MCV,
tablename,
attname,
inherited,
null_frac,
n_distinct,
array_length(most_common_vals,1) n_mcv,
array_length(histogram_bounds,1) n_hist,
correlation,
*
FROM pg_stats
WHERE
schemaname = 'public'
AND tablename='te'
AND attname='current_pid';

Attachments:

out.txttext/plain; charset=US-ASCII; name=out.txtDownload
#4Radoslav Nedyalkov
rnedyalkov@gmail.com
In reply to: Radoslav Nedyalkov (#3)
Re: single table - fighting a seq scan

Ah, I could have messed up the examples I gave. Row numbers are different.
Once again the plans , sorry about that.

-- 200 entries

Gather (cost=1000.00..106905910.97 rows=7893 width=8)
Workers Planned: 2
-> Parallel Seq Scan on te (cost=0.00..106904121.67 rows=3289 width=8)
Filter: (current_pid = ANY
('{240900026,240900027,240900028,240900029,240900030,240900031,240900032,240900033,240900034,240900035,240900036,240900037,240900038,240900039,240900040,240900041,240900042,240900043,240900044,240900045,240900046,240900047,240900048,240900049,240900050,240900051,240900052,240900053,240900054,240900055,240900056,240900057,240900058,240900059,240900060,240900061,240900062,240900063,240900064,240900065,240900066,240900067,240900068,240900069,240900070,240900071,240900072,240900073,240900074,240900075,240900076,240900077,240900078,240900079,240900080,240900081,240900082,240900083,240900084,240900085,240900086,240900165,240900087,240900166,240900088,240900167,240900089,240900168,240900090,240900169,240900091,240900170,240900092,240900171,240900093,240900172,240900094,240900173,240900905,240900174,240900175,240900176,240900177,240900178,240900179,240900180,240900181,240900182,240900183,240900184,240900185,240900186,240900187,240900188,240900189,240900190,240900191,240900192,240900193,240900194,240900195,240900196,240900197,240900198,240900199,240900906,240900907,240900908,240900909,240900910,240900911,240900912,240900913,240900914,240900915,240900916,240900917,240900918,240900919,240900920,240900921,240900922,240900923,240900924,240900925,240900926,240900927,240900928,240901048,240901053,240901054,240901055,240901056,240901057,240901058,240901059,240901060,240901061,240901062,240901063,240901064,240901065,240901066,240901067,240901068,240901069,240901070,240901071,240901072,240901073,240901074,240901075,240901076,240901077,240901078,240901079,240901080,240901081,240901082,240901083,240901084,240901085,240901086,240901087,240901088,240901089,240901090,240901091,240901092,240901093,240901094,240901095,240901096,240901097,240901098,240901099,240901100,240901101,240901102,240901103,240901104,240901105,240901106,240901107,240901108,240901109,240901110,240901111,240901112,240901113,240901114,240901115,240901116,240901117,240901118,240901119,240901120,240901121,240901122,240901123,240901124,240901125,240901126,240901127,240901128,240901129}'::bigint[]))
(4 rows)

Time: 5.261 ms
========================================================================

-- 100 entries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Index Scan using te_current_pid_idx on te (cost=0.57..731.26 rows=3832
width=8) (actual time=1.244..15.897 rows=600 loops=1)
Index Cond: (current_pid = ANY
('{240900026,240900027,240900028,240900029,240900030,240900031,240900032,240900033,240900034,240900035,240900036,240900037,240900038,240900039,240900040,240900041,240900042,240900043,240900044,240900045,240900046,240900047,240900048,240900049,240900050,240900051,240900052,240900053,240900054,240900055,240900056,240900057,240900058,240900059,240900060,240900061,240900062,240900063,240900064,240900065,240900066,240900067,240900068,240900069,240900070,240900071,240900072,240900073,240900074,240900075,240900076,240900077,240900078,240900079,240900080,240900081,240900082,240900083,240900084,240900085,240900086,240900165,240900087,240900166,240900088,240900167,240900089,240900168,240900090,240900169,240900091,240900170,240900092,240900171,240900093,240900172,240900094,240900173,240900905,240900174,240900175,240900176,240900177,240900178,240900179,240900180,240900181,240900182,240900183,240900184,240900185,240900186,240900187,240900188,240900189,240900190,240900191,240900192,240900193,240900194}'::bigint[]))
Planning Time: 3.430 ms
Execution Time: 15.954 ms
(4 rows)

Time: 20.257 ms

On Tue, Jul 14, 2020 at 10:31 PM Radoslav Nedyalkov <rnedyalkov@gmail.com>
wrote:

Show quoted text

Hi Michael,
full output from the query is attached.
here is the truncated lists version.

frac_mcv | 0.00306267
tablename | te
attname | current_pid
inherited | f
null_frac | 0.261823
n_distinct | 1.59236e+07
n_mcv | 560
n_hist | 5001
correlation | 0.995502
schemaname | public
tablename | te
attname | current_pid
inherited | f
null_frac | 0.261823
avg_width | 8
n_distinct | 1.59236e+07
most_common_vals |
{15026003,24951186,220707698,223344198,224236736,224355865,224359830,224371584,224380154,224382722,224388639,224390209,224394943,224396835,228259607,232148477,232173137,232379194,232729185,232913953,236304699,236797618,237355501,238860629,239082658}
most_common_freqs |
{4.46667e-05,1.93333e-05,4.66667e-06,4.66667e-06,...,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06}
histogram_bounds |
{13426761,13467316,13510844,..239215632,239302648,239371125,239466529,239532095,239571468,239611801,239634487}
correlation | 0.995502
most_common_elems | (null)
most_common_elem_freqs | (null)
elem_count_histogram | (null)

Thank You very much for the response.
I'll try with settings as you propose.

Rado

On Tue, Jul 14, 2020 at 9:46 PM Michael Lewis <mlewis@entrata.com> wrote:

rows=3832
rows=3870

Your estimate changed very little when you included 100 values vs 200
values. That is interesting to me.

What does the below query give you? How many of those 200 values are
found in the MCVs list? If n_distinct is low, and most of the values are
NOT in the most common value list, and the fraction of the table covered by
the MCVs is also low, then the planner will expect that each of the 200
values represents some deceivingly high portion of the table.

You said there are 80 million rows, yes? That seems likely that ndistinct
and the MCVs list are not giving info very correlated with reality. You may
want to increase the minimum table size for sequential to kick in. I cannot
recall the name of that setting at the moment. You may also want to
increase stats target on that column at least, analyze, and explain the
query again.

SELECT
( SELECT SUM (x) FROM UNNEST (most_common_freqs) x ) frac_MCV,
tablename,
attname,
inherited,
null_frac,
n_distinct,
array_length(most_common_vals,1) n_mcv,
array_length(histogram_bounds,1) n_hist,
correlation,
*
FROM pg_stats
WHERE
schemaname = 'public'
AND tablename='te'
AND attname='current_pid';

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Radoslav Nedyalkov (#4)
Re: single table - fighting a seq scan

Radoslav Nedyalkov <rnedyalkov@gmail.com> writes:

Ah, I could have messed up the examples I gave. Row numbers are different.
Once again the plans , sorry about that.

Given that it works at 100 entries and not 101, I can't escape the
suspicion that you're being burnt by predtest.c's MAX_SAOP_ARRAY_SIZE
limit. However, that only affects the planner's willingness to make
constraint proofs involving the large IN clause, and nothing you've
mentioned here explains why such a proof would be needed. Is there
something you're not telling us about this table's schema? (I'm
wondering if the index is partial, for instance, though one would
think that the CTE form of the query wouldn't work either if so.)

regards, tom lane

#6Radoslav Nedyalkov
rnedyalkov@gmail.com
In reply to: Tom Lane (#5)
Re: single table - fighting a seq scan

Shame on me. It's a partial index - *where is not null.*
Put the* is not null *predicate in place and planner always goes for index.
(tested with thousands of IN entries)
CTE version always goes for index too even *without **is not null , *which
led to a slight confusion.

Thanks Tom, Michael,
Best
Rado

On Wed, Jul 15, 2020 at 1:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Radoslav Nedyalkov <rnedyalkov@gmail.com> writes:

Ah, I could have messed up the examples I gave. Row numbers are

different.

Once again the plans , sorry about that.

Given that it works at 100 entries and not 101, I can't escape the
suspicion that you're being burnt by predtest.c's MAX_SAOP_ARRAY_SIZE
limit. However, that only affects the planner's willingness to make
constraint proofs involving the large IN clause, and nothing you've
mentioned here explains why such a proof would be needed. Is there
something you're not telling us about this table's schema? (I'm
wondering if the index is partial, for instance, though one would
think that the CTE form of the query wouldn't work either if so.)

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Radoslav Nedyalkov (#6)
Re: single table - fighting a seq scan

Radoslav Nedyalkov <rnedyalkov@gmail.com> writes:

Shame on me. It's a partial index - *where is not null.*
Put the* is not null *predicate in place and planner always goes for index.
(tested with thousands of IN entries)
CTE version always goes for index too even *without **is not null , *which
led to a slight confusion.

Ah. That's actually something we fixed in v12 (see [1]https://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=commitdiff&amp;h=65ce07e02). In the CTE
version, the planner can prove "x is not null" from "x = cte_value" even
without knowing what the CTE output value is, just on the basis that "="
is strict. In the IN form, it's likewise possible to prove "x is not
null" from "x IN (list)", but you need a special test to recognize that.
With a short IN list, the planner converts IN to "x = this OR x = that
OR x = the-other ..." and can make the proof from that formulation.
But we prevent it from trying that on long IN lists, because it'd eat
lots of cycles and perhaps not be able to prove the desired partial index
qual anyway.

regards, tom lane

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=commitdiff&amp;h=65ce07e02