Extended statistics improvement: multi-column MCV missing values

Started by Enrique Sánchez19 days ago7 messageshackers
Jump to latest
#1Enrique Sánchez
enriqueesanchz@gmail.com

Hi,

Postgres only uses multi-column MCVs when the value we are looking for is
in the list. If not, it falls back into individual independent statistics
to estimate selectivity.
However, a miss in a multi-column MCV list still yields valuable
information that it currently throws away: we know that the combination's
frequency is strictly bounded by the frequency of the last (least common)
item in that MCV list.

Test case
=========
```
-- 1. Create a minimal table
DROP TABLE IF EXISTS planner_trap;
CREATE TABLE planner_trap (
col_a integer,
col_b integer,
col_c integer,
sort_col integer
);

-- 2. 1 becomes the most common value (MCV) for all three columns.
INSERT INTO planner_trap (col_a, col_b, col_c, sort_col)
SELECT
(pow(random(), 5) * 100)::integer + 1,
(pow(random(), 5) * 10)::integer + 1,
(pow(random(), 5) * 50)::integer + 1,
i
FROM generate_series(1, 100000) AS i;

-- 3. Create indexes
CREATE INDEX idx_planner_trap_sort ON planner_trap(sort_col);
CREATE INDEX idx_planner_trap_a_b_c ON planner_trap(col_a, col_b, col_c);

-- 4. Make the exact combination (1, 1, 1) yield zero rows. This ensures it
won't appear
-- in the MCV list, even though the value '1' remains the most common for
each individual column.
UPDATE planner_trap
SET col_a = 2
WHERE col_a = 1 AND col_b = 1 AND col_c = 1;

-- 5. Create MCV statistics
CREATE STATISTICS (mcv) on col_a, col_b, col_c FROM planner_trap;
ANALYZE planner_trap;
```

The planner multiplies the individual selectivities and overestimates the
row count (5909; actual=0):
```
postgres=# EXPLAIN ANALYZE SELECT * FROM planner_trap WHERE col_a = 1 AND
col_b = 1 AND col_c = 1 ORDER BY sort_col DESC LIMIT 10;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..7.72 rows=10 width=16) (actual time=14.632..14.633
rows=0.00 loops=1)
Buffers: shared hit=16355 dirtied=1
-> Index Scan Backward using idx_planner_trap_sort on planner_trap
(cost=0.29..4386.99 rows=5909 width=16) (actual time=14.630..14.630
rows=0.00 loops=1)
Filter: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
Rows Removed by Filter: 100000
Index Searches: 1
Buffers: shared hit=16355 dirtied=1
Planning:
Buffers: shared hit=33 read=1
Planning Time: 0.478 ms
Execution Time: 14.651 ms
```

1. Cap selectivity with the last MCV item's frequency
=====================================================
Applying the last MCV cap here, Postgres estimates 117 rows: 0.001167 *
100000 = 117
```
postgres=# SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on
(oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname =
'planner_trap_col_a_col_b_col_c_stat' and index = 99;
index | values | nulls | frequency | base_frequency
-------+----------+---------+-----------------------+----------------
99 | {1,1,31} | {f,f,f} | 0.0011666666666666668 | 0.001049409536
```

making postgres to choose a better plan:
```
Limit (cost=300.72..300.75 rows=10 width=16) (actual time=0.045..0.046
rows=0.00 loops=1)
Buffers: shared hit=3 read=2
-> Sort (cost=300.72..301.02 rows=117 width=16) (actual
time=0.043..0.044 rows=0.00 loops=1)
Sort Key: sort_col DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=3 read=2
-> Bitmap Heap Scan on planner_trap (cost=5.78..298.19 rows=117
width=16) (actual time=0.026..0.027 rows=0.00 loops=1)
Recheck Cond: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
Buffers: shared read=2
-> Bitmap Index Scan on idx_planner_trap_a_b_c
(cost=0.00..5.76 rows=117 width=0) (actual time=0.019..0.020 rows=0.00
loops=1)
Index Cond: ((col_a = 1) AND (col_b = 1) AND (col_c =
1))
Index Searches: 1
Buffers: shared read=2
Planning:
Buffers: shared hit=51 read=8 dirtied=4
Planning Time: 0.742 ms
Execution Time: 0.094 ms
```

2. Estimate selectivity as Postgres does for single-column values not in
MCVs
=============================================================================
While that significantly improves estimations, we could mirror what
Postgres already does for individual MCVs. Quote from the official
documentation:

The approach is to use the fact that the value is not in the list,

combined with the knowledge of the frequencies for all of the MCVs:

That is, add up all the frequencies for the MCVs and subtract them from

one, then divide by the number of other distinct values.

To achieve this, we need to store an ndistinct estimation alongside the
MCVs that can be used for partial or entire column match.

P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) -
MCV_list_size)

```
postgres=# SELECT sum(m.frequency) FROM pg_statistic_ext join
pg_statistic_ext_data on (oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname =
'planner_trap_col_a_col_b_col_c_stat';
sum
---------------------
0.39456666666666645

postgres=# SELECT stxkeys AS k, stxdndistinct AS nd
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts2';
k | nd

-------+---------------------------------------------------
1 2 3 | [...{"attributes": [1, 2, 3], "ndistinct": 8511}]
```

Then (1 - 0.39456666666666645) / (8511 - 100) = 0.000071981; 0.000071981 *
100000 = 7 rows
```
Limit (cost=30.30..30.32 rows=7 width=16) (actual time=0.035..0.036
rows=0.00 loops=1)
Buffers: shared hit=2
-> Sort (cost=30.30..30.32 rows=7 width=16) (actual time=0.033..0.034
rows=0.00 loops=1)
Sort Key: sort_col DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=2
-> Bitmap Heap Scan on planner_trap (cost=4.38..30.20 rows=7
width=16) (actual time=0.028..0.029 rows=0.00 loops=1)
Recheck Cond: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
Buffers: shared hit=2
-> Bitmap Index Scan on idx_planner_trap_a_b_c
(cost=0.00..4.38 rows=7 width=0) (actual time=0.022..0.022 rows=0.00
loops=1)
Index Cond: ((col_a = 1) AND (col_b = 1) AND (col_c =
1))
Index Searches: 1
Buffers: shared hit=2
Planning Time: 0.192 ms
Execution Time: 0.057 ms
```

I think this is a cheap way to prevent bad estimations. The storage
overhead of adding an ndistinct field is trivial compared to the MCV list
itself, and the O(1) arithmetic during planning adds no measurable
overhead. I look forward to your feedback before drafting a patch.

Best,
Enrique.

#2Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Enrique Sánchez (#1)
Re: Extended statistics improvement: multi-column MCV missing values

Hi Enrique,

On 5/18/26 19:09, Enrique Sánchez wrote:

Postgres only uses multi-column MCVs when the value we are looking for
is in the list. If not, it falls back into individual independent
statistics to estimate selectivity.
However, a miss in a multi-column MCV list still yields valuable
information that it currently throws away: we know that the
combination's frequency is strictly bounded by the frequency of the
last (least common) item in that MCV list.

LGTM. If the multicolumn MCV statistics exists and the clause
combination is absent from the MCV-list, we can use the least frequent
MCV item as an upper bound. BTW, this only applies to AND-clauses.

2. Estimate selectivity as Postgres does for single-column values not
in MCVs
=============================================================================
While that significantly improves estimations, we could mirror what
Postgres already does for individual MCVs. Quote from the official
documentation:

The approach is to use the fact that the value is not in the list,

combined with the knowledge of the frequencies for all of the MCVs:

That is, add up all the frequencies for the MCVs and subtract them

from one, then divide by the number of other distinct values.

To achieve this, we need to store an ndistinct estimation alongside
the MCVs that can be used for partial or entire column match.

P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) -
MCV_list_size)

...

I think this is a cheap way to prevent bad estimations. The storage
overhead of adding an ndistinct field is trivial compared to the MCV
list itself, and the O(1) arithmetic during planning adds no
measurable overhead. I look forward to your feedback before drafting a
patch.

For this, the ndistinct extended statistics already exist. If both MCV
and ndistinct statistics are present on the same column set, the formula
is correct. There are already places in the code that compute ndistinct
for columns without extended ndistinct statistics (see
estimate_num_groups) - but it is worth thinking carefully about whether
the added complexity is justified before going down that path.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

#3Enrique Sánchez
enriqueesanchz@gmail.com
In reply to: Ilia Evdokimov (#2)
Re: Extended statistics improvement: multi-column MCV missing values

Hi Ilia,

1. Cap selectivity with the last MCV item's frequency

I've attached a draft patch. It's split into 4 commits to make it easier to
review. It adds the MCV cap for AND (equality, IN, ANY) clauses. Looking
forward to your feedback.

On 5/19/26 0:13, Ilia Evdokimov wrote:

Postgres only uses multi-column MCVs when the value we are looking for is
in the list. If not, it falls back into individual independent statistics
to estimate selectivity.
However, a miss in a multi-column MCV list still yields valuable
information that it currently throws away: we know that the combination's
frequency is strictly bounded by the frequency of the last (least common)
item in that MCV list.

LGTM. If the multicolumn MCV statistics exists and the clause combination
is absent from the MCV-list, we can use the least frequent MCV item as an
upper bound. BTW, this only applies to AND-clauses.

Given that the inclusion-exclusion principle (P(A OR B) = P(A) + P(B) - P(A
AND B)) is used to calculate OR paths, we could use the same cap to improve
the overlap estimation (P(A AND B)).

2. Estimate selectivity as Postgres does for single-column values not in

MCVs

=============================================================================
While that significantly improves estimations, we could mirror what
Postgres already does for individual MCVs. Quote from the official
documentation:

The approach is to use the fact that the value is not in the list,

combined with the knowledge of the frequencies for all of the MCVs:

That is, add up all the frequencies for the MCVs and subtract them from

one, then divide by the number of other distinct values.

To achieve this, we need to store an ndistinct estimation alongside the
MCVs that can be used for partial or entire column match.

P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) -
MCV_list_size)
...
I think this is a cheap way to prevent bad estimations. The storage
overhead of adding an ndistinct field is trivial compared to the MCV list
itself, and the O(1) arithmetic during planning adds no measurable
overhead. I look forward to your feedback before drafting a patch.

For this, the ndistinct extended statistics already exist. If both MCV and
ndistinct statistics are present on the same column set, the formula is
correct. There are already places in the code that compute ndistinct for
columns without extended ndistinct statistics (see estimate_num_groups) -
but it is worth thinking carefully about whether the added complexity is
justified before going down that path.

I think that the implementation would look similar to the attached patch.
The only added complexity is getting the ndistinct estimation. There are 2
ways:
- Rely on the extended ndistinct statistic if it exists
- Calculate the ndistinct(col_a, col_b, col_c) statistic as part of the
MCV. The storage cost is negligible compared to the MCV list and it keeps
the MCV statistic complete, so it doesn't need to check if the ndistinct
statistic exists. That would be a change on the MCV struct, meaning a
change in the mcvlist->type.

I can start working on a proposal for this second part. Let me know what
you think.

Best regards,
Enrique.

Attachments:

v1-0004-Extend-multi-column-MCV-cap-to-AND-clauses-inside-OR.patchtext/x-patch; charset=US-ASCII; name=v1-0004-Extend-multi-column-MCV-cap-to-AND-clauses-inside-OR.patchDownload+90-18
v1-0001-Cap-selectivity-when-values-are-not-in-multi-column-.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Cap-selectivity-when-values-are-not-in-multi-column-.patchDownload+140-4
v1-0002-Add-support-for-IN-ANY-clauses-in-multi-column-MCV-c.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Add-support-for-IN-ANY-clauses-in-multi-column-MCV-c.patchDownload+52-15
v1-0003-Extend-multicolumn-MCV-cap-to-partial-IN-ANY-matches.patchtext/x-patch; charset=US-ASCII; name=v1-0003-Extend-multicolumn-MCV-cap-to-partial-IN-ANY-matches.patchDownload+12-13
#4Chengpeng Yan
chengpeng_yan@outlook.com
In reply to: Enrique Sánchez (#1)
Re: Extended statistics improvement: multi-column MCV missing values

Hi,

On May 19, 2026, at 00:09, Enrique Sánchez <enriqueesanchz@gmail.com> wrote:

Hi,

Postgres only uses multi-column MCVs when the value we are looking for is in the list. If not, it falls back into individual independent statistics to estimate selectivity.
However, a miss in a multi-column MCV list still yields valuable information that it currently throws away: we know that the combination's frequency is strictly bounded by the frequency of the last (least common) item in that MCV list.

1. Cap selectivity with the last MCV item's frequency
=====================================================
Applying the last MCV cap here, Postgres estimates 117 rows: 0.001167 * 100000 = 117

2. Estimate selectivity as Postgres does for single-column values not in MCVs
=============================================================================
While that significantly improves estimations, we could mirror what Postgres already does for individual MCVs. Quote from the official documentation:

The approach is to use the fact that the value is not in the list, combined with the knowledge of the frequencies for all of the MCVs:
That is, add up all the frequencies for the MCVs and subtract them from one, then divide by the number of other distinct values.

To achieve this, we need to store an ndistinct estimation alongside the MCVs that can be used for partial or entire column match.

P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) - MCV_list_size)

I think this is a cheap way to prevent bad estimations. The storage overhead of adding an ndistinct field is trivial compared to the MCV list itself, and the O(1) arithmetic during planning adds no measurable overhead. I look forward to your feedback before drafting a patch.

Thanks for working on this.

I think the general idea is a good one: an MCV miss still carries useful
information, and it seems worth using that information instead of
falling back entirely to independent estimates.

I have not done a full code review of the patch yet, so this is mostly
feedback on the design and scope.

I think the important first boundary is the clause shape, not the choice
between the cap and the ndistinct-based average estimate. The strongest
case is a top-level AND condition with equality clauses for all columns
in the multivariate MCV statistic, where the queried value combination
is absent from the MCV list. In that case the miss is useful negative
information.

For that same case, the least-MCV-frequency cap is a useful bound and
fallback: the combination should not be estimated as more frequent than
the least frequent item kept in the MCV list. If matching ndistinct
statistics exist for the same column set, the ndistinct-based average
estimate also seems valid, as Ilia noted earlier in this thread. I do
not think the first patch needs to make `CREATE STATISTICS ... (mcv)`
store an ndistinct value. It can use matching ndistinct statistics when
they exist, while changing what an MCV statistic contains seems like a
separate decision.

I am also not sure IN/ANY should be included in the first version. For
that case we first have to define the distinct semantic full groups
represented by the clause, not just look at raw array elements. For
example, `a = 0 AND b IN (99, 99)` represents one full group, not two,
and NULL or empty arrays need similar care. That issue exists whether
the result is used as a cap or as an ndistinct-based average estimate.

Similarly, I would leave OR for a follow-up patch. As discussed, that
path would apply the cap while estimating the `P(A AND B)` part of an OR
estimate. That is a different estimation problem from the basic
top-level AND miss case, and probably needs its own design discussion.

So my preference would be to start with the smallest case:
1. Full-dimensional top-level AND equality miss, using the
ndistinct-based average estimate when matching ndistinct statistics
exist, with the least-MCV frequency as an upper bound; otherwise
using the cap alone.
2. Later patches for IN/ANY and OR handling.

That would make the first patch easier to reason about and review, while
still leaving room for the stronger cases later.

Those are just my initial thoughts, so I may be missing some details in
the patch.

--
Best regards,
Chengpeng Yan

#5Zsolt Parragi
zsolt.parragi@percona.com
In reply to: Chengpeng Yan (#4)
Re: Extended statistics improvement: multi-column MCV missing values

Hello

I am also not sure IN/ANY should be included in the first version.

I also only focused on testing the "basic" functionality.

The strongest
case is a top-level AND condition with equality clauses for all columns

I see one issue related to that with the current patch, having the same amount of conditions as the table doesn't mean that we have conditions for all columns.

See the following script that showcases a significant underestimate when we use one column twice, and another 0 times in the conditions:

CREATE TABLE u (a int, b int, c int);
INSERT INTO u SELECT 0, (g % 5) + 1, g FROM generate_series(1, 50000) g;
INSERT INTO u SELECT (g % 30) + 10, (g % 30) + 10, (g % 30) + 10
FROM generate_series(1, 30000) g;
CREATE STATISTICS su (mcv) ON a, b, c FROM u;
ANALYZE u;
SELECT count(*) FROM u WHERE a = 0 AND b = ANY(ARRAY[1,2,3]); -- 30000
EXPLAIN SELECT * FROM u WHERE a = 0 AND b = ANY(ARRAY[1,2,3]); -- ~18500, same as master
EXPLAIN SELECT * FROM u WHERE a = 0 AND a = ANY(ARRAY[0,9]) AND b = ANY(ARRAY[1,2,3]); -- ~5200 vs ~11500 on master

#6Enrique Sánchez
enriqueesanchz@gmail.com
In reply to: Chengpeng Yan (#4)
Re: Extended statistics improvement: multi-column MCV missing values

Hi, thank you for having a look.

On Jun 01, 2026 at 08:32, Chengpeng Yan (<chengpeng_yan@outlook.com>) wrote:

I think the important first boundary is the clause shape, not the choice
between the cap and the ndistinct-based average estimate. The strongest
case is a top-level AND condition with equality clauses for all columns
in the multivariate MCV statistic, where the queried value combination
is absent from the MCV list. In that case the miss is useful negative
information.

For that same case, the least-MCV-frequency cap is a useful bound and
fallback: the combination should not be estimated as more frequent than
the least frequent item kept in the MCV list. If matching ndistinct
statistics exist for the same column set, the ndistinct-based average
estimate also seems valid, as Ilia noted earlier in this thread. I do
not think the first patch needs to make `CREATE STATISTICS ... (mcv)`
store an ndistinct value. It can use matching ndistinct statistics when
they exist, while changing what an MCV statistic contains seems like a
separate decision.

Makes sense, I agree that changing what an MCV statistic contains is a
separate decision.

I am also not sure IN/ANY should be included in the first version. For
that case we first have to define the distinct semantic full groups
represented by the clause, not just look at raw array elements. For
example, `a = 0 AND b IN (99, 99)` represents one full group, not two,

Regarding array deduplication, i.e. `b in (99, 99)` Postgres does not check
if there are repeated elements right now, and uses them as different
elements to estimate; e.g.

```
postgres=# CREATE TABLE psql_hackers(a int, b int);
CREATE TABLE
postgres=# INSERT INTO psql_hackers SELECT a, b FROM generate_series(1,
100) a, generate_series(1, 100) b;
INSERT 0 9900
postgres=# EXPLAIN SELECT * FROM psql_hackers WHERE a IN (2);
QUERY PLAN
----------------------------------------------------------------
Seq Scan on psql_hackers (cost=0.00..167.75 rows=100 width=8)
Filter: (a = 2)
(2 rows)

postgres=# EXPLAIN SELECT * FROM psql_hackers WHERE a IN (2, 2);
QUERY PLAN
----------------------------------------------------------------
Seq Scan on psql_hackers (cost=0.00..167.75 rows=200 width=8)
Filter: (a = ANY ('{2,2}'::integer[]))
(2 rows)
```
I think we should be consistent with what Postgres already does now. It
seems like a potential future patch.

and NULL or empty arrays need similar care. That issue exists whether

the result is used as a cap or as an ndistinct-based average estimate.

NULLs and empty arrays can appear in a MCV entry so they should be treated
the same.

So, for the basic functionality we need to check that:
- All MCV dimensions are used in the clause
- All clauses are one of:
- Equality (=)
- IS NULL
- Bool operation: (= TRUE), (= FALSE)

We also need to check that there are no expressions in the clause, as they
could refer to multiple values e.g. `mod(a, 7) = 0`

Similarly, I would leave OR for a follow-up patch. As discussed, that
path would apply the cap while estimating the `P(A AND B)` part of an OR
estimate. That is a different estimation problem from the basic
top-level AND miss case, and probably needs its own design discussion.

Yes, I've been investigating further and it makes more sense to have a
follow-up discussion later.

So my preference would be to start with the smallest case:

1. Full-dimensional top-level AND equality miss, using the
ndistinct-based average estimate when matching ndistinct statistics
exist, with the least-MCV frequency as an upper bound; otherwise
using the cap alone.
2. Later patches for IN/ANY and OR handling.

That would make the first patch easier to reason about and review, while
still leaving room for the stronger cases later.

I've attached the v2 patch, covering only the full-dimensional top-level
AND equality miss.
Right now it only caps to the least-MCV frequency, but I will send a
follow-up patch with the ndistinct part asap.

Best regards,
Enrique.

Attachments:

v2-0001-Cap-selectivity-when-values-are-not-in-multi-colu.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Cap-selectivity-when-values-are-not-in-multi-colu.patchDownload+138-4
#7Enrique Sánchez
enriqueesanchz@gmail.com
In reply to: Zsolt Parragi (#5)
Re: Extended statistics improvement: multi-column MCV missing values

Hi, thank you for reviewing it.

Tue, 02 Jun 2026 at 0:18, Zsolt Parragi (<zsolt.parragi@percona.com>) wrote:

The strongest
case is a top-level AND condition with equality clauses for all columns

I see one issue related to that with the current patch, having the
same amount of conditions as the table doesn't mean that we have
conditions for all columns.

I've fixed this in the v2 patch attached above (also attached here for
completeness). It focuses on the basic full-dimensional top-level AND
equality miss
Now we construct a Bitmapset with the columns the clause covers and compare
it to `stat->keys` that contains the columns covered by the MCV statistic.

Best regards,
Enrique.

Attachments:

v2-0001-Cap-selectivity-when-values-are-not-in-multi-colu.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Cap-selectivity-when-values-are-not-in-multi-colu.patchDownload+138-4