Improvement of var_eq_non_const()
Hi!
The comment in var_eq_non_const() says:
/*
* Search is for a value that we do not know a priori, but we will
* assume it is not NULL. Estimate the selectivity as non-null
* fraction divided by number of distinct values, so that we get a
* result averaged over all possible values whether common or
* uncommon. (Essentially, we are assuming that the not-yet-known
* comparison value is equally likely to be any of the possible
* values, regardless of their frequency in the table. Is that a good
* idea?)
*/
It seems workable, but not such a good idea, because it can cause significant
underestimation when the data distribution is non-uniform. This often leads the
planner to choose a nested loop instead of another join algorithm, and with the
underestimation it picks a wrong plan. Attached are a dump and a query from a
real application, clipped and obfuscated, sorry.
explain analyze
SELECT
COUNT(*)
FROM
t1
LEFT JOIN
t2
ON t2.b = t1.a
AND t2.c = 0 -- line of interest
;
With the 'line of interest' commented out everything works fine, but with it we
get another, much worse plan. Interestingly, the t2.c column contains only one
value (zero). The planner chooses a nested loop instead of a hash join and makes
a wrong estimation:
Index Only Scan using i1_t2 on t2 (.. rows=6 ..) (.. rows=3555 loops=2000)
Index Cond: ((c = 0) AND (b = t1.a))
It expected six rows, but the executor got 3555. That happens because
var_eq_non_const() assumes a uniform distribution of the t2.b column, which is
wrong here. This reduced example still performs acceptably, but the real query
contains many other joins connected to t2, and this planner mistake becomes a
huge performance issue, up to three orders of magnitude.
I'd like to suggest improving var_eq_non_const() by using knowledge of the MCV
list: estimate the selectivity as the quadratic mean of the non-null fraction
divided by the number of distinct values (as before) and the set of MCV
selectivities. The quadratic mean is used because it also accounts for the
squared deviation (error), and here it is desirable to compute an upper limit of
the estimate to prevent a wrong choice of, for example, a nested loop.
If a nested loop is forced and the patch is applied, it produces a much better
estimation:
Index Only Scan using i1_t2 on t2 (.. rows=3380 ..) (.. rows=3555 loops=2000)
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment: v1-0001-Estimate-the-selectivity-as-quadratic-mean-of-non.patch
From 715db4ec468da8dcedbc9d91e4b50f7408f5b025 Mon Sep 17 00:00:00 2001
From: Teodor Sigaev <teodor@sigaev.ru>
Date: Thu, 20 Feb 2025 19:00:02 +0300
Subject: [PATCH v1] Estimate the selectivity as quadratic mean of non-null
fraction divided by number of distinct values and set of MCV selectivities.
Use the quadratic mean because it also includes the squared deviation (error),
and here it would be nice to compute an upper limit of the estimation to
prevent a wrong choice of nested loop, for example.
---
src/backend/utils/adt/selfuncs.c | 43 +++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 14 deletions(-)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d3d1e485bb2..b0e8a21c923 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -501,29 +501,44 @@ var_eq_non_const(VariableStatData *vardata, Oid oproid, Oid collation,
/*
* Search is for a value that we do not know a priori, but we will
- * assume it is not NULL. Estimate the selectivity as non-null
- * fraction divided by number of distinct values, so that we get a
- * result averaged over all possible values whether common or
- * uncommon. (Essentially, we are assuming that the not-yet-known
- * comparison value is equally likely to be any of the possible
- * values, regardless of their frequency in the table. Is that a good
- * idea?)
+ * assume it is not NULL. Estimate the selectivity as the quadratic
+ * mean of the non-null fraction divided by the number of distinct
+ * values and the set of MCV selectivities. Use the quadratic mean
+ * because it also includes the squared deviation (error), and here we
+ * want an upper limit of the estimate to prevent a wrong choice of,
+ * e.g., a nested loop.
*/
selec = 1.0 - nullfrac;
ndistinct = get_variable_numdistinct(vardata, &isdefault);
if (ndistinct > 1)
selec /= ndistinct;
- /*
- * Cross-check: selectivity should never be estimated as more than the
- * most common value's.
- */
if (get_attstatsslot(&sslot, vardata->statsTuple,
STATISTIC_KIND_MCV, InvalidOid,
- ATTSTATSSLOT_NUMBERS))
+ ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS))
{
- if (sslot.nnumbers > 0 && selec > sslot.numbers[0])
- selec = sslot.numbers[0];
+ int i;
+ double sum_selec = 0.0;
+
+ /*
+ * Compute the quadratic mean, walking the array in reverse
+ * direction to avoid losing accuracy. We don't bother checking
+ * whether sslot.nnumbers is zero, because in that case we just
+ * get the same result, and zero is unlikely anyway.
+ */
+ for (i = sslot.nnumbers - 1; i >= 0; i--)
+ sum_selec += sslot.numbers[i] * sslot.numbers[i];
+
+ selec = sqrt((selec * selec + sum_selec) /
+ ((double)sslot.nnumbers + 1.0));
+
+ /*
+ * Cross-check: selectivity should never be estimated as
+ * more than the most common value's.
+ */
+ if (sslot.nnumbers > 0 && selec > sslot.numbers[0])
+ selec = sslot.numbers[0];
+
free_attstatsslot(&sslot);
}
}
--
2.47.1
Teodor Sigaev <teodor@sigaev.ru> writes:
I'd like to suggest to improve var_eq_non_const() by using knowledge of MCV and
estimate the selectivity as quadratic mean of non-null fraction divided by
number of distinct values (as it was before) and set of MCV selectivities.
What's the statistical interpretation of this calculation (that is,
the average MCV selectivity)? Maybe it's better, but without any
context it seems like a pretty random thing to do. In particular,
it seems like this could give radically different answers depending
on how many MCVs we chose to store, and I'm not sure we could argue
that the result gets more accurate with more MCVs stored.
regards, tom lane
On 20.02.2025 21:21, Tom Lane wrote:
Hi,
The arithmetic mean is not exactly the same as the root mean square
approach implemented by Teodor. The key difference is that the root mean
square is more influenced by the largest values in the distribution. The
further the data deviates from a uniform distribution, the less
representative a simple arithmetic mean becomes.
Teodor's idea seems quite useful to me because it ensures that
selectivity is now influenced by multiple significant values from the
MCV list, rather than just the single most frequent one. This should
lead to a more accurate selectivity estimate, better reflecting the
actual data distribution.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.