Improve join selectivity estimation using extended statistics
Original thread is:
/messages/by-id/196f1e1a-5464-ed07-ab3c-0c9920564af7@postgrespro.ru
Following Yugo's advice, I have splitted this patch into two:
1. Extending auto_explain extension to generate extended statistics in
case of bad selectivity estimation.
2. Taken in account extended statistics when computing join selectivity.
Now this thread will contain only patches for join selectivity estimation.
However,
IIUC, the clausesel patch uses only functional dependencies statistics for
improving join, so my question was about possibility to consider MCV in the
clausesel patch.
Sorry, do not have idea right now how to use MCV for better estimation
of join selectivity.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
extended_statistic_join_selectivity-1.patchtext/x-patch; name=extended_statistic_join_selectivity-1.patchDownload
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf..5446ad5 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,8 @@
#include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/selfuncs.h"
+#include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
/*
* Data structure for accumulating info about possible range-query
@@ -56,6 +58,47 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
****************************************************************************/
/*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid: index of relation to which all considered attributes belong
+ * var: variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+ RelOptInfo* rel = find_base_rel(root, relid);
+ if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+ {
+ StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
+ &attnums, 1);
+ if (stat != NULL)
+ {
+ MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+ MVDependency *strongest = NULL;
+ int i;
+ for (i = 0; i < dependencies->ndeps; i++)
+ {
+ MVDependency *dependency = dependencies->deps[i];
+ int n_dep_vars = dependency->nattributes - 1;
+ /* Dependency implies attribute */
+ if (var->varattno == dependency->attributes[n_dep_vars])
+ {
+ while (--n_dep_vars >= 0
+ && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+ if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+ strongest = dependency;
+ }
+ }
+ if (strongest)
+ return strongest->degree;
+ }
+ }
+ return 0.0;
+}
+
+/*
* clauselist_selectivity -
* Compute the selectivity of an implicitly-ANDed list of boolean
* expression clauses. The list can be empty, in which case 1.0
@@ -129,6 +172,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
RangeQueryClause *rqlist = NULL;
ListCell *l;
int listidx;
+ Bitmapset *clauses_attnums = NULL;
+ int n_clauses_attnums = 0;
+ int innerRelid = varRelid;
/*
* If there's exactly one clause, just go directly to
@@ -139,6 +185,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
varRelid, jointype, sjinfo,
use_extended_stats);
+ if (innerRelid == 0 && sjinfo)
+ bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
/*
* Determine if these clauses reference a single relation. If so, and if
* it has extended statistics, try to apply those.
@@ -171,7 +220,6 @@ clauselist_selectivity_ext(PlannerInfo *root,
Node *clause = (Node *) lfirst(l);
RestrictInfo *rinfo;
Selectivity s2;
-
listidx++;
/*
@@ -204,6 +252,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
else
rinfo = NULL;
+
/*
* See if it looks like a restriction clause with a pseudoconstant on
* one side. (Anything more complicated than that might not behave in
@@ -215,6 +264,42 @@ clauselist_selectivity_ext(PlannerInfo *root,
OpExpr *expr = (OpExpr *) clause;
bool varonleft = true;
bool ok;
+ int oprrest = get_oprrest(expr->opno);
+
+ /* Try to take in account functional dependencies between attributes */
+ if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+ {
+ Var* var = (Var*)linitial(expr->args);
+ if (!IsA(var, Var) || var->varno != innerRelid)
+ {
+ var = (Var*)lsecond(expr->args);
+ }
+ if (IsA(var, Var) && var->varattno >= 0 && var->varno == innerRelid)
+ {
+ clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+ if (n_clauses_attnums++ != 0)
+ {
+ double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+ if (dep != 0.0)
+ {
+ /*
+ * Replace s2 with the conditional probability s2 given s1, computed
+ * using the formula P(b|a) = P(a,b) / P(a), which simplifies to
+ *
+ * P(b|a) = f * Min(P(a), P(b)) / P(a) + (1-f) * P(b)
+ *
+ * where P(a) = s1, the selectivity of the implying attributes, and
+ * P(b) = s2, the selectivity of the implied attribute.
+ */
+ if (s1 <= s2)
+ s1 *= dep + (1 - dep) * s2;
+ else
+ s1 *= dep * s2 / s1 + (1 - dep) * s2;
+ continue;
+ }
+ }
+ }
+ }
if (rinfo)
{
@@ -240,7 +325,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
* selectivity in generically. But if it's the right oprrest,
* add the clause to rqlist for later processing.
*/
- switch (get_oprrest(expr->opno))
+ switch (oprrest)
{
case F_SCALARLTSEL:
case F_SCALARLESEL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index aab06c7..74ceaf5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4790,6 +4790,30 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
return nrows;
}
+
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * clause_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{
+ ListCell* lc;
+ Bitmapset *attnums = NULL;
+ Index relid = var->varno;
+
+ foreach (lc, clause_vars)
+ {
+ Var* join_var = (Var*)lfirst(lc);
+ if (join_var->varno == relid && join_var->varattno >= 0)
+ attnums = bms_add_member(attnums, join_var->varattno);
+ }
+ return attnums ? find_var_dependency(root, relid, var, bms_add_member(attnums, var->varattno)) : 0.0;
+}
+
/*
* calc_joinrel_size_estimate
* Workhorse for set_joinrel_size_estimates and
@@ -4886,6 +4910,39 @@ calc_joinrel_size_estimate(PlannerInfo *root,
pselec = 0.0; /* not used, keep compiler quiet */
}
+ /* Try to take in account functional dependencies between attributes of clauses pushed-down to joined relations and
+ * retstrictlist clause. Right now we consider only case of restrictlist consists of one clause.
+ */
+ if (list_length(restrictlist) == 1)
+ {
+ RestrictInfo* rinfo = linitial(restrictlist);
+ Expr* clause = rinfo->clause;
+
+ Assert(IsA(rinfo, RestrictInfo));
+
+ if (is_opclause(clause))
+ {
+ OpExpr *expr = (OpExpr *) clause;
+ ListCell* lc;
+ List* join_vars = NULL;
+
+ /* Get list of all attributes in pushed-down clauses */
+ foreach (lc, outer_rel->baserestrictinfo)
+ join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+ foreach (lc, inner_rel->baserestrictinfo)
+ join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+
+ foreach (lc, expr->args)
+ {
+ Var *var = (Var*) lfirst(lc);
+ if (IsA(var, Var) && var->varattno >= 0)
+ {
+ double dep = var_depends_on(root, var, join_vars);
+ jselec = jselec*(1.0 - dep) + dep;
+ }
+ }
+ }
+ }
/*
* Basically, we multiply size of Cartesian product by selectivity.
*
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 0673887..f7d91e7 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -53,4 +53,6 @@ extern void CommuteOpExpr(OpExpr *clause);
extern Query *inline_set_returning_function(PlannerInfo *root,
RangeTblEntry *rte);
+extern double find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums);
+
#endif /* CLAUSES_H */
Hi Konstantin,
Thanks for working on this! Using extended statistics to improve join
cardinality estimates was definitely on my radar, and this patch seems
like a good start.
I had two basic ideas about how we might improve join estimates:
(a) use per-table extended statistics to estimate join conditions
(b) invent multi-table extended statistics (requires inventing how to
sample the tables in a coordinated way, etc.)
This patch aims to do (a) which is perfectly reasonable - I think we can
achieve significant improvements this way. I have some ideas about (b),
but it seems harder and for a separate thread/patch.
The patch includes some *very* interesting ideas, but I think it's does
them too late and at the wrong level of abstraction. I mean that:
1) I don't think the code in clausesel.c should deal with extended
statistics directly - it requires far too much knowledge about different
types of extended stats, what clauses are supported by them, etc.
Allowing stats on expressions will make this even worse.
Better do that in extended_stats.c, like statext_clauselist_selectivity.
2) in clauselist_selectivity_ext, functional dependencies are applied in
the part that processes remaining clauses, not estimated using extended
statistics. That seems a bit confusing, and I suspect it may lead to
issues - for example, it only processes the clauses incrementally, in a
particular order. That probably affects the result, because it affects
which functional dependencies we can apply.
In the example query that's not an issue, because it only has two Vars,
so it either can't apply anything (with one Var) or it can apply
everything (with two Vars). But with 3 or more Vars the order would
certainly matter, so it's problematic.
Moreover, it seems a bit strange that this considers dependencies only
on the inner relation. Can't that lead to issues with different join
orders producing different cardinality estimates?
I think a better approach would be to either modify the existing block
dealing with extended stats for a single relation to also handle join
conditions. Or perhaps we should invent a separate block, dealing with
*pairs* of relations? And it should deal with *all* join clauses for
that pair of relations at once, not one by one.
As for the exact implementation, I'd imagine we call overall logic to be
something like (for clauses on two joined relations):
- pick a subset of clauses with the same type of extended statistics on
both sides (MCV, ndistinct, ...), repeat until we can't apply more
statistics
- estimate remaining clauses either using functional dependencies or in
the regular (old) way
As for how to use other types of extended statistics, I think eqjoinsel
could serve as an inspiration. We should look for an MCV list and
ndistinct stats on both sides of the join (possibly on some subset of
clauses), and then do the same thing eqjoinsel does, just with multiple
columns.
Note: I'm not sure what to do when we find the stats only on one side.
Perhaps we should assume the other side does not have correlations and
use per-column statistics (seems reasonable), or maybe just not apply
anything (seems a bit too harsh).
Anyway, if there are some non-estimated clauses, we could try applying
functional dependencies similarly to what this patch does. It's also
consistent with statext_clauselist_selectivity - that also tries to
apply MCV lists first, and only then we try functional dependencies.
BTW, should this still rely on oprrest (e.g. F_EQSEL). That's the
selectivity function for restriction (non-join) clauses, so maybe we
should be looking at oprjoin when dealing with joins? Not sure.
One bit that I find *very* interesting is the calc_joinrel_size_estimate
part, with this comment:
/*
* Try to take in account functional dependencies between attributes
* of clauses pushed-down to joined relations and retstrictlist
* clause. Right now we consider only case of restrictlist consists of
* one clause.
*/
If I understand the comment and the code after it, it essentially tries
to apply extended statistics from both the join clauses and filters at
the relation level. That is, with a query like
SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10
we would be looking at statistics on t1(a,b), because we're interested
in estimating conditional probability distribution
P(t1.a = a? | t1.b = 10)
I think that's extremely interesting and powerful, because it allows us
to "restrict" the multi-column MCV lists, we could probably estimate
number of distinct "a" values in rows with "b=10" like:
ndistinct(a,b) / ndistinct(b)
and do various interesting stuff like this.
That will require some improvements to the extended statistics code (to
allow passing a list of conditions), but that's quite doable. I think
the code actually did something like that originally ;-)
Obviously, none of this is achievable for PG14, as we're in the middle
of the last CF. But if you're interested in working on this for PG15,
I'd love to cooperate on that.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11.03.2021 03:47, Tomas Vondra wrote:
Hi Konstantin,
Thanks for working on this! Using extended statistics to improve join
cardinality estimates was definitely on my radar, and this patch seems
like a good start.I had two basic ideas about how we might improve join estimates:
(a) use per-table extended statistics to estimate join conditions
(b) invent multi-table extended statistics (requires inventing how to
sample the tables in a coordinated way, etc.)This patch aims to do (a) which is perfectly reasonable - I think we can
achieve significant improvements this way. I have some ideas about (b),
but it seems harder and for a separate thread/patch.The patch includes some *very* interesting ideas, but I think it's does
them too late and at the wrong level of abstraction. I mean that:1) I don't think the code in clausesel.c should deal with extended
statistics directly - it requires far too much knowledge about different
types of extended stats, what clauses are supported by them, etc.
Allowing stats on expressions will make this even worse.Better do that in extended_stats.c, like statext_clauselist_selectivity.
2) in clauselist_selectivity_ext, functional dependencies are applied in
the part that processes remaining clauses, not estimated using extended
statistics. That seems a bit confusing, and I suspect it may lead to
issues - for example, it only processes the clauses incrementally, in a
particular order. That probably affects the result, because it affects
which functional dependencies we can apply.In the example query that's not an issue, because it only has two Vars,
so it either can't apply anything (with one Var) or it can apply
everything (with two Vars). But with 3 or more Vars the order would
certainly matter, so it's problematic.Moreover, it seems a bit strange that this considers dependencies only
on the inner relation. Can't that lead to issues with different join
orders producing different cardinality estimates?I think a better approach would be to either modify the existing block
dealing with extended stats for a single relation to also handle join
conditions. Or perhaps we should invent a separate block, dealing with
*pairs* of relations? And it should deal with *all* join clauses for
that pair of relations at once, not one by one.As for the exact implementation, I'd imagine we call overall logic to be
something like (for clauses on two joined relations):- pick a subset of clauses with the same type of extended statistics on
both sides (MCV, ndistinct, ...), repeat until we can't apply more
statistics- estimate remaining clauses either using functional dependencies or in
the regular (old) wayAs for how to use other types of extended statistics, I think eqjoinsel
could serve as an inspiration. We should look for an MCV list and
ndistinct stats on both sides of the join (possibly on some subset of
clauses), and then do the same thing eqjoinsel does, just with multiple
columns.Note: I'm not sure what to do when we find the stats only on one side.
Perhaps we should assume the other side does not have correlations and
use per-column statistics (seems reasonable), or maybe just not apply
anything (seems a bit too harsh).Anyway, if there are some non-estimated clauses, we could try applying
functional dependencies similarly to what this patch does. It's also
consistent with statext_clauselist_selectivity - that also tries to
apply MCV lists first, and only then we try functional dependencies.BTW, should this still rely on oprrest (e.g. F_EQSEL). That's the
selectivity function for restriction (non-join) clauses, so maybe we
should be looking at oprjoin when dealing with joins? Not sure.One bit that I find *very* interesting is the calc_joinrel_size_estimate
part, with this comment:/*
* Try to take in account functional dependencies between attributes
* of clauses pushed-down to joined relations and retstrictlist
* clause. Right now we consider only case of restrictlist consists of
* one clause.
*/If I understand the comment and the code after it, it essentially tries
to apply extended statistics from both the join clauses and filters at
the relation level. That is, with a query likeSELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10
we would be looking at statistics on t1(a,b), because we're interested
in estimating conditional probability distributionP(t1.a = a? | t1.b = 10)
I think that's extremely interesting and powerful, because it allows us
to "restrict" the multi-column MCV lists, we could probably estimate
number of distinct "a" values in rows with "b=10" like:ndistinct(a,b) / ndistinct(b)
and do various interesting stuff like this.
That will require some improvements to the extended statistics code (to
allow passing a list of conditions), but that's quite doable. I think
the code actually did something like that originally ;-)Obviously, none of this is achievable for PG14, as we're in the middle
of the last CF. But if you're interested in working on this for PG15,
I'd love to cooperate on that.regards
Hi Tomas,
Thank you for review of my patch.
My primary attention was to implement some kid of adaptive query
optimization based online_analyze extension and building extended
statistic on demand.
I have change clausesel.c because right now extended statistic is not
used for join selectivity estimation and manual or automatic adding such
statistic can help to
choose more efficient plan for queries with joins.
I agree wit you that it can be done in better way, handling more use cases.
I will be glad to cooperate with you in improving join selectivity
estimation using extended statistic.
On Mon, Mar 15, 2021 at 8:42 PM Konstantin Knizhnik <
k.knizhnik@postgrespro.ru> wrote:
On 11.03.2021 03:47, Tomas Vondra wrote:
Hi Konstantin,
Thanks for working on this! Using extended statistics to improve join
cardinality estimates was definitely on my radar, and this patch seems
like a good start.I had two basic ideas about how we might improve join estimates:
(a) use per-table extended statistics to estimate join conditions
(b) invent multi-table extended statistics (requires inventing how to
sample the tables in a coordinated way, etc.)This patch aims to do (a) which is perfectly reasonable - I think we can
achieve significant improvements this way. I have some ideas about (b),
but it seems harder and for a separate thread/patch.The patch includes some *very* interesting ideas, but I think it's does
them too late and at the wrong level of abstraction. I mean that:1) I don't think the code in clausesel.c should deal with extended
statistics directly - it requires far too much knowledge about different
types of extended stats, what clauses are supported by them, etc.
Allowing stats on expressions will make this even worse.Better do that in extended_stats.c, like statext_clauselist_selectivity.
2) in clauselist_selectivity_ext, functional dependencies are applied in
the part that processes remaining clauses, not estimated using extended
statistics. That seems a bit confusing, and I suspect it may lead to
issues - for example, it only processes the clauses incrementally, in a
particular order. That probably affects the result, because it affects
which functional dependencies we can apply.In the example query that's not an issue, because it only has two Vars,
so it either can't apply anything (with one Var) or it can apply
everything (with two Vars). But with 3 or more Vars the order would
certainly matter, so it's problematic.Moreover, it seems a bit strange that this considers dependencies only
on the inner relation. Can't that lead to issues with different join
orders producing different cardinality estimates?I think a better approach would be to either modify the existing block
dealing with extended stats for a single relation to also handle join
conditions. Or perhaps we should invent a separate block, dealing with
*pairs* of relations? And it should deal with *all* join clauses for
that pair of relations at once, not one by one.As for the exact implementation, I'd imagine we call overall logic to be
something like (for clauses on two joined relations):- pick a subset of clauses with the same type of extended statistics on
both sides (MCV, ndistinct, ...), repeat until we can't apply more
statistics- estimate remaining clauses either using functional dependencies or in
the regular (old) wayAs for how to use other types of extended statistics, I think eqjoinsel
could serve as an inspiration. We should look for an MCV list and
ndistinct stats on both sides of the join (possibly on some subset of
clauses), and then do the same thing eqjoinsel does, just with multiple
columns.Note: I'm not sure what to do when we find the stats only on one side.
Perhaps we should assume the other side does not have correlations and
use per-column statistics (seems reasonable), or maybe just not apply
anything (seems a bit too harsh).Anyway, if there are some non-estimated clauses, we could try applying
functional dependencies similarly to what this patch does. It's also
consistent with statext_clauselist_selectivity - that also tries to
apply MCV lists first, and only then we try functional dependencies.BTW, should this still rely on oprrest (e.g. F_EQSEL). That's the
selectivity function for restriction (non-join) clauses, so maybe we
should be looking at oprjoin when dealing with joins? Not sure.One bit that I find *very* interesting is the calc_joinrel_size_estimate
part, with this comment:/*
* Try to take in account functional dependencies between attributes
* of clauses pushed-down to joined relations and retstrictlist
* clause. Right now we consider only case of restrictlist consists of
* one clause.
*/If I understand the comment and the code after it, it essentially tries
to apply extended statistics from both the join clauses and filters at
the relation level. That is, with a query likeSELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10
we would be looking at statistics on t1(a,b), because we're interested
in estimating conditional probability distributionP(t1.a = a? | t1.b = 10)
I think that's extremely interesting and powerful, because it allows us
to "restrict" the multi-column MCV lists, we could probably estimate
number of distinct "a" values in rows with "b=10" like:ndistinct(a,b) / ndistinct(b)
and do various interesting stuff like this.
That will require some improvements to the extended statistics code (to
allow passing a list of conditions), but that's quite doable. I think
the code actually did something like that originally ;-)Obviously, none of this is achievable for PG14, as we're in the middle
of the last CF. But if you're interested in working on this for PG15,
I'd love to cooperate on that.regards
Hi Tomas,
Thank you for review of my patch.
My primary attention was to implement some kid of adaptive query
optimization based online_analyze extension and building extended
statistic on demand.
I have change clausesel.c because right now extended statistic is not
used for join selectivity estimation and manual or automatic adding such
statistic can help to
choose more efficient plan for queries with joins.
I agree wit you that it can be done in better way, handling more use cases.
I will be glad to cooperate with you in improving join selectivity
estimation using extended statistic.The patch does not compile, and needs your attention.
https://cirrus-ci.com/task/6397726985289728
clausesel.c:74:28: error: too few arguments to function
‘choose_best_statistics’
StatisticExtInfo *stat = choose_best_statistics(rel->statlist,
STATS_EXT_DEPENDENCIES,
^~~~~~~~~~~~~~~~~~~~~~
In file included from clausesel.c:24:
../../../../src/include/statistics/statistics.h:123:26: note: declared here
exter
I am changing the status to "Waiting on Author".
--
Ibrar Ahmed
On 19 Jul 2021, at 12:52, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
The patch does not compile, and needs your attention.
https://cirrus-ci.com/task/6397726985289728 <https://cirrus-ci.com/task/6397726985289728>
clausesel.c:74:28: error: too few arguments to function ‘choose_best_statistics’
StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
^~~~~~~~~~~~~~~~~~~~~~
In file included from clausesel.c:24:
../../../../src/include/statistics/statistics.h:123:26: note: declared here
exterI am changing the status to "Waiting on Author".
And since this is still the case one CF later, I'm marking this as RwF.
--
Daniel Gustafsson https://vmware.com/