Columns correlation and adaptive query optimization

Started by Konstantin Knizhnikabout 6 years ago26 messages

k.knizhnik@postgrespro.ru

about 6 years ago

2 attachment(s)

Hi hackers,

Errors in selectivity estimations is one of the main reason of bad plans
generation by Postgres optimizer.
Postgres estimates selectivity based on the collected statistic
(histograms).
While it is able to more or less precisely estimated selectivity of
simple predicate for particular table,
it is much more difficult to estimate selectivity for result of join of
several tables and for complex predicate consisting of several
conjuncts/disjuncts
accessing different columns.

Postgres is not able to take in account correlation between columns
unless correspondent multicolumn statistic is explicitly created.
But even if such statistic is created, it can not be used in join
selectivity estimation.

The problem with adjusting selectivity using machine learning based on
the results of EXPLAIN ANALYZE was address in AQO project:

https://github.com/postgrespro/aqo

There are still many issues with proposed AQO approach (for example, it
doesn't take in account concrete constant values).
We are going to continue its improvement.

But here I wan to propose much simpler patch which allows two things:
1. Use extended statistic in estimation of join selectivity
2. Create on demand multicolumn statistic in auto_explain extension if
there is larger gap between real and estimated number of tuples for the
concrete plan node.

create table inner_tab(x integer, y integer);
create table outer_tab(pk integer primary key, x integer, y integer);
create index on inner_tab(x,y);
insert into outer_tab values (generate_series(1,100000),
generate_series(1,100000), generate_series(1,100000)*10);
insert into inner_tab values (generate_series(1,1000000)/10,
generate_series(1,1000000)/10*10);
analyze inner_tab;
analyze outer_tab;

Without this patch:
explain select * from outer_tab join inner_tab using(x,y) where pk=1;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
Nested Loop (cost=0.72..16.77 rows=1 width=12)
   -> Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31
rows=1 width=12)
         Index Cond: (pk = 1)
   -> Index Only Scan using inner_tab_x_y_idx on inner_tab
(cost=0.42..8.45 rows=1 width=8)
         Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)

With this patch:

load 'auto_explain';
set auto_explain.log_min_duration=0;
set auto_explain.add_statistics_threshold=10.0;
set auto_explain.log_analyze=on;
select * from outer_tab join inner_tab using(x,y) where pk=1;
analyze inner_tab;
analyze outer_tab;

explain select * from outer_tab join inner_tab using(x,y) where pk=1;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
Nested Loop (cost=0.72..32.79 rows=10 width=12)
   -> Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31
rows=1 width=12)
         Index Cond: (pk = 1)
   -> Index Only Scan using inner_tab_x_y_idx on inner_tab
(cost=0.42..24.38 rows=10 width=8)
         Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)

As you can see now estimation of join result is correct (10).

I attached two patches: one for using extended statistic for join
selectivity estimation and another for auto_explain to implicitly add
this extended statistic on demand.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

join_selectivity.patchtext/x-patch; name=join_selectivity.patchDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 4bf777d..a356df9 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,7 @@
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 
 /*
@@ -159,6 +160,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -170,6 +174,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
 								  varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -181,7 +188,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -213,6 +219,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -224,6 +231,55 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varno != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varno == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						RelOptInfo* rel = find_base_rel(root, innerRelid);
+						if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+						{
+							StatisticExtInfo *stat = choose_best_statistics(rel->statlist, clauses_attnums,
+																			STATS_EXT_DEPENDENCIES);
+							if (stat != NULL)
+							{
+								MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+								MVDependency *strongest = NULL;
+								int i;
+								for (i = 0; i < dependencies->ndeps; i++)
+								{
+									MVDependency *dependency = dependencies->deps[i];
+									int n_dep_vars = dependency->nattributes - 1;
+									/* Dependency implies attribute */
+									if (var->varattno == dependency->attributes[n_dep_vars])
+									{
+										while (--n_dep_vars >= 0
+											   && bms_is_member(dependency->attributes[n_dep_vars], clauses_attnums));
+										if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+											strongest = dependency;
+									}
+								}
+								if (strongest)
+								{
+									Selectivity dep_sel = (strongest->degree + (1 - strongest->degree) * s1);
+									s1 = Min(dep_sel, s2);
+									continue;
+								}
+							}
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -249,7 +305,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:

auto_explain.patchtext/x-patch; name=auto_explain.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index a9536c2..0f19448 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,25 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
 #include "access/parallel.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -34,6 +47,7 @@ static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +232,19 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -353,6 +380,217 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				if (cols == NULL || var->varnoold == varno)
+				{
+					varno = var->varnoold;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 &&
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		if (list_length(cols) >= 2)
+		{
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+
+			list_sort(cols, vars_list_comparator);
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+				stat_name = psprintf("%s_%s", stat_name, strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields)));
+			elog(LOG, "Add statistic %s", stat_name);
+
+			/*
+			 * Check if multicolumn if multicolumn statistic object with such name already exists
+			 * (most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!SearchSysCacheExists2(STATEXTNAMENSP,
+									   CStringGetDatum(stat_name),
+									   ObjectIdGetDatum(get_rel_namespace(rte->relid))))
+			{
+				stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+				stats->if_not_exists = true;
+				stats->relations = list_make1(rel);
+				stats->exprs = cols;
+				CreateStatistics(stats);
+			}
+		}
+	}
+}
+
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(LOG, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -392,6 +630,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';

legrand legrand

legrand_legrand@hotmail.com

about 6 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Columns correlation and adaptive query optimization

Hello Konstantin,

What you have proposed regarding join_selectivity and multicolumn statistics
is a very good new !

Regarding your auto_explain modification, maybe an "advisor" mode would also
be helpfull (with auto_explain_add_statistics_threshold=-1 for exemple).
This would allow to track which missing statistic should be tested (manually
or in an other environment).

In my point of view this advice should be an option of the EXPLAIN command,
that should also permit
auto_explain module to propose "learning" phase.

Regards
PAscal

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

about 6 years ago

In reply to: legrand legrand (#2)

2 attachment(s)

Re: Columns correlation and adaptive query optimization

On 15.10.2019 1:20, legrand legrand wrote:

Hello Konstantin,

What you have proposed regarding join_selectivity and multicolumn statistics
is a very good new !

Regarding your auto_explain modification, maybe an "advisor" mode would also
be helpfull (with auto_explain_add_statistics_threshold=-1 for exemple).
This would allow to track which missing statistic should be tested (manually
or in an other environment).

In my point of view this advice should be an option of the EXPLAIN command,
that should also permit
auto_explain module to propose "learning" phase.

Thank you for good suggestion. Advisor mode is really good idea.
I have added "auto_explain.suggest_only" GUC.
When it is switched on, suggested CREATE STATISTICS statement is just
printed in log but not actually created.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

join_selectivity.patchtext/x-patch; name=join_selectivity.patchDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 4bf777d..a356df9 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,7 @@
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 
 /*
@@ -159,6 +160,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -170,6 +174,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
 								  varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -181,7 +188,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -213,6 +219,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -224,6 +231,55 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varno != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varno == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						RelOptInfo* rel = find_base_rel(root, innerRelid);
+						if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+						{
+							StatisticExtInfo *stat = choose_best_statistics(rel->statlist, clauses_attnums,
+																			STATS_EXT_DEPENDENCIES);
+							if (stat != NULL)
+							{
+								MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+								MVDependency *strongest = NULL;
+								int i;
+								for (i = 0; i < dependencies->ndeps; i++)
+								{
+									MVDependency *dependency = dependencies->deps[i];
+									int n_dep_vars = dependency->nattributes - 1;
+									/* Dependency implies attribute */
+									if (var->varattno == dependency->attributes[n_dep_vars])
+									{
+										while (--n_dep_vars >= 0
+											   && bms_is_member(dependency->attributes[n_dep_vars], clauses_attnums));
+										if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+											strongest = dependency;
+									}
+								}
+								if (strongest)
+								{
+									Selectivity dep_sel = (strongest->degree + (1 - strongest->degree) * s1);
+									s1 = Min(dep_sel, s2);
+									continue;
+								}
+							}
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -249,7 +305,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:

auto_explain_create_statistics.patchtext/x-patch; name=auto_explain_create_statistics.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index a9536c2..915a204 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,25 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
 #include "access/parallel.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +46,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +233,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -353,6 +392,230 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				if (cols == NULL || var->varnoold == varno)
+				{
+					varno = var->varnoold;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 &&
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		if (list_length(cols) >= 2)
+		{
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+
+			list_sort(cols, vars_list_comparator);
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!SearchSysCacheExists2(STATEXTNAMENSP,
+									   CStringGetDatum(stat_name),
+									   ObjectIdGetDatum(get_rel_namespace(rte->relid))))
+			{
+				if (auto_explain_suggest_only)
+				{
+					elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
+				}
+				else
+				{
+					elog(LOG, "Add statistics %s", stat_name);
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+		}
+	}
+}
+
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -392,6 +655,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

about 6 years ago

In reply to: Konstantin Knizhnik (#3)

2 attachment(s)

Re: Columns correlation and adaptive query optimization

Smarter version of join selectivity patch handling cases like this:

explain select * from outer_tab join inner_tab using(x,y) where x=1;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
Nested Loop (cost=0.42..1815.47 rows=10 width=12)
   Join Filter: (outer_tab.y = inner_tab.y)
   -> Seq Scan on outer_tab (cost=0.00..1791.00 rows=1 width=12)
         Filter: (x = 1)
   -> Index Only Scan using inner_tab_x_y_idx on inner_tab
(cost=0.42..24.35 rows=10 width=8)
         Index Cond: (x = 1)
(6 rows)

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

auto_explain_create_statistics.patchtext/x-patch; name=auto_explain_create_statistics.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index a9536c2..915a204 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,25 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
 #include "access/parallel.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +46,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +233,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -353,6 +392,230 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				if (cols == NULL || var->varnoold == varno)
+				{
+					varno = var->varnoold;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 &&
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		if (list_length(cols) >= 2)
+		{
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+
+			list_sort(cols, vars_list_comparator);
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!SearchSysCacheExists2(STATEXTNAMENSP,
+									   CStringGetDatum(stat_name),
+									   ObjectIdGetDatum(get_rel_namespace(rte->relid))))
+			{
+				if (auto_explain_suggest_only)
+				{
+					elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
+				}
+				else
+				{
+					elog(LOG, "Add statistics %s", stat_name);
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+		}
+	}
+}
+
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -392,6 +655,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';

join_selectivity-2.patchtext/x-patch; name=join_selectivity-2.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index a9536c2..915a204 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,25 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
 #include "access/parallel.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +46,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +233,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -353,6 +392,230 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				if (cols == NULL || var->varnoold == varno)
+				{
+					varno = var->varnoold;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 &&
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		if (list_length(cols) >= 2)
+		{
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+
+			list_sort(cols, vars_list_comparator);
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!SearchSysCacheExists2(STATEXTNAMENSP,
+									   CStringGetDatum(stat_name),
+									   ObjectIdGetDatum(get_rel_namespace(rte->relid))))
+			{
+				if (auto_explain_suggest_only)
+				{
+					elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
+				}
+				else
+				{
+					elog(LOG, "Add statistics %s", stat_name);
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+		}
+	}
+}
+
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -392,6 +655,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 4bf777d..085defe 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,7 @@
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 
 /*
@@ -106,6 +107,47 @@ clauselist_selectivity(PlannerInfo *root,
 }
 
 /*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid:   index of relation to which all considered attributes belong
+ * var:     variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+	RelOptInfo* rel = find_base_rel(root, relid);
+	if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		StatisticExtInfo *stat = choose_best_statistics(rel->statlist, attnums,
+														STATS_EXT_DEPENDENCIES);
+		if (stat != NULL)
+		{
+			MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+			MVDependency *strongest = NULL;
+			int i;
+			for (i = 0; i < dependencies->ndeps; i++)
+			{
+				MVDependency *dependency = dependencies->deps[i];
+				int n_dep_vars = dependency->nattributes - 1;
+				/* Dependency implies attribute */
+				if (var->varattno == dependency->attributes[n_dep_vars])
+				{
+					while (--n_dep_vars >= 0
+						   && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+					if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+						strongest = dependency;
+				}
+			}
+			if (strongest)
+				return strongest->degree;
+		}
+	}
+	return 0.0;
+}
+
+/*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
  *	  expression clauses.  The list can be empty, in which case 1.0
@@ -159,6 +201,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -170,6 +215,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
 								  varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -181,7 +229,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -213,6 +260,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -224,6 +272,30 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varnoold != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varattno >= 0 && var->varnoold == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							s1 *= dep + (1 - dep) * s2;
+							continue;
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -249,7 +321,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593..f6714e4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4543,6 +4543,30 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
 	return nrows;
 }
 
+
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{
+	ListCell* lc;
+	Bitmapset *attnums = NULL;
+	Index relid = var->varnoold;
+
+	foreach (lc, clause_vars)
+	{
+		Var* join_var = (Var*)lfirst(lc);
+		if (join_var->varnoold == relid && join_var->varattno >= 0)
+			attnums = bms_add_member(attnums, join_var->varattno);
+	}
+	return attnums ? find_var_dependency(root, relid, var, bms_add_member(attnums, var->varattno)) : 0.0;
+}
+
 /*
  * calc_joinrel_size_estimate
  *		Workhorse for set_joinrel_size_estimates and
@@ -4639,6 +4663,39 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 		pselec = 0.0;			/* not used, keep compiler quiet */
 	}
 
+	/* Try to take in account functional dependencies between attributes of clauses pushed-down to joined relations and
+	 * retstrictlist clause. Right now we consider only case of restrictlist consists of one clause.
+	 */
+	if (list_length(restrictlist) == 1)
+	{
+		RestrictInfo* rinfo = linitial(restrictlist);
+		Expr* clause = rinfo->clause;
+
+		Assert(IsA(rinfo, RestrictInfo));
+
+		if (is_opclause(clause))
+		{
+			OpExpr *expr = (OpExpr *) clause;
+			ListCell* lc;
+			List* join_vars = NULL;
+
+			/* Get list of all attributes in pushed-down clauses */
+			foreach (lc, outer_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+			foreach (lc, inner_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+
+			foreach (lc, expr->args)
+			{
+				Var *var = (Var*) lfirst(lc);
+				if (IsA(var, Var) && var->varattno >= 0)
+				{
+					double dep = var_depends_on(root, var, join_vars);
+					jselec = jselec*(1.0 - dep) + dep;
+				}
+			}
+		}
+	}
 	/*
 	 * Basically, we multiply size of Cartesian product by selectivity.
 	 *
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 2f9aeec..09ff0c4 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -55,4 +55,6 @@ extern void CommuteOpExpr(OpExpr *clause);
 extern Query *inline_set_returning_function(PlannerInfo *root,
 											RangeTblEntry *rte);
 
+extern double find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums);
+
 #endif							/* CLAUSES_H */

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

about 6 years ago

In reply to: Konstantin Knizhnik (#4)

1 attachment(s)

Re: Columns correlation and adaptive query optimization

New version of patch implicitly adding multicolumn statistic in
auto_explain extension and using it in optimizer for more precise
estimation of join selectivity.
This patch fixes race condition while adding statistics and restricts
generated statistic name to fit in 64 bytes (NameData).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

auto_explain_create_statistic-3.patchtext/x-patch; name=auto_explain_create_statistic-3.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index a9536c2..3f9d6ef 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +53,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +240,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -353,6 +399,256 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				if (cols == NULL || var->varnoold == varno)
+				{
+					varno = var->varnoold;
+					if (var->varoattno > 0 &&
+						!bms_is_member(var->varoattno, colmap) &&
+						varno >= 1 &&
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varoattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varoattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		if (list_length(cols) >= 2)
+		{
+			RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+			ScanKeyData entry[2];
+			TableScanDesc scan;
+			Relation stat_rel;
+			size_t name_len;
+			TupleTableSlot *slot;
+
+			list_sort(cols, vars_list_comparator);
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+
+			name_len = strlen(stat_name);
+			if (name_len >= NAMEDATALEN)
+				stat_name = psprintf("%.*s_%08x", NAMEDATALEN - 10, stat_name, (unsigned)hash_any((uint8*)stat_name, name_len));
+
+			ScanKeyInit(&entry[0],
+						Anum_pg_statistic_ext_stxname,
+						BTEqualStrategyNumber, F_NAMEEQ,
+						CStringGetDatum(stat_name));
+			ScanKeyInit(&entry[1],
+						Anum_pg_statistic_ext_stxnamespace,
+						BTEqualStrategyNumber, F_OIDEQ,
+						ObjectIdGetDatum(get_rel_namespace(rte->relid)));
+
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
+
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!table_scan_getnextslot(scan, ForwardScanDirection, slot))
+			{
+				if (auto_explain_suggest_only)
+				{
+					elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
+				}
+				else
+				{
+					elog(LOG, "Add statistics %s", stat_name);
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+			table_close(stat_rel, AccessExclusiveLock);
+		}
+	}
+}
+
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -392,6 +688,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 4bf777d..085defe 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,7 @@
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 
 /*
@@ -106,6 +107,47 @@ clauselist_selectivity(PlannerInfo *root,
 }
 
 /*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid:   index of relation to which all considered attributes belong
+ * var:     variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+	RelOptInfo* rel = find_base_rel(root, relid);
+	if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		StatisticExtInfo *stat = choose_best_statistics(rel->statlist, attnums,
+														STATS_EXT_DEPENDENCIES);
+		if (stat != NULL)
+		{
+			MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+			MVDependency *strongest = NULL;
+			int i;
+			for (i = 0; i < dependencies->ndeps; i++)
+			{
+				MVDependency *dependency = dependencies->deps[i];
+				int n_dep_vars = dependency->nattributes - 1;
+				/* Dependency implies attribute */
+				if (var->varattno == dependency->attributes[n_dep_vars])
+				{
+					while (--n_dep_vars >= 0
+						   && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+					if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+						strongest = dependency;
+				}
+			}
+			if (strongest)
+				return strongest->degree;
+		}
+	}
+	return 0.0;
+}
+
+/*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
  *	  expression clauses.  The list can be empty, in which case 1.0
@@ -159,6 +201,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -170,6 +215,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
 								  varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -181,7 +229,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -213,6 +260,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -224,6 +272,30 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varnoold != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varattno >= 0 && var->varnoold == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							s1 *= dep + (1 - dep) * s2;
+							continue;
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -249,7 +321,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593..f6714e4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4543,6 +4543,30 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
 	return nrows;
 }
 
+
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{
+	ListCell* lc;
+	Bitmapset *attnums = NULL;
+	Index relid = var->varnoold;
+
+	foreach (lc, clause_vars)
+	{
+		Var* join_var = (Var*)lfirst(lc);
+		if (join_var->varnoold == relid && join_var->varattno >= 0)
+			attnums = bms_add_member(attnums, join_var->varattno);
+	}
+	return attnums ? find_var_dependency(root, relid, var, bms_add_member(attnums, var->varattno)) : 0.0;
+}
+
 /*
  * calc_joinrel_size_estimate
  *		Workhorse for set_joinrel_size_estimates and
@@ -4639,6 +4663,39 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 		pselec = 0.0;			/* not used, keep compiler quiet */
 	}
 
+	/* Try to take in account functional dependencies between attributes of clauses pushed-down to joined relations and
+	 * retstrictlist clause. Right now we consider only case of restrictlist consists of one clause.
+	 */
+	if (list_length(restrictlist) == 1)
+	{
+		RestrictInfo* rinfo = linitial(restrictlist);
+		Expr* clause = rinfo->clause;
+
+		Assert(IsA(rinfo, RestrictInfo));
+
+		if (is_opclause(clause))
+		{
+			OpExpr *expr = (OpExpr *) clause;
+			ListCell* lc;
+			List* join_vars = NULL;
+
+			/* Get list of all attributes in pushed-down clauses */
+			foreach (lc, outer_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+			foreach (lc, inner_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+
+			foreach (lc, expr->args)
+			{
+				Var *var = (Var*) lfirst(lc);
+				if (IsA(var, Var) && var->varattno >= 0)
+				{
+					double dep = var_depends_on(root, var, join_vars);
+					jselec = jselec*(1.0 - dep) + dep;
+				}
+			}
+		}
+	}
 	/*
 	 * Basically, we multiply size of Cartesian product by selectivity.
 	 *
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 2f9aeec..09ff0c4 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -55,4 +55,6 @@ extern void CommuteOpExpr(OpExpr *clause);
 extern Query *inline_set_returning_function(PlannerInfo *root,
 											RangeTblEntry *rte);
 
+extern double find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums);
+
 #endif							/* CLAUSES_H */

David Steele

david@pgmasters.net

almost 6 years ago

In reply to: Konstantin Knizhnik (#5)

Re: Columns correlation and adaptive query optimization

On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:

New version of patch implicitly adding multicolumn statistic in
auto_explain extension and using it in optimizer for more precise
estimation of join selectivity.
This patch fixes race condition while adding statistics and restricts
generated statistic name to fit in 64 bytes (NameData).

This patch no longer applies: https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Author.

Regards,
--
-David
david@pgmasters.net

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 6 years ago

In reply to: David Steele (#6)

1 attachment(s)

Re: Columns correlation and adaptive query optimization

On 24.03.2020 20:12, David Steele wrote:

On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:

New version of patch implicitly adding multicolumn statistic in
auto_explain extension and using it in optimizer for more precise
estimation of join selectivity.
This patch fixes race condition while adding statistics and restricts
generated statistic name to fit in 64 bytes (NameData).

This patch no longer applies: https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Author.

Regards,

Rebased patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

auto_explain_create_statistic-4.patchtext/x-patch; name=auto_explain_create_statistic-4.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f69dde8..9e15913 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +53,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +240,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -349,6 +395,256 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				if (cols == NULL || var->varnoold == varno)
+				{
+					varno = var->varnoold;
+					if (var->varoattno > 0 &&
+						!bms_is_member(var->varoattno, colmap) &&
+						varno >= 1 &&
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varoattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varoattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		if (list_length(cols) >= 2)
+		{
+			RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+			ScanKeyData entry[2];
+			TableScanDesc scan;
+			Relation stat_rel;
+			size_t name_len;
+			TupleTableSlot *slot;
+
+			list_sort(cols, vars_list_comparator);
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+
+			name_len = strlen(stat_name);
+			if (name_len >= NAMEDATALEN)
+				stat_name = psprintf("%.*s_%08x", NAMEDATALEN - 10, stat_name, (unsigned)hash_any((uint8*)stat_name, name_len));
+
+			ScanKeyInit(&entry[0],
+						Anum_pg_statistic_ext_stxname,
+						BTEqualStrategyNumber, F_NAMEEQ,
+						CStringGetDatum(stat_name));
+			ScanKeyInit(&entry[1],
+						Anum_pg_statistic_ext_stxnamespace,
+						BTEqualStrategyNumber, F_OIDEQ,
+						ObjectIdGetDatum(get_rel_namespace(rte->relid)));
+
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
+
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!table_scan_getnextslot(scan, ForwardScanDirection, slot))
+			{
+				if (auto_explain_suggest_only)
+				{
+					elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
+				}
+				else
+				{
+					elog(LOG, "Add statistics %s", stat_name);
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+			table_close(stat_rel, AccessExclusiveLock);
+		}
+	}
+}
+
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -388,6 +684,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10..96ab7d3 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,8 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
+#include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 /*
  * Data structure for accumulating info about possible range-query
@@ -105,6 +107,47 @@ clauselist_selectivity(PlannerInfo *root,
 }
 
 /*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid:   index of relation to which all considered attributes belong
+ * var:     variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+	RelOptInfo* rel = find_base_rel(root, relid);
+	if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
+														&attnums, 1);
+		if (stat != NULL)
+		{
+			MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+			MVDependency *strongest = NULL;
+			int i;
+			for (i = 0; i < dependencies->ndeps; i++)
+			{
+				MVDependency *dependency = dependencies->deps[i];
+				int n_dep_vars = dependency->nattributes - 1;
+				/* Dependency implies attribute */
+				if (var->varattno == dependency->attributes[n_dep_vars])
+				{
+					while (--n_dep_vars >= 0
+						   && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+					if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+						strongest = dependency;
+				}
+			}
+			if (strongest)
+				return strongest->degree;
+		}
+	}
+	return 0.0;
+}
+
+/*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
  *	  expression clauses.  The list can be empty, in which case 1.0
@@ -158,6 +201,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -169,6 +215,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
 								  varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -180,7 +229,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -212,6 +260,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -223,6 +272,30 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varno != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varattno >= 0 && var->varno == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							s1 *= dep + (1 - dep) * s2;
+							continue;
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -248,7 +321,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b..a838312 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4611,6 +4611,30 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
 	return nrows;
 }
 
+
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{
+	ListCell* lc;
+	Bitmapset *attnums = NULL;
+	Index relid = var->varno;
+
+	foreach (lc, clause_vars)
+	{
+		Var* join_var = (Var*)lfirst(lc);
+		if (join_var->varno == relid && join_var->varattno >= 0)
+			attnums = bms_add_member(attnums, join_var->varattno);
+	}
+	return attnums ? find_var_dependency(root, relid, var, bms_add_member(attnums, var->varattno)) : 0.0;
+}
+
 /*
  * calc_joinrel_size_estimate
  *		Workhorse for set_joinrel_size_estimates and
@@ -4707,6 +4731,39 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 		pselec = 0.0;			/* not used, keep compiler quiet */
 	}
 
+	/* Try to take in account functional dependencies between attributes of clauses pushed-down to joined relations and
+	 * retstrictlist clause. Right now we consider only case of restrictlist consists of one clause.
+	 */
+	if (list_length(restrictlist) == 1)
+	{
+		RestrictInfo* rinfo = linitial(restrictlist);
+		Expr* clause = rinfo->clause;
+
+		Assert(IsA(rinfo, RestrictInfo));
+
+		if (is_opclause(clause))
+		{
+			OpExpr *expr = (OpExpr *) clause;
+			ListCell* lc;
+			List* join_vars = NULL;
+
+			/* Get list of all attributes in pushed-down clauses */
+			foreach (lc, outer_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+			foreach (lc, inner_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+
+			foreach (lc, expr->args)
+			{
+				Var *var = (Var*) lfirst(lc);
+				if (IsA(var, Var) && var->varattno >= 0)
+				{
+					double dep = var_depends_on(root, var, join_vars);
+					jselec = jselec*(1.0 - dep) + dep;
+				}
+			}
+		}
+	}
 	/*
 	 * Basically, we multiply size of Cartesian product by selectivity.
 	 *
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index b7456e3..25bc0b2 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -55,4 +55,6 @@ extern void CommuteOpExpr(OpExpr *clause);
 extern Query *inline_set_returning_function(PlannerInfo *root,
 											RangeTblEntry *rte);
 
+extern double find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums);
+
 #endif							/* CLAUSES_H */

David Steele

david@pgmasters.net

almost 6 years ago

In reply to: Konstantin Knizhnik (#7)

Re: Columns correlation and adaptive query optimization

On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:

On 24.03.2020 20:12, David Steele wrote:

On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:

New version of patch implicitly adding multicolumn statistic in
auto_explain extension and using it in optimizer for more precise
estimation of join selectivity.
This patch fixes race condition while adding statistics and restricts
generated statistic name to fit in 64 bytes (NameData).

This patch no longer applies: https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Autho

Rebased patch is attached.

The patch applies now but there are error on Windows and Linux:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481
https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979

Regards,
--
-David
david@pgmasters.net

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 6 years ago

In reply to: David Steele (#8)

1 attachment(s)

Re: Columns correlation and adaptive query optimization

On 25.03.2020 16:00, David Steele wrote:

On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:

On 24.03.2020 20:12, David Steele wrote:

On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:

New version of patch implicitly adding multicolumn statistic in
auto_explain extension and using it in optimizer for more precise
estimation of join selectivity.
This patch fixes race condition while adding statistics and
restricts generated statistic name to fit in 64 bytes (NameData).

This patch no longer applies:
https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Autho

Rebased patch is attached.

The patch applies now but there are error on Windows and Linux:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481

https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979

Regards,

Sorry, yet another patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

auto_explain_create_statistic-5.patchtext/x-patch; name=auto_explain_create_statistic-5.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f69dde8..8714d94 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +53,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +240,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -349,6 +395,256 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				if (cols == NULL || var->varno == varno)
+				{
+					varno = var->varno;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 &&
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		if (list_length(cols) >= 2)
+		{
+			RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+			ScanKeyData entry[2];
+			TableScanDesc scan;
+			Relation stat_rel;
+			size_t name_len;
+			TupleTableSlot *slot;
+
+			list_sort(cols, vars_list_comparator);
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+
+			name_len = strlen(stat_name);
+			if (name_len >= NAMEDATALEN)
+				stat_name = psprintf("%.*s_%08x", NAMEDATALEN - 10, stat_name, (unsigned)hash_any((uint8*)stat_name, name_len));
+
+			ScanKeyInit(&entry[0],
+						Anum_pg_statistic_ext_stxname,
+						BTEqualStrategyNumber, F_NAMEEQ,
+						CStringGetDatum(stat_name));
+			ScanKeyInit(&entry[1],
+						Anum_pg_statistic_ext_stxnamespace,
+						BTEqualStrategyNumber, F_OIDEQ,
+						ObjectIdGetDatum(get_rel_namespace(rte->relid)));
+
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
+
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!table_scan_getnextslot(scan, ForwardScanDirection, slot))
+			{
+				if (auto_explain_suggest_only)
+				{
+					elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
+				}
+				else
+				{
+					elog(LOG, "Add statistics %s", stat_name);
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+			table_close(stat_rel, AccessExclusiveLock);
+		}
+	}
+}
+
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -388,6 +684,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10..96ab7d3 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,8 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
+#include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 /*
  * Data structure for accumulating info about possible range-query
@@ -105,6 +107,47 @@ clauselist_selectivity(PlannerInfo *root,
 }
 
 /*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid:   index of relation to which all considered attributes belong
+ * var:     variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+	RelOptInfo* rel = find_base_rel(root, relid);
+	if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
+														&attnums, 1);
+		if (stat != NULL)
+		{
+			MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+			MVDependency *strongest = NULL;
+			int i;
+			for (i = 0; i < dependencies->ndeps; i++)
+			{
+				MVDependency *dependency = dependencies->deps[i];
+				int n_dep_vars = dependency->nattributes - 1;
+				/* Dependency implies attribute */
+				if (var->varattno == dependency->attributes[n_dep_vars])
+				{
+					while (--n_dep_vars >= 0
+						   && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+					if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+						strongest = dependency;
+				}
+			}
+			if (strongest)
+				return strongest->degree;
+		}
+	}
+	return 0.0;
+}
+
+/*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
  *	  expression clauses.  The list can be empty, in which case 1.0
@@ -158,6 +201,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -169,6 +215,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
 								  varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -180,7 +229,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -212,6 +260,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -223,6 +272,30 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varno != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varattno >= 0 && var->varno == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							s1 *= dep + (1 - dep) * s2;
+							continue;
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -248,7 +321,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b..a838312 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4611,6 +4611,30 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
 	return nrows;
 }
 
+
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{
+	ListCell* lc;
+	Bitmapset *attnums = NULL;
+	Index relid = var->varno;
+
+	foreach (lc, clause_vars)
+	{
+		Var* join_var = (Var*)lfirst(lc);
+		if (join_var->varno == relid && join_var->varattno >= 0)
+			attnums = bms_add_member(attnums, join_var->varattno);
+	}
+	return attnums ? find_var_dependency(root, relid, var, bms_add_member(attnums, var->varattno)) : 0.0;
+}
+
 /*
  * calc_joinrel_size_estimate
  *		Workhorse for set_joinrel_size_estimates and
@@ -4707,6 +4731,39 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 		pselec = 0.0;			/* not used, keep compiler quiet */
 	}
 
+	/* Try to take in account functional dependencies between attributes of clauses pushed-down to joined relations and
+	 * retstrictlist clause. Right now we consider only case of restrictlist consists of one clause.
+	 */
+	if (list_length(restrictlist) == 1)
+	{
+		RestrictInfo* rinfo = linitial(restrictlist);
+		Expr* clause = rinfo->clause;
+
+		Assert(IsA(rinfo, RestrictInfo));
+
+		if (is_opclause(clause))
+		{
+			OpExpr *expr = (OpExpr *) clause;
+			ListCell* lc;
+			List* join_vars = NULL;
+
+			/* Get list of all attributes in pushed-down clauses */
+			foreach (lc, outer_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+			foreach (lc, inner_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+
+			foreach (lc, expr->args)
+			{
+				Var *var = (Var*) lfirst(lc);
+				if (IsA(var, Var) && var->varattno >= 0)
+				{
+					double dep = var_depends_on(root, var, join_vars);
+					jselec = jselec*(1.0 - dep) + dep;
+				}
+			}
+		}
+	}
 	/*
 	 * Basically, we multiply size of Cartesian product by selectivity.
 	 *
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index b7456e3..25bc0b2 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -55,4 +55,6 @@ extern void CommuteOpExpr(OpExpr *clause);
 extern Query *inline_set_returning_function(PlannerInfo *root,
 											RangeTblEntry *rte);
 
+extern double find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums);
+
 #endif							/* CLAUSES_H */

#10

Rafia Sabih

rafia.pghackers@gmail.com

almost 6 years ago

In reply to: Konstantin Knizhnik (#9)

Re: Columns correlation and adaptive query optimization

Hello,

This sounded like an interesting addition to postgresql. I gave some
time to it today to review, here are few comments,

On Wed, 25 Mar 2020 at 14:28, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

On 25.03.2020 16:00, David Steele wrote:

On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:

On 24.03.2020 20:12, David Steele wrote:

On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:

New version of patch implicitly adding multicolumn statistic in
auto_explain extension and using it in optimizer for more precise
estimation of join selectivity.
This patch fixes race condition while adding statistics and
restricts generated statistic name to fit in 64 bytes (NameData).

This patch no longer applies:
https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Autho

Rebased patch is attached.

The patch applies now but there are error on Windows and Linux:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481

https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979

Regards,

Sorry, yet another patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+

This doesn't look like the right place for it, you might want to
declare it with other functions in the starting of the file.

Also, there is no description about any of the functions here,
wouldn’t hurt having some more comments there.

A few of more questions that cross my mind at this point,

- have you tried measuring the extra cost we have to pay for this
mores statistics , and also compare it with the benefit it gives in
terms of accuracy.
- I would also be interested in understanding if there are cases when
adding this extra step doesn’t help and have you excluded them already
or if some of them are easily identifiable at this stage...?
- is there any limit on the number of columns for which this will
work, or should there be any such limit...?

--
Regards,
Rafia Sabih

#11

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 6 years ago

In reply to: Rafia Sabih (#10)

Re: Columns correlation and adaptive query optimization

Thank you very much for review.

On 25.03.2020 20:04, Rafia Sabih wrote:

+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
This doesn't look like the right place for it, you might want to
declare it with other functions in the starting of the file.

Also, there is no description about any of the functions here,
wouldn’t hurt having some more comments there.

Sorry, I will fix it.
Actually this patch contains of two independent parts:
first allows to use auto_explain extension to generate mutlicolumn
statistic for variables used in clauses
for which selectivity estimation gives wrong result. It affects only
auto_explain extension.

Second part allows to use multicolumn statistic for join selectivity
estimation.
As far as I know extended statistic is now actively improved:
/messages/by-id/20200309000157.ig5tcrynvaqu4ixd@development

I think that using extended statistic for join selectivity is very
important and should also be addressed.
If my approach is on so good, I will be pleased for other suggestions.

A few of more questions that cross my mind at this point,

- have you tried measuring the extra cost we have to pay for this
mores statistics , and also compare it with the benefit it gives in
terms of accuracy.

Adding statistic not always leads to performance improvement but I never
observed any performance degradation caused by presence of extended
statistic.
Definitely we can manually create too many extended statistic entries
for different subsets of columns.
And it certainly increase planning time because optimizer has to
consider more alternatives.
But in practice I never noticed such slowdown.

- I would also be interested in understanding if there are cases when
adding this extra step doesn’t help and have you excluded them already
or if some of them are easily identifiable at this stage...?

Unfortunately there are many cases when extended statistic can not help.
Either because optimizer is not able to use it (for example my patch
consider only cases with strict equality comparison,
but if you use predicate like "a.x=b.x and a.y in (1,2,3)" then
extended statistic for <x,y> can not be used.
Either because collected statistic itself is not precise enough ,
especially in case of data skews.

- is there any limit on the number of columns for which this will
work, or should there be any such limit...?

Right now there is limit for maximal number of columns used in extended
statistic: 8 columns.
But in practice I rarely see join predicates involving more than 3 columns.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#12

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 6 years ago

In reply to: Rafia Sabih (#10)

1 attachment(s)

Re: Columns correlation and adaptive query optimization

On 25.03.2020 20:04, Rafia Sabih wrote:

Also, there is no description about any of the functions here,
wouldn’t hurt having some more comments there.

Attached please find new version of the patch with more comments and
descriptions added.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

auto_explain_create_statistic-6.patchtext/x-patch; name=auto_explain_create_statistic-6.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f69dde8..8ab95a1 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +53,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -84,6 +106,7 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
 
 /*
  * Module load callback
@@ -218,6 +241,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -349,6 +396,276 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+/**
+ * Try to add multicolumn statistics for specified subplans.
+ */
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for plan subnodes.
+ */
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									   ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+/**
+ * Comparator used to sort Vars by name
+ */
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+/**
+ * Try to add multicolumn statistics for qual
+ */
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	/* Extract vars from all quals */
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+
+	/* Loop until we considered all vars */
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		/* Contruct list of unique vars */
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				if (cols == NULL || var->varno == varno)
+				{
+					varno = var->varno;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 && /* not synthetic var */
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		/* To create multicolumn statitics we need to have at least 2 columns */
+		if (list_length(cols) >= 2)
+		{
+			RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+			ScanKeyData entry[2];
+			TableScanDesc scan;
+			Relation stat_rel;
+			size_t name_len;
+			TupleTableSlot *slot;
+
+			/* Sort variables by name */
+			list_sort(cols, vars_list_comparator);
+
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+
+			name_len = strlen(stat_name);
+			/* Truncate name if it doesn't fit in NameData */
+			if (name_len >= NAMEDATALEN)
+				stat_name = psprintf("%.*s_%08x", NAMEDATALEN - 10, stat_name, (unsigned)hash_any((uint8*)stat_name, name_len));
+
+			ScanKeyInit(&entry[0],
+						Anum_pg_statistic_ext_stxname,
+						BTEqualStrategyNumber, F_NAMEEQ,
+						CStringGetDatum(stat_name));
+			ScanKeyInit(&entry[1],
+						Anum_pg_statistic_ext_stxnamespace,
+						BTEqualStrategyNumber, F_OIDEQ,
+						ObjectIdGetDatum(get_rel_namespace(rte->relid)));
+
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
+
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!table_scan_getnextslot(scan, ForwardScanDirection, slot))
+			{
+				if (auto_explain_suggest_only)
+				{
+					elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
+				}
+				else
+				{
+					elog(LOG, "Add statistics %s", stat_name);
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+			table_close(stat_rel, AccessExclusiveLock);
+		}
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for node
+ */
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -388,6 +705,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10..96ab7d3 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,8 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
+#include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 /*
  * Data structure for accumulating info about possible range-query
@@ -105,6 +107,47 @@ clauselist_selectivity(PlannerInfo *root,
 }
 
 /*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid:   index of relation to which all considered attributes belong
+ * var:     variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+	RelOptInfo* rel = find_base_rel(root, relid);
+	if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
+														&attnums, 1);
+		if (stat != NULL)
+		{
+			MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+			MVDependency *strongest = NULL;
+			int i;
+			for (i = 0; i < dependencies->ndeps; i++)
+			{
+				MVDependency *dependency = dependencies->deps[i];
+				int n_dep_vars = dependency->nattributes - 1;
+				/* Dependency implies attribute */
+				if (var->varattno == dependency->attributes[n_dep_vars])
+				{
+					while (--n_dep_vars >= 0
+						   && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+					if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+						strongest = dependency;
+				}
+			}
+			if (strongest)
+				return strongest->degree;
+		}
+	}
+	return 0.0;
+}
+
+/*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
  *	  expression clauses.  The list can be empty, in which case 1.0
@@ -158,6 +201,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -169,6 +215,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
 								  varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -180,7 +229,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -212,6 +260,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -223,6 +272,30 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varno != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varattno >= 0 && var->varno == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							s1 *= dep + (1 - dep) * s2;
+							continue;
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -248,7 +321,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b..a838312 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4611,6 +4611,30 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
 	return nrows;
 }
 
+
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{
+	ListCell* lc;
+	Bitmapset *attnums = NULL;
+	Index relid = var->varno;
+
+	foreach (lc, clause_vars)
+	{
+		Var* join_var = (Var*)lfirst(lc);
+		if (join_var->varno == relid && join_var->varattno >= 0)
+			attnums = bms_add_member(attnums, join_var->varattno);
+	}
+	return attnums ? find_var_dependency(root, relid, var, bms_add_member(attnums, var->varattno)) : 0.0;
+}
+
 /*
  * calc_joinrel_size_estimate
  *		Workhorse for set_joinrel_size_estimates and
@@ -4707,6 +4731,39 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 		pselec = 0.0;			/* not used, keep compiler quiet */
 	}
 
+	/* Try to take in account functional dependencies between attributes of clauses pushed-down to joined relations and
+	 * retstrictlist clause. Right now we consider only case of restrictlist consists of one clause.
+	 */
+	if (list_length(restrictlist) == 1)
+	{
+		RestrictInfo* rinfo = linitial(restrictlist);
+		Expr* clause = rinfo->clause;
+
+		Assert(IsA(rinfo, RestrictInfo));
+
+		if (is_opclause(clause))
+		{
+			OpExpr *expr = (OpExpr *) clause;
+			ListCell* lc;
+			List* join_vars = NULL;
+
+			/* Get list of all attributes in pushed-down clauses */
+			foreach (lc, outer_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+			foreach (lc, inner_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+
+			foreach (lc, expr->args)
+			{
+				Var *var = (Var*) lfirst(lc);
+				if (IsA(var, Var) && var->varattno >= 0)
+				{
+					double dep = var_depends_on(root, var, join_vars);
+					jselec = jselec*(1.0 - dep) + dep;
+				}
+			}
+		}
+	}
 	/*
 	 * Basically, we multiply size of Cartesian product by selectivity.
 	 *
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index b7456e3..25bc0b2 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -55,4 +55,6 @@ extern void CommuteOpExpr(OpExpr *clause);
 extern Query *inline_set_returning_function(PlannerInfo *root,
 											RangeTblEntry *rte);
 
+extern double find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums);
+
 #endif							/* CLAUSES_H */

#13

Yugo NAGATA

nagata@sraoss.co.jp

almost 5 years ago

In reply to: Konstantin Knizhnik (#12)

Re: Columns correlation and adaptive query optimization

Hello,

On Thu, 26 Mar 2020 18:49:51 +0300
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Attached please find new version of the patch with more comments and
descriptions added.

Adaptive query optimization is very interesting feature for me, so I looked
into this patch. Here are some comments and questions.

(1)
This patch needs rebase because clauselist_selectivity was modified to improve
estimation of OR clauses.

(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic automatically
if an error on estimated rows number is large.

2. Improve rows number estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one clauses,
and also functional dependencies between a var in the join qual and vars in quals
of the inner/outer relation.

As you said, these two parts are independent each other, so one feature will work
even if we don't assume the other. I wonder it would be better to split the patch
again, and register them to commitfest separately.

(3)
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_on

To imply that this parameter is involving to add_statistics_threshold, it seems
better for me to use more related name like add_statistics_suggest_only.

Also, additional documentations for new parameters are required.

(4)
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
(snip)
+			table_close(stat_rel, AccessExclusiveLock);
+		}

When I tested the auto_explain part, I got the following WARNING.

WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 2)
WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 1)
WARNING: relcache reference leak: relation "pg_statistic_ext" not closed
WARNING: TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still referenced
WARNING: Snapshot reference leak: Snapshot 0x55c332c10418 still referenced

To suppress this, I think we need table_endscan(scan) and
ExecDropSingleTupleTableSlot(slot) before finishing this function.

(6)
+ elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);

We should use ereport instead of elog for log messages.

(7)
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							s1 *= dep + (1 - dep) * s2;
+							continue;
+						}

I found the following comment of clauselist_apply_dependencies():

* we actually combine selectivities using the formula
*
* P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

so, is it not necessary using the same formula in this patch? That is,

s1 *= dep + (1-dep) * s2 (if s1 <= s2)
s1 *= dep * (s2/s1) + (1-dep) * s2 (otherwise) .

(8)
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{

The comment mentions join_vars but the actual argument name is clauses_vars,
so it needs unification.

(9)
Currently, it only consider functional dependencies statistics. Can we also
consider multivariate MCV list, and is it useful?

(10)
To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
to use auto_explain for getting feedback from actual results. So, could auto_explain
be a infrastructure of AQO in future? Or, do you have any plan or idea to make
built-in infrastructure for AQO?

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#14

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 5 years ago

In reply to: Yugo NAGATA (#13)

2 attachment(s)

Re: Columns correlation and adaptive query optimization

Hello,

Thank you for review.
My answers are inside.

On 21.01.2021 15:30, Yugo NAGATA wrote:

Hello,

On Thu, 26 Mar 2020 18:49:51 +0300
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Attached please find new version of the patch with more comments and
descriptions added.

Adaptive query optimization is very interesting feature for me, so I looked
into this patch. Here are some comments and questions.

(1)
This patch needs rebase because clauselist_selectivity was modified to improve
estimation of OR clauses.

Rebased version is attached.

(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic automatically
if an error on estimated rows number is large.

2. Improve rows number estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one clauses,
and also functional dependencies between a var in the join qual and vars in quals
of the inner/outer relation.

As you said, these two parts are independent each other, so one feature will work
even if we don't assume the other. I wonder it would be better to split the patch
again, and register them to commitfest separately.

I agree with you that this are two almost unrelated changes, although
without clausesel patch additional statistic can not improve query planning.
But I already have too many patches at commitfest.
May be it will be enough to spit this patch into two?

(3)
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_on
To imply that this parameter is involving to add_statistics_threshold, it seems
better for me to use more related name like add_statistics_suggest_only.

Also, additional documentations for new parameters are required.

Done.

(4)
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
(snip)
+			table_close(stat_rel, AccessExclusiveLock);
+		}
When I tested the auto_explain part, I got the following WARNING.

WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 2)
WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 1)
WARNING: relcache reference leak: relation "pg_statistic_ext" not closed
WARNING: TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still referenced
WARNING: Snapshot reference leak: Snapshot 0x55c332c10418 still referenced

To suppress this, I think we need table_endscan(scan) and
ExecDropSingleTupleTableSlot(slot) before finishing this function.

Thank you for noticing the problem, fixed.

(6)
+ elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);

We should use ereport instead of elog for log messages.

Changed.

(7)
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							s1 *= dep + (1 - dep) * s2;
+							continue;
+						}
I found the following comment of clauselist_apply_dependencies():

* we actually combine selectivities using the formula
*
* P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

so, is it not necessary using the same formula in this patch? That is,

s1 *= dep + (1-dep) * s2 (if s1 <= s2)
s1 *= dep * (s2/s1) + (1-dep) * s2 (otherwise) .

Makes sense.

(8)
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{

The comment mentions join_vars but the actual argument name is clauses_vars,
so it needs unification.

Fixed.

(9)
Currently, it only consider functional dependencies statistics. Can we also
consider multivariate MCV list, and is it useful?

Right now auto_explain create statistic without explicit specification
of statistic kind.
According to the documentation all supported statistics kinds should be
created in this case:

/|statistics_kind|/

A statistics kind to be computed in this statistics object.
Currently supported kinds are |ndistinct|, which enables n-distinct
statistics, and |dependencies|, which enables functional dependency
statistics. If this clause is omitted, all supported statistics
kinds are included in the statistics object. For more information,
see Section 14.2.2
<https://www.postgresql.org/docs/10/planner-stats.html#PLANNER-STATS-EXTENDED>
and Section 68.2
<https://www.postgresql.org/docs/10/multivariate-statistics-examples.html>.

(10)
To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
to use auto_explain for getting feedback from actual results. So, could auto_explain
be a infrastructure of AQO in future? Or, do you have any plan or idea to make
built-in infrastructure for AQO?

Sorry, I do not have answer for this question.
I just patched auto_explain extension because it is doing half of the
required work (analyze expensive statements).
It can be certainly moved to separate extension. In this case it will
party duplicate existed functionality and
settings of auto_explain (like statement execution time threshold). I am
not sure that it is good.
But from the other side, this my patch makes auto_explain extension to
do some unexpected work...

Actually task of adaptive query optimization is much bigger.
We have separate AQO extension which tries to use machine learning to
correctly adjust estimations.
This my patch is much simpler and use existed mechanism (extended
statistics) to improve estimations.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

auto_explain_create_statistic-7.patchtext/x-patch; name=auto_explain_create_statistic-7.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index faa6231..5d8f088 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -34,7 +54,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_add_statistics_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -85,6 +107,7 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
 
 /*
  * Module load callback
@@ -230,6 +253,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.add_statistics_suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_add_statistics_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -363,6 +410,282 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+/**
+ * Try to add multicolumn statistics for specified subplans.
+ */
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for plan subnodes.
+ */
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									   ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+/**
+ * Comparator used to sort Vars by name
+ */
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+/**
+ * Try to add multicolumn statistics for qual
+ */
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	/* Extract vars from all quals */
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+
+	/* Loop until we considered all vars */
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index relno = 0;
+		Bitmapset* colmap = NULL;
+
+		/* Contruct list of unique vars */
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				int varno = IS_SPECIAL_VARNO(var->varno) ? var->varnosyn : var->varno;
+				if (cols == NULL || varno == relno)
+				{
+					relno = varno;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 && /* not synthetic var */
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		/* To create multicolumn statitics we need to have at least 2 columns */
+		if (list_length(cols) >= 2)
+		{
+			RangeTblEntry *rte = rt_fetch(relno, es->rtable);
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+			ScanKeyData entry[2];
+			TableScanDesc scan;
+			Relation stat_rel;
+			size_t name_len;
+			TupleTableSlot *slot;
+
+			/* Sort variables by name */
+			list_sort(cols, vars_list_comparator);
+
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+
+			name_len = strlen(stat_name);
+			/* Truncate name if it doesn't fit in NameData */
+			if (name_len >= NAMEDATALEN)
+				stat_name = psprintf("%.*s_%08x", NAMEDATALEN - 10, stat_name, (unsigned)hash_any((uint8*)stat_name, name_len));
+
+			ScanKeyInit(&entry[0],
+						Anum_pg_statistic_ext_stxname,
+						BTEqualStrategyNumber, F_NAMEEQ,
+						CStringGetDatum(stat_name));
+			ScanKeyInit(&entry[1],
+						Anum_pg_statistic_ext_stxnamespace,
+						BTEqualStrategyNumber, F_OIDEQ,
+						ObjectIdGetDatum(get_rel_namespace(rte->relid)));
+
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
+
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!table_scan_getnextslot(scan, ForwardScanDirection, slot))
+			{
+				if (auto_explain_add_statistics_suggest_only)
+				{
+					ereport(NOTICE, (errmsg("Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s",
+											stat_name, create_stat_stmt, rel_name),
+									 errhidestmt(true)));
+				}
+				else
+				{
+					ereport(LOG, (errmsg("Add statistics %s", stat_name),
+								  errhidestmt(true)));
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+			table_endscan(scan);
+			ExecDropSingleTupleTableSlot(slot);
+			table_close(stat_rel, AccessExclusiveLock);
+		}
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for node
+ */
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -403,6 +726,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index 30e35a7..559f7be 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -280,6 +280,41 @@ LOAD 'auto_explain';
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
+       actual/estimated #rows ratio triggering creation of multicolumn statistic
+       for the related columns. It can be used for adpative query optimization.
+       If there is large gap between real and estimated number of tuples for the
+       concrete plan node, then multicolumn statistic is created for involved
+       attributes. Zero value (default) disables implicit creation of multicolumn statistic.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_suggest_only</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_suggest_only</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_suggest_only</varname> disables creation
+       of multicolumn statistic even if <varname>auto_explain.add_statistics_threshold</varname>
+       contains non zero value.Only log message will be reported in case of bad estimation.
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>

extended_statistic_join_selectivity-7.patchtext/x-patch; name=extended_statistic_join_selectivity-7.patchDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf..5446ad5 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,8 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
+#include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 /*
  * Data structure for accumulating info about possible range-query
@@ -56,6 +58,47 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
  ****************************************************************************/
 
 /*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid:   index of relation to which all considered attributes belong
+ * var:     variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+	RelOptInfo* rel = find_base_rel(root, relid);
+	if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
+														&attnums, 1);
+		if (stat != NULL)
+		{
+			MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+			MVDependency *strongest = NULL;
+			int i;
+			for (i = 0; i < dependencies->ndeps; i++)
+			{
+				MVDependency *dependency = dependencies->deps[i];
+				int n_dep_vars = dependency->nattributes - 1;
+				/* Dependency implies attribute */
+				if (var->varattno == dependency->attributes[n_dep_vars])
+				{
+					while (--n_dep_vars >= 0
+						   && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+					if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+						strongest = dependency;
+				}
+			}
+			if (strongest)
+				return strongest->degree;
+		}
+	}
+	return 0.0;
+}
+
+/*
  * clauselist_selectivity -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
  *	  expression clauses.  The list can be empty, in which case 1.0
@@ -129,6 +172,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause, just go directly to
@@ -139,6 +185,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Determine if these clauses reference a single relation.  If so, and if
 	 * it has extended statistics, try to apply those.
@@ -171,7 +220,6 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -204,6 +252,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -215,6 +264,42 @@ clauselist_selectivity_ext(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varno != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varattno >= 0 && var->varno == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							/*
+							 * Replace s2 with the conditional probability s2 given s1, computed
+							 * using the formula P(b|a) = P(a,b) / P(a), which simplifies to
+							 *
+							 * P(b|a) = f * Min(P(a), P(b)) / P(a) + (1-f) * P(b)
+							 *
+							 * where P(a) = s1, the selectivity of the implying attributes, and
+							 * P(b) = s2, the selectivity of the implied attribute.
+							 */
+							if (s1 <= s2)
+								s1 *= dep + (1 - dep) * s2;
+							else
+								s1 *= dep * s2 / s1 + (1 - dep) * s2;
+							continue;
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -240,7 +325,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index aab06c7..74ceaf5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4790,6 +4790,30 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
 	return nrows;
 }
 
+
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * clause_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{
+	ListCell* lc;
+	Bitmapset *attnums = NULL;
+	Index relid = var->varno;
+
+	foreach (lc, clause_vars)
+	{
+		Var* join_var = (Var*)lfirst(lc);
+		if (join_var->varno == relid && join_var->varattno >= 0)
+			attnums = bms_add_member(attnums, join_var->varattno);
+	}
+	return attnums ? find_var_dependency(root, relid, var, bms_add_member(attnums, var->varattno)) : 0.0;
+}
+
 /*
  * calc_joinrel_size_estimate
  *		Workhorse for set_joinrel_size_estimates and
@@ -4886,6 +4910,39 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 		pselec = 0.0;			/* not used, keep compiler quiet */
 	}
 
+	/* Try to take in account functional dependencies between attributes of clauses pushed-down to joined relations and
+	 * retstrictlist clause. Right now we consider only case of restrictlist consists of one clause.
+	 */
+	if (list_length(restrictlist) == 1)
+	{
+		RestrictInfo* rinfo = linitial(restrictlist);
+		Expr* clause = rinfo->clause;
+
+		Assert(IsA(rinfo, RestrictInfo));
+
+		if (is_opclause(clause))
+		{
+			OpExpr *expr = (OpExpr *) clause;
+			ListCell* lc;
+			List* join_vars = NULL;
+
+			/* Get list of all attributes in pushed-down clauses */
+			foreach (lc, outer_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+			foreach (lc, inner_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+
+			foreach (lc, expr->args)
+			{
+				Var *var = (Var*) lfirst(lc);
+				if (IsA(var, Var) && var->varattno >= 0)
+				{
+					double dep = var_depends_on(root, var, join_vars);
+					jselec = jselec*(1.0 - dep) + dep;
+				}
+			}
+		}
+	}
 	/*
 	 * Basically, we multiply size of Cartesian product by selectivity.
 	 *
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 0673887..f7d91e7 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -53,4 +53,6 @@ extern void CommuteOpExpr(OpExpr *clause);
 extern Query *inline_set_returning_function(PlannerInfo *root,
 											RangeTblEntry *rte);
 
+extern double find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums);
+
 #endif							/* CLAUSES_H */

#15

Yugo NAGATA

nagata@sraoss.co.jp

almost 5 years ago

In reply to: Konstantin Knizhnik (#14)

Re: Columns correlation and adaptive query optimization

On Mon, 25 Jan 2021 16:27:25 +0300
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Hello,

Thank you for review.
My answers are inside.

Thank you for updating the patch and answering my questions.

(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic automatically
if an error on estimated rows number is large.

2. Improve rows number estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one clauses,
and also functional dependencies between a var in the join qual and vars in quals
of the inner/outer relation.

As you said, these two parts are independent each other, so one feature will work
even if we don't assume the other. I wonder it would be better to split the patch
again, and register them to commitfest separately.

I agree with you that this are two almost unrelated changes, although
without clausesel patch additional statistic can not improve query planning.

I think extended statistics created by the auto_explain patch can improve query
planning even without the clausesel patch. For example, suppose the following case:

postgres=# create table t ( i int, j int);
CREATE TABLE
postgres=# insert into t select i/10, i/100 from generate_series(1,1000000) i;
INSERT 0 1000000
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1)
Filter: ((i = 100) AND (j = 10))
Rows Removed by Filter: 999990
Planning Time: 0.199 ms
Execution Time: 97.327 ms
(5 rows)

After applying the auto_explain patch (without clausesel patch) and issuing the query,
additional statistics were created.

postgres=# select * from t where i = 100 and j = 10;
LOG: Add statistics t_i_j

Then, after analyze, the row estimation was improved.

postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
postgres=# explain analyze select * from t where i = 100 and j = 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1)
Filter: ((i = 100) AND (j = 10))
Rows Removed by Filter: 999990
Planning Time: 0.124 ms
Execution Time: 95.383 ms
(5 rows)

So, I think the auto_explain patch is useful with just that as a tool
to detect a gap between estimate and real and adjust the plan. Also,
the clausesel patch would be useful without the auto_explain patch
if an appropriate statistics are created.

But I already have too many patches at commitfest.
May be it will be enough to spit this patch into two?

Although we can discuss both of these patches in this thread,
I wonder we don't have to commit them together.

(3)
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_on
To imply that this parameter is involving to add_statistics_threshold, it seems
better for me to use more related name like add_statistics_suggest_only.

Also, additional documentations for new parameters are required.
Done.

+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
+       actual/estimated #rows ratio triggering creation of multicolumn statistic
+       for the related columns. It can be used for adpative query optimization.
+       If there is large gap between real and estimated number of tuples for the
+       concrete plan node, then multicolumn statistic is created for involved
+       attributes. Zero value (default) disables implicit creation of multicolumn statistic.
+     </para>
+    </listitem>

I wonder we need to say that this parameter has no effect unless log_analyze
is enabled and that statistics are created only when the excution time exceeds
log_min_duration, if these behaviors are intentional.

In addition, additional statistics are created only if #rows is over-estimated
and not if it is under-estimated. Although it seems good as a criterion for creating
multicolumn statistic since extended statisstic is usually useful to fix over-estimation,
I am not sure if we don't have to consider under-estimation case at all.

(9)
Currently, it only consider functional dependencies statistics. Can we also
consider multivariate MCV list, and is it useful?

Right now auto_explain create statistic without explicit specification
of statistic kind.
According to the documentation all supported statistics kinds should be
created in this case:

Yes, auto_explain creates all kinds of extended statistics. However,
IIUC, the clausesel patch uses only functional dependencies statistics for
improving join, so my question was about possibility to consider MCV in the
clausesel patch.

(10)
To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
to use auto_explain for getting feedback from actual results. So, could auto_explain
be a infrastructure of AQO in future? Or, do you have any plan or idea to make
built-in infrastructure for AQO?

Sorry, I do not have answer for this question.
I just patched auto_explain extension because it is doing half of the
required work (analyze expensive statements).
It can be certainly moved to separate extension. In this case it will
party duplicate existed functionality and
settings of auto_explain (like statement execution time threshold). I am
not sure that it is good.
But from the other side, this my patch makes auto_explain extension to
do some unexpected work...

I think that auto_explain is an extension originally for aiming to detect
and log plans that take a long time, so it doesn't seem so unnatural for
me to use this for improving such plans. Especially, the feature to find
tunable points in executed plans seems useful.

Actually task of adaptive query optimization is much bigger.
We have separate AQO extension which tries to use machine learning to
correctly adjust estimations.
This my patch is much simpler and use existed mechanism (extended
statistics) to improve estimations.

Well, this patch provide a kind of AQO as auto_explain feature, but this
is independent of the AQO extension. Is it right?
Anyway, I'm interested in the AQO extension, so I'll look into this, too.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#16

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 5 years ago

In reply to: Yugo NAGATA (#15)

1 attachment(s)

Re: Columns correlation and adaptive query optimization

On 27.01.2021 8:45, Yugo NAGATA wrote:

On Mon, 25 Jan 2021 16:27:25 +0300
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Hello,

Thank you for review.
My answers are inside.

Thank you for updating the patch and answering my questions.

(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic automatically
if an error on estimated rows number is large.

2. Improve rows number estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one clauses,
and also functional dependencies between a var in the join qual and vars in quals
of the inner/outer relation.

As you said, these two parts are independent each other, so one feature will work
even if we don't assume the other. I wonder it would be better to split the patch
again, and register them to commitfest separately.

I agree with you that this are two almost unrelated changes, although
without clausesel patch additional statistic can not improve query planning.

I think extended statistics created by the auto_explain patch can improve query
planning even without the clausesel patch. For example, suppose the following case:

postgres=# create table t ( i int, j int);
CREATE TABLE
postgres=# insert into t select i/10, i/100 from generate_series(1,1000000) i;
INSERT 0 1000000
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1)
Filter: ((i = 100) AND (j = 10))
Rows Removed by Filter: 999990
Planning Time: 0.199 ms
Execution Time: 97.327 ms
(5 rows)

After applying the auto_explain patch (without clausesel patch) and issuing the query,
additional statistics were created.

postgres=# select * from t where i = 100 and j = 10;
LOG: Add statistics t_i_j

Then, after analyze, the row estimation was improved.

postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
postgres=# explain analyze select * from t where i = 100 and j = 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1)
Filter: ((i = 100) AND (j = 10))
Rows Removed by Filter: 999990
Planning Time: 0.124 ms
Execution Time: 95.383 ms
(5 rows)

So, I think the auto_explain patch is useful with just that as a tool
to detect a gap between estimate and real and adjust the plan. Also,
the clausesel patch would be useful without the auto_explain patch
if an appropriate statistics are created.

But I already have too many patches at commitfest.
May be it will be enough to spit this patch into two?

Although we can discuss both of these patches in this thread,
I wonder we don't have to commit them together.
(3)
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_on
To imply that this parameter is involving to add_statistics_threshold, it seems
better for me to use more related name like add_statistics_suggest_only.

Also, additional documentations for new parameters are required.
Done.
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
+       actual/estimated #rows ratio triggering creation of multicolumn statistic
+       for the related columns. It can be used for adpative query optimization.
+       If there is large gap between real and estimated number of tuples for the
+       concrete plan node, then multicolumn statistic is created for involved
+       attributes. Zero value (default) disables implicit creation of multicolumn statistic.
+     </para>
+    </listitem>
I wonder we need to say that this parameter has no effect unless log_analyze
is enabled and that statistics are created only when the excution time exceeds
log_min_duration, if these behaviors are intentional.

In addition, additional statistics are created only if #rows is over-estimated
and not if it is under-estimated. Although it seems good as a criterion for creating
multicolumn statistic since extended statisstic is usually useful to fix over-estimation,
I am not sure if we don't have to consider under-estimation case at all.

(9)
Currently, it only consider functional dependencies statistics. Can we also
consider multivariate MCV list, and is it useful?

Right now auto_explain create statistic without explicit specification
of statistic kind.
According to the documentation all supported statistics kinds should be
created in this case:

Yes, auto_explain creates all kinds of extended statistics. However,
IIUC, the clausesel patch uses only functional dependencies statistics for
improving join, so my question was about possibility to consider MCV in the
clausesel patch.

(10)
To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
to use auto_explain for getting feedback from actual results. So, could auto_explain
be a infrastructure of AQO in future? Or, do you have any plan or idea to make
built-in infrastructure for AQO?

Sorry, I do not have answer for this question.
I just patched auto_explain extension because it is doing half of the
required work (analyze expensive statements).
It can be certainly moved to separate extension. In this case it will
party duplicate existed functionality and
settings of auto_explain (like statement execution time threshold). I am
not sure that it is good.
But from the other side, this my patch makes auto_explain extension to
do some unexpected work...

I think that auto_explain is an extension originally for aiming to detect
and log plans that take a long time, so it doesn't seem so unnatural for
me to use this for improving such plans. Especially, the feature to find
tunable points in executed plans seems useful.

Actually task of adaptive query optimization is much bigger.
We have separate AQO extension which tries to use machine learning to
correctly adjust estimations.
This my patch is much simpler and use existed mechanism (extended
statistics) to improve estimations.

Well, this patch provide a kind of AQO as auto_explain feature, but this
is independent of the AQO extension. Is it right?
Anyway, I'm interested in the AQO extension, so I'll look into this, too.

Regards,
Yugo Nagata

I have updated documentation as you suggested and submit patch for
auto_explain extension for the next commitfest.
I will create separate thread for improving join selectivity estimation
using extended statistics.

Well, this patch provide a kind of AQO as auto_explain feature, but this
is independent of the AQO extension. Is it right?

Yes. The basic idea is the same, but approaches are different.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

auto_explain_create_statistic-8.patchtext/x-patch; name=auto_explain_create_statistic-8.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index faa6231..5d8f088 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -34,7 +54,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_add_statistics_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -85,6 +107,7 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
 
 /*
  * Module load callback
@@ -230,6 +253,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.add_statistics_suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_add_statistics_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -363,6 +410,282 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+/**
+ * Try to add multicolumn statistics for specified subplans.
+ */
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for plan subnodes.
+ */
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									   ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+/**
+ * Comparator used to sort Vars by name
+ */
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+/**
+ * Try to add multicolumn statistics for qual
+ */
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	/* Extract vars from all quals */
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+
+	/* Loop until we considered all vars */
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index relno = 0;
+		Bitmapset* colmap = NULL;
+
+		/* Contruct list of unique vars */
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				int varno = IS_SPECIAL_VARNO(var->varno) ? var->varnosyn : var->varno;
+				if (cols == NULL || varno == relno)
+				{
+					relno = varno;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 && /* not synthetic var */
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		/* To create multicolumn statitics we need to have at least 2 columns */
+		if (list_length(cols) >= 2)
+		{
+			RangeTblEntry *rte = rt_fetch(relno, es->rtable);
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+			ScanKeyData entry[2];
+			TableScanDesc scan;
+			Relation stat_rel;
+			size_t name_len;
+			TupleTableSlot *slot;
+
+			/* Sort variables by name */
+			list_sort(cols, vars_list_comparator);
+
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+
+			name_len = strlen(stat_name);
+			/* Truncate name if it doesn't fit in NameData */
+			if (name_len >= NAMEDATALEN)
+				stat_name = psprintf("%.*s_%08x", NAMEDATALEN - 10, stat_name, (unsigned)hash_any((uint8*)stat_name, name_len));
+
+			ScanKeyInit(&entry[0],
+						Anum_pg_statistic_ext_stxname,
+						BTEqualStrategyNumber, F_NAMEEQ,
+						CStringGetDatum(stat_name));
+			ScanKeyInit(&entry[1],
+						Anum_pg_statistic_ext_stxnamespace,
+						BTEqualStrategyNumber, F_OIDEQ,
+						ObjectIdGetDatum(get_rel_namespace(rte->relid)));
+
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
+
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!table_scan_getnextslot(scan, ForwardScanDirection, slot))
+			{
+				if (auto_explain_add_statistics_suggest_only)
+				{
+					ereport(NOTICE, (errmsg("Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s",
+											stat_name, create_stat_stmt, rel_name),
+									 errhidestmt(true)));
+				}
+				else
+				{
+					ereport(LOG, (errmsg("Add statistics %s", stat_name),
+								  errhidestmt(true)));
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+			table_endscan(scan);
+			ExecDropSingleTupleTableSlot(slot);
+			table_close(stat_rel, AccessExclusiveLock);
+		}
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for node
+ */
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -403,6 +726,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index 30e35a7..29dfd7a 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -280,6 +280,45 @@ LOAD 'auto_explain';
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
+       actual/estimated #rows ratio triggering creation of multicolumn statistic
+       for the related columns. It can be used for adpative query optimization.
+       If there is large gap between real and estimated number of tuples for the
+       concrete plan node, then multicolumn statistic is created for involved
+       attributes. Zero value (default) disables implicit creation of multicolumn statistic.
+       Please notice, this parameter has no effect unless <varname>auto_explain.log_analyze</varname>
+       is enabled and that statistics are created only when the excution time exceeds
+       <varname>auto_explain.log_min_duration</varname>. Also additional statistics are created only if
+       number of rows is over-estimated and not if it is under-estimated.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_suggest_only</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_suggest_only</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_suggest_only</varname> disables creation
+       of multicolumn statistic even if <varname>auto_explain.add_statistics_threshold</varname>
+       contains non zero value.Only log message will be reported in case of bad estimation.
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>

#17

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Konstantin Knizhnik (#16)

Re: Columns correlation and adaptive query optimization

Hello Konstantin,

Sorry for not responding to this thread earlier. I definitely agree the
features proposed here are very interesting and useful, and I appreciate
you kept rebasing the patch.

I think the patch improving join estimates can be treated as separate,
and I see it already has a separate CF entry - it however still points
to this thread, which will be confusing. I suggest we start a different
thread for it, to keep the discussions separate.

I'll focus on the auto_explain part here.

I did have some ideas about adaptive query optimization too, although
maybe in a slightly different form. My plan was to collect information
about estimated / actual cardinalities, and then use this knowledge to
directly tweak the estimates. Directly, without creating extended stats,
but treat the collected info about estimates / row counts as a kind of
ad hoc statistics. (Not sure if this is what the AQE extension does.)

What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

A couple additional points:

1) I think we should create a new extension for this.

auto_explain has a fairly well defined purpose, I don't think this is
consistent with it. It's quite likely it'll require stuff like shared
memory, etc. which auto_explain does not (and should not) need.

Let's call it statistics_advisor, or something like that. It will use
about the same planner/executor callbacks as auto_explain, but that's
fine I think.

2) I'm not sure creating statistics automatically based on a single
query execution is a good idea. I think we'll need to collect data from
multiple runs (in shared memory), and do suggestions based on that.

3) I wonder if it should also consider duration of the query (who cares
about estimates if it still executed in 10ms)? Similarly, it probably
should require some minimal number of rows (1 vs. 10 rows is likely
different from 1M vs. 10M rows, but both is 10x difference).

4) Ideally it'd evaluate impact of the improved estimates on the whole
query plan (you may fix one node, but the cost difference for the whole
query may be negligible). But that seems very hard/expensive :-(

5) I think AddMultiColumnStatisticsForQual() needs refactoring - it
mixes stuff at many different levels of abstraction (generating names,
deciding which statistics to build, ...). I think it'll also need some
improvements to better identify which Vars to consider for statistics,
and once we get support for statistics on expressions committed (which
seems to be fairly close now) also to handle expressions.

BTW Why is "qual" in

static void
AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

5) I'm not sure about automatically creating the stats. I can't imagine
anyone actually enabling that on production, TBH (I myself probably
would not do that). I suggest we instead provide an easy way to show
which statistics are suggested.

For one execution that might be integrated into EXPLAIN ANALYZE, I guess
(through some callback, which seems fairly easy to do).

For many executions (you can leave it running for a coupel days, then
see what is the suggestion based on X runs) we could have a view or
something. This would also work for read-only replicas, where just
creating the statistics is impossible.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Tomas Vondra (#17)

Re: Columns correlation and adaptive query optimization

On 3/10/21 3:00 AM, Tomas Vondra wrote:

Hello Konstantin,

Sorry for not responding to this thread earlier. I definitely agree the
features proposed here are very interesting and useful, and I appreciate
you kept rebasing the patch.

I think the patch improving join estimates can be treated as separate,
and I see it already has a separate CF entry - it however still points
to this thread, which will be confusing. I suggest we start a different
thread for it, to keep the discussions separate.

D'oh! I must have been confused yesterday, because now I see there
already is a separate thread [1]/messages/by-id/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru for the join selectivity patch. So you
can ignore this.

regards

[1]: /messages/by-id/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru
/messages/by-id/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19

Yugo NAGATA

nagata@sraoss.co.jp

almost 5 years ago

In reply to: Tomas Vondra (#17)

Re: Columns correlation and adaptive query optimization

On Wed, 10 Mar 2021 03:00:25 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

I am also agree with the idea to implement this feature as a new
extension for statistics advisor.

BTW Why is "qual" in

static void
AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

When I tested this extension using TPC-H queries, it raised segmentation
fault in this function. I think the cause would be around this argument.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#20

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 5 years ago

In reply to: Yugo NAGATA (#19)

1 attachment(s)

Re: Columns correlation and adaptive query optimization

On 19.03.2021 12:17, Yugo NAGATA wrote:

On Wed, 10 Mar 2021 03:00:25 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

I am also agree with the idea to implement this feature as a new
extension for statistics advisor.

BTW Why is "qual" in

static void
AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

When I tested this extension using TPC-H queries, it raised segmentation
fault in this function. I think the cause would be around this argument.

Regards,
Yugo Nagata

Attached please find new version of the patch with
AddMultiColumnStatisticsForQual parameter type fix and one more fix
related with handling synthetic attributes.
I can not reproduce the crash on TPC-H queries, so if the problem
persists, can you please send me stack trace and may be some other
information helping to understand the reason of SIGSEGV?

Thanks in advance,
Konstantin

Attachments:

auto_explain_create_statistic-9.patchtext/x-patch; charset=UTF-8; name=auto_explain_create_statistic-9.patchDownload

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index faa6231d87..3fefc47e0b 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -34,7 +54,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_add_statistics_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -85,6 +107,7 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
 
 /*
  * Module load callback
@@ -230,6 +253,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.add_statistics_suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_add_statistics_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -363,6 +410,283 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+/**
+ * Try to add multicolumn statistics for specified subplans.
+ */
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for plan subnodes.
+ */
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									   ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+/**
+ * Comparator used to sort Vars by name
+ */
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+/**
+ * Try to add multicolumn statistics for qual
+ */
+static void
+AddMultiColumnStatisticsForQual(List* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	/* Extract vars from all quals */
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+
+	/* Loop until we considered all vars */
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index relno = 0;
+		Bitmapset* colmap = NULL;
+
+		/* Contruct list of unique vars */
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				int varno = IS_SPECIAL_VARNO(var->varno) ? var->varnosyn : var->varno;
+				if (cols == NULL || varno == relno)
+				{
+					int varattno = IS_SPECIAL_VARNO(var->varno) ? var->varattnosyn : var->varattno;
+					relno = varno;
+					if (var->varattno > 0 &&
+						!bms_is_member(varattno, colmap) &&
+						varno >= 1 && /* not synthetic var */
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		/* To create multicolumn statitics we need to have at least 2 columns */
+		if (list_length(cols) >= 2)
+		{
+			RangeTblEntry *rte = rt_fetch(relno, es->rtable);
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+			ScanKeyData entry[2];
+			TableScanDesc scan;
+			Relation stat_rel;
+			size_t name_len;
+			TupleTableSlot *slot;
+
+			/* Sort variables by name */
+			list_sort(cols, vars_list_comparator);
+
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+
+			name_len = strlen(stat_name);
+			/* Truncate name if it doesn't fit in NameData */
+			if (name_len >= NAMEDATALEN)
+				stat_name = psprintf("%.*s_%08x", NAMEDATALEN - 10, stat_name, (unsigned)hash_any((uint8*)stat_name, name_len));
+
+			ScanKeyInit(&entry[0],
+						Anum_pg_statistic_ext_stxname,
+						BTEqualStrategyNumber, F_NAMEEQ,
+						CStringGetDatum(stat_name));
+			ScanKeyInit(&entry[1],
+						Anum_pg_statistic_ext_stxnamespace,
+						BTEqualStrategyNumber, F_OIDEQ,
+						ObjectIdGetDatum(get_rel_namespace(rte->relid)));
+
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
+
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!table_scan_getnextslot(scan, ForwardScanDirection, slot))
+			{
+				if (auto_explain_add_statistics_suggest_only)
+				{
+					ereport(NOTICE, (errmsg("Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s",
+											stat_name, create_stat_stmt, rel_name),
+									 errhidestmt(true)));
+				}
+				else
+				{
+					ereport(LOG, (errmsg("Add statistics %s", stat_name),
+								  errhidestmt(true)));
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+			table_endscan(scan);
+			ExecDropSingleTupleTableSlot(slot);
+			table_close(stat_rel, AccessExclusiveLock);
+		}
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for node
+ */
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -403,6 +727,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index 30e35a714a..29dfd7a8fb 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -280,6 +280,45 @@ LOAD 'auto_explain';
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
+       actual/estimated #rows ratio triggering creation of multicolumn statistic
+       for the related columns. It can be used for adpative query optimization.
+       If there is large gap between real and estimated number of tuples for the
+       concrete plan node, then multicolumn statistic is created for involved
+       attributes. Zero value (default) disables implicit creation of multicolumn statistic.
+       Please notice, this parameter has no effect unless <varname>auto_explain.log_analyze</varname>
+       is enabled and that statistics are created only when the excution time exceeds
+       <varname>auto_explain.log_min_duration</varname>. Also additional statistics are created only if
+       number of rows is over-estimated and not if it is under-estimated.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_suggest_only</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_suggest_only</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_suggest_only</varname> disables creation
+       of multicolumn statistic even if <varname>auto_explain.add_statistics_threshold</varname>
+       contains non zero value.Only log message will be reported in case of bad estimation.
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>

#21

Zhihong Yu

zyu@yugabyte.com

almost 5 years ago

In reply to: Konstantin Knizhnik (#20)

Re: Columns correlation and adaptive query optimization

Hi,
In AddMultiColumnStatisticsForQual(),

+   /* Loop until we considered all vars */
+   while (vars != NULL)
...
+       /* Contruct list of unique vars */
+       foreach (cell, vars)

What if some cell / node, gets into the else block:

+               else
+               {
+                   continue;

and being left in vars. Is there a chance for infinite loop ?
It seems there should be a bool variable indicating whether any cell gets
to the following:

+ vars = foreach_delete_current(vars, cell);

If no cell gets removed in the current iteration, the outer while loop
should exit.

Cheers

On Fri, Mar 19, 2021 at 9:58 AM Konstantin Knizhnik <
k.knizhnik@postgrespro.ru> wrote:

Show quoted text

On 19.03.2021 12:17, Yugo NAGATA wrote:

On Wed, 10 Mar 2021 03:00:25 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

I am also agree with the idea to implement this feature as a new
extension for statistics advisor.

BTW Why is "qual" in

static void
AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

When I tested this extension using TPC-H queries, it raised segmentation
fault in this function. I think the cause would be around this argument.

Regards,
Yugo Nagata

Attached please find new version of the patch with
AddMultiColumnStatisticsForQual parameter type fix and one more fix
related with handling synthetic attributes.
I can not reproduce the crash on TPC-H queries, so if the problem
persists, can you please send me stack trace and may be some other
information helping to understand the reason of SIGSEGV?

Thanks in advance,
Konstantin

#22

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 5 years ago

In reply to: Zhihong Yu (#21)

Re: Columns correlation and adaptive query optimization

On 19.03.2021 20:32, Zhihong Yu wrote:

Hi,
In AddMultiColumnStatisticsForQual(),
+   /* Loop until we considered all vars */
+   while (vars != NULL)
...
+       /* Contruct list of unique vars */
+       foreach (cell, vars)
What if some cell / node, gets into the else block:
+               else
+               {
+                   continue;
and being left in vars. Is there a chance for infinite loop ?
It seems there should be a bool variable indicating whether any cell
gets to the following:

+ vars = foreach_delete_current(vars, cell);

If no cell gets removed in the current iteration, the outer while loop
should exit.

Each iteration of outer loop (while (vars != NULL))
process variables belonging to one relation.
We take "else continue" branch only if variable belongs to some other
relation.
At first iteration of foreach (cell, vars)
variable "cols" is NULL and we always take first branch of the if.
In other words, at each iteration of outer loop we always make some
progress in processing "vars" list and remove some elements
from this list. So infinite loop can never happen.

Show quoted text

Cheers

On Fri, Mar 19, 2021 at 9:58 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru <mailto:k.knizhnik@postgrespro.ru>> wrote:

On 19.03.2021 12:17, Yugo NAGATA wrote:

On Wed, 10 Mar 2021 03:00:25 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com

<mailto:tomas.vondra@enterprisedb.com>> wrote:

What is being proposed here - an extension suggesting which

statistics

to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query

optimization". I

think "adaptive" means the extension directly modifies the

estimates

based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

I am also agree with the idea to implement this feature as a new
extension for statistics advisor.

BTW Why is "qual" in

static void
AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

When I tested this extension using TPC-H queries, it raised

segmentation

fault in this function. I think the cause would be around this

argument.

Regards,
Yugo Nagata

Attached please find new version of the patch with
AddMultiColumnStatisticsForQual parameter type fix and one more fix
related with handling synthetic attributes.
I can not reproduce the crash on TPC-H queries, so if the problem
persists, can you please send me stack trace and may be some other
information helping to understand the reason of SIGSEGV?

Thanks in advance,
Konstantin

#23

Yugo NAGATA

nagata@sraoss.co.jp

almost 5 years ago

In reply to: Konstantin Knizhnik (#20)

Re: Columns correlation and adaptive query optimization

On Fri, 19 Mar 2021 19:58:27 +0300
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

On 19.03.2021 12:17, Yugo NAGATA wrote:

On Wed, 10 Mar 2021 03:00:25 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

I am also agree with the idea to implement this feature as a new
extension for statistics advisor.

BTW Why is "qual" in

static void
AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

When I tested this extension using TPC-H queries, it raised segmentation
fault in this function. I think the cause would be around this argument.

Regards,
Yugo Nagata

Attached please find new version of the patch with
AddMultiColumnStatisticsForQual parameter type fix and one more fix
related with handling synthetic attributes.
I can not reproduce the crash on TPC-H queries, so if the problem
persists, can you please send me stack trace and may be some other
information helping to understand the reason of SIGSEGV?

I also could not reproduce the segfault. I don't know why I observed it,
but it may be because I missed something when installing. Sorry for
annoying you.

Instead, I observed "ERROR: cache lookup failed for attribute 6 of
relation xxxx" in v8 patch, but this was fixed in v9 patch.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#24

Yugo NAGATA

nagata@sraoss.co.jp

almost 5 years ago

In reply to: Konstantin Knizhnik (#22)

Re: Columns correlation and adaptive query optimization

Hello Konstantin,

I tested this patch as a statistics advisor using TPC-H queries.
The used parameters are:

auto_explain.add_statistics_suggest_only = on
auto_explain.add_statistics_threshold = 0.1
auto_explain.log_analyze = on
auto_explain.log_min_duration = 0

auto_explain suggested to create a few extented statistics for some
queries, but I could not find performance improvement with the "join
selectivity estimation using extended statistics" patch. I think this
is because there are no such correlation in TPC-H dataset that the
extended statistics can help, though.

During this test, I came up with a additional comments.

1)
As found in TPC-H test, suggested extended statistics may not be useful
for improving performance. Therefore, to decide to adopt it or not, it
would be useful if we could get information about "why we require it" or
"why this is suggested" as DETAIL or HINT. For example, we may show a
line or snippet of EXPLAIN result as the reason of the suggestion.

2)
For Q21 of TPC-H, the following extended statistics were suggested.

NOTICE: Auto_explain suggestion: CREATE STATISTICS lineitem_l_commitdate_l_receiptdate ON l_commitdate, l_receiptdate FROM lineitem
NOTICE: Auto_explain suggestion: CREATE STATISTICS lineitem_l_commitdate_l_receiptdate_l_suppkey ON l_commitdate, l_receiptdate, l_suppkey FROM lineitem

The latter's target columns includes the former's, so I am not sure
we need both of them. (Which we should adopt may be up to on administrator,
though.)

3)
For Q22 of TPC-H, the following two same extended statistics were suggested.

NOTICE: Auto_explain suggestion: CREATE STATISTICS customer_c_acctbal_c_phone ON c_acctbal, c_phone FROM customer
NOTICE: Auto_explain suggestion: CREATE STATISTICS customer_c_acctbal_c_phone ON c_acctbal, c_phone FROM customer

So, when we set add_statistics_suggest_only to off, we get the following error:

ERROR: duplicate key value violates unique constraint "pg_statistic_ext_name_index"
DETAIL: Key (stxname, stxnamespace)=(customer_c_acctbal_c_phone, 2200) already exists.

Regards,
Yugo Nagata

On Sat, 20 Mar 2021 12:41:44 +0300
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

On 19.03.2021 20:32, Zhihong Yu wrote:
Hi,
In AddMultiColumnStatisticsForQual(),
+   /* Loop until we considered all vars */
+   while (vars != NULL)
...
+       /* Contruct list of unique vars */
+       foreach (cell, vars)
What if some cell / node, gets into the else block:
+               else
+               {
+                   continue;
and being left in vars. Is there a chance for infinite loop ?
It seems there should be a bool variable indicating whether any cell
gets to the following:

+ vars = foreach_delete_current(vars, cell);

If no cell gets removed in the current iteration, the outer while loop
should exit.
Each iteration of outer loop (while (vars != NULL))
process variables belonging to one relation.
We take "else continue" branch only if variable belongs to some other
relation.
At first iteration of foreach (cell, vars)
variable "cols" is NULL and we always take first branch of the if.
In other words, at each iteration of outer loop we always make some
progress in processing "vars" list and remove some elements
from this list. So infinite loop can never happen.

Cheers

On Fri, Mar 19, 2021 at 9:58 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru <mailto:k.knizhnik@postgrespro.ru>> wrote:

On 19.03.2021 12:17, Yugo NAGATA wrote:

On Wed, 10 Mar 2021 03:00:25 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com

<mailto:tomas.vondra@enterprisedb.com>> wrote:

What is being proposed here - an extension suggesting which

statistics

to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query

optimization". I

think "adaptive" means the extension directly modifies the

estimates

based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

I am also agree with the idea to implement this feature as a new
extension for statistics advisor.

BTW Why is "qual" in

static void
AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

When I tested this extension using TPC-H queries, it raised

segmentation

fault in this function. I think the cause would be around this

argument.

Regards,
Yugo Nagata

Attached please find new version of the patch with
AddMultiColumnStatisticsForQual parameter type fix and one more fix
related with handling synthetic attributes.
I can not reproduce the crash on TPC-H queries, so if the problem
persists, can you please send me stack trace and may be some other
information helping to understand the reason of SIGSEGV?

Thanks in advance,
Konstantin

--
Yugo NAGATA <nagata@sraoss.co.jp>

#25

vignesh C

vignesh21@gmail.com

over 4 years ago

In reply to: Konstantin Knizhnik (#20)

Re: Columns correlation and adaptive query optimization

On Fri, Mar 19, 2021 at 10:28 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

On 19.03.2021 12:17, Yugo NAGATA wrote:

On Wed, 10 Mar 2021 03:00:25 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.

I am also agree with the idea to implement this feature as a new
extension for statistics advisor.

BTW Why is "qual" in

static void
AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

When I tested this extension using TPC-H queries, it raised segmentation
fault in this function. I think the cause would be around this argument.

Regards,
Yugo Nagata

Attached please find new version of the patch with
AddMultiColumnStatisticsForQual parameter type fix and one more fix
related with handling synthetic attributes.
I can not reproduce the crash on TPC-H queries, so if the problem
persists, can you please send me stack trace and may be some other
information helping to understand the reason of SIGSEGV?

"C:\projects\postgresql\pgsql.sln" (default target) (1) ->
"C:\projects\postgresql\auto_explain.vcxproj" (default target) (45) ->
(ClCompile target) ->
contrib/auto_explain/auto_explain.c(658): error C2039: 'mt_plans' : is
not a member of 'ModifyTableState'
[C:\projects\postgresql\auto_explain.vcxproj]
contrib/auto_explain/auto_explain.c(659): error C2039: 'mt_nplans' :
is not a member of 'ModifyTableState'
[C:\projects\postgresql\auto_explain.vcxproj]
contrib/auto_explain/auto_explain.c(660): error C2198:
'AddMultiColumnStatisticsForMemberNodes' : too few arguments for call
[C:\projects\postgresql\auto_explain.vcxproj]
2 Warning(s)
3 Error(s)

Also Yugo Nagata's comments need to be addressed, I'm changing the
status to "Waiting for Author".

Regards,
Vignesh

#26

Daniel Gustafsson

daniel@yesql.se

about 4 years ago

In reply to: vignesh C (#25)

Re: Columns correlation and adaptive query optimization

On 14 Jul 2021, at 13:13, vignesh C <vignesh21@gmail.com> wrote:

"C:\projects\postgresql\pgsql.sln" (default target) (1) ->
"C:\projects\postgresql\auto_explain.vcxproj" (default target) (45) ->
(ClCompile target) ->
contrib/auto_explain/auto_explain.c(658): error C2039: 'mt_plans' : is
not a member of 'ModifyTableState'
[C:\projects\postgresql\auto_explain.vcxproj]
contrib/auto_explain/auto_explain.c(659): error C2039: 'mt_nplans' :
is not a member of 'ModifyTableState'
[C:\projects\postgresql\auto_explain.vcxproj]
contrib/auto_explain/auto_explain.c(660): error C2198:
'AddMultiColumnStatisticsForMemberNodes' : too few arguments for call
[C:\projects\postgresql\auto_explain.vcxproj]
2 Warning(s)
3 Error(s)

Also Yugo Nagata's comments need to be addressed, I'm changing the
status to "Waiting for Author".

As this thread has stalled and the patch hasn't worked in the CI for quite some
time, I'm marking this Returned with Feedback. Feel free to open a new entry
for an updated patch.

--
Daniel Gustafsson https://vmware.com/