Additional improvements to extended statistics

Started by Tomas Vondraalmost 6 years ago40 messages

tomas.vondra@2ndquadrant.com

almost 6 years ago

3 attachment(s)

Hi,

Now that I've committed [1]https://commitfest.postgresql.org/26/2320/ which allows us to use multiple extended
statistics per table, I'd like to start a thread discussing a couple of
additional improvements for extended statistics. I've considered
starting a separate patch for each, but that would be messy as those
changes will touch roughly the same places. So I've organized it into a
single patch series, with the simpler parts at the beginning.

There are three main improvements:

1) improve estimates of OR clauses

Until now, OR clauses pretty much ignored extended statistics, based on
the experience that they're less vulnerable to misestimates. But it's a
bit weird that AND clauses are handled while OR clauses are not, so this
extends the logic to OR clauses.

Status: I think this is fairly OK.

2) support estimating clauses (Var op Var)

Currently, we only support clauses with a single Var, i.e. clauses like

- Var op Const
- Var IS [NOT] NULL
- [NOT] Var
- ...

and AND/OR clauses built from those simple ones. This patch adds support
for clauses of the form (Var op Var), of course assuming both Vars come
from the same relation.

Status: This works, but it feels a bit hackish. Needs more work.

3) support extended statistics on expressions

Currently we only allow simple references to columns in extended stats,
so we can do

CREATE STATISTICS s ON a, b, c FROM t;

but not

CREATE STATISTICS s ON (a+b), (c + 1) FROM t;

This patch aims to allow this. At the moment it's a WIP - it does most
of the catalog changes and stats building, but with some hacks/bugs. And
it does not even try to use those statistics during estimation.

The first question is how to extend the current pg_statistic_ext catalog
to support expressions. I've been planning to do it the way we support
expressions for indexes, i.e. have two catalog fields - one for keys,
one for expressions.

One difference is that for statistics we don't care about order of the
keys, so that we don't need to bother with storing 0 keys in place for
expressions - we can simply assume keys are first, then expressions.

And this is what the patch does now.

I'm however wondering whether to keep this split - why not to just treat
everything as expressions, and be done with it? A key just represents a
Var expression, after all. And it would massively simplify a lot of code
that now has to care about both keys and expressions.

Of course, expressions are a bit more expensive, but I wonder how
noticeable that would be.

Opinions?

ragards

[1]: https://commitfest.postgresql.org/26/2320/

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0002-Support-clauses-of-the-form-Var-op-Var-20200113.patchtext/plain; charset=us-asciiDownload

From 4f6d8f7e1cd16ec2c0c022479524497f271f821a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Mon, 11 Nov 2019 01:34:11 +0100
Subject: [PATCH 2/3] Support clauses of the form Var op Var

---
 src/backend/statistics/extended_stats.c       | 62 ++++++++++++++++++-
 src/backend/statistics/mcv.c                  | 61 ++++++++++++++++++
 .../statistics/extended_stats_internal.h      |  2 +
 3 files changed, 122 insertions(+), 3 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index ccf9565c75..d9e854228c 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -964,13 +964,15 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 		RangeTblEntry *rte = root->simple_rte_array[relid];
 		OpExpr	   *expr = (OpExpr *) clause;
 		Var		   *var;
+		Var		   *var2 = NULL;
 
 		/* Only expressions with two arguments are considered compatible. */
 		if (list_length(expr->args) != 2)
 			return false;
 
 		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_opclause_expression(expr, &var, NULL, NULL))
+		if ((!examine_opclause_expression(expr, &var, NULL, NULL)) &&
+			(!examine_opclause_expression2(expr, &var, &var2)))
 			return false;
 
 		/*
@@ -1010,8 +1012,16 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			!get_func_leakproof(get_opcode(expr->opno)))
 			return false;
 
-		return statext_is_compatible_clause_internal(root, (Node *) var,
-													 relid, attnums);
+		if (var2)
+		{
+			return statext_is_compatible_clause_internal(root, (Node *) var,
+														 relid, attnums) &&
+				   statext_is_compatible_clause_internal(root, (Node *) var2,
+														 relid, attnums);
+		}
+		else
+			return statext_is_compatible_clause_internal(root, (Node *) var,
+														 relid, attnums);
 	}
 
 	/* AND/OR/NOT clause */
@@ -1450,3 +1460,49 @@ examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonl
 
 	return true;
 }
+
+bool
+examine_opclause_expression2(OpExpr *expr, Var **varap, Var **varbp)
+{
+	Var	   *vara;
+	Var	   *varb;
+	Node   *leftop,
+		   *rightop;
+
+	/* enforced by statext_is_compatible_clause_internal */
+	Assert(list_length(expr->args) == 2);
+
+	leftop = linitial(expr->args);
+	rightop = lsecond(expr->args);
+
+	/* strip RelabelType from either side of the expression */
+	if (IsA(leftop, RelabelType))
+		leftop = (Node *) ((RelabelType *) leftop)->arg;
+
+	if (IsA(rightop, RelabelType))
+		rightop = (Node *) ((RelabelType *) rightop)->arg;
+
+	if (IsA(leftop, Var) && IsA(rightop, Var))
+	{
+		vara = (Var *) leftop;
+		varb = (Var *) rightop;
+	}
+	else
+		return false;
+
+	/*
+	 * Both variables have to be for the same relation (otherwise it's a
+	 * join clause, and we don't deal with those yet.
+	 */
+	if (vara->varno != varb->varno)
+		return false;
+
+	/* return pointers to the extracted parts if requested */
+	if (varap)
+		*varap = vara;
+
+	if (varbp)
+		*varbp = varb;
+
+	return true;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3f42713aa2..4b51af287e 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1581,6 +1581,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 			/* valid only after examine_opclause_expression returns true */
 			Var		   *var;
+			Var		   *var2;
 			Const	   *cst;
 			bool		varonleft;
 
@@ -1651,6 +1652,66 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					matches[i] = RESULT_MERGE(matches[i], is_or, match);
 				}
 			}
+			else if (examine_opclause_expression2(expr, &var, &var2))
+			{
+				int			idx;
+				int			idx2;
+
+				/* match the attribute to a dimension of the statistic */
+				idx = bms_member_index(keys, var->varattno);
+				idx2 = bms_member_index(keys, var2->varattno);
+
+				/*
+				 * Walk through the MCV items and evaluate the current clause.
+				 * We can skip items that were already ruled out, and
+				 * terminate if there are no remaining MCV items that might
+				 * possibly match.
+				 */
+				for (i = 0; i < mcvlist->nitems; i++)
+				{
+					bool		match = true;
+					MCVItem    *item = &mcvlist->items[i];
+
+					/*
+					 * When either of the MCV items is NULL we can treat this
+					 * as a mismatch. We must not call the operator because
+					 * of strictness.
+					 */
+					if (item->isnull[idx] || item->isnull[idx2])
+					{
+						matches[i] = RESULT_MERGE(matches[i], is_or, false);
+						continue;
+					}
+
+					/*
+					 * Skip MCV items that can't change result in the bitmap.
+					 * Once the value gets false for AND-lists, or true for
+					 * OR-lists, we don't need to look at more clauses.
+					 */
+					if (RESULT_IS_FINAL(matches[i], is_or))
+						continue;
+
+					/*
+					 * First check whether the constant is below the lower
+					 * boundary (in that case we can skip the bucket, because
+					 * there's no overlap).
+					 *
+					 * We don't store collations used to build the statistics,
+					 * but we can use the collation for the attribute itself,
+					 * as stored in varcollid. We do reset the statistics after
+					 * a type change (including collation change), so this is
+					 * OK. We may need to relax this after allowing extended
+					 * statistics on expressions.
+					 */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   var->varcollid,
+														   item->values[idx],
+														   item->values[idx2]));
+
+					/* update the match bitmap with the result */
+					matches[i] = RESULT_MERGE(matches[i], is_or, match);
+				}
+			}
 		}
 		else if (IsA(clause, NullTest))
 		{
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 5171895bba..23217497bb 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -98,6 +98,8 @@ extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
 
 extern bool examine_opclause_expression(OpExpr *expr, Var **varp,
 										Const **cstp, bool *varonleftp);
+extern bool examine_opclause_expression2(OpExpr *expr,
+										 Var **varap, Var **varbp);
 
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  StatisticExtInfo *stat,
-- 
2.21.0

0003-Support-for-extended-statistics-on-expressi-20200113.patchtext/plain; charset=us-asciiDownload

From 8e5f74c7c55b4e602ef5460a9fbd4cdf26e52f77 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Mon, 11 Nov 2019 14:01:21 +0100
Subject: [PATCH 3/3] Support for extended statistics on expressions

---
 src/backend/commands/statscmds.c              | 190 +++++--
 src/backend/nodes/copyfuncs.c                 |  14 +
 src/backend/nodes/equalfuncs.c                |  13 +
 src/backend/nodes/outfuncs.c                  |  12 +
 src/backend/optimizer/util/plancat.c          |  40 ++
 src/backend/parser/gram.y                     |  31 +-
 src/backend/parser/parse_agg.c                |  10 +
 src/backend/parser/parse_expr.c               |   6 +
 src/backend/parser/parse_func.c               |   3 +
 src/backend/parser/parse_utilcmd.c            |  89 ++-
 src/backend/statistics/dependencies.c         | 159 +++++-
 src/backend/statistics/extended_stats.c       | 532 +++++++++++++++++-
 src/backend/statistics/mcv.c                  |  17 +-
 src/backend/statistics/mvdistinct.c           |  51 +-
 src/backend/tcop/utility.c                    |  16 +-
 src/backend/utils/adt/ruleutils.c             |  59 ++
 src/backend/utils/adt/selfuncs.c              |  11 +
 src/bin/psql/describe.c                       |   1 +
 src/include/catalog/pg_statistic_ext.h        |   3 +
 src/include/nodes/nodes.h                     |   1 +
 src/include/nodes/parsenodes.h                |  16 +
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/parser/parse_node.h               |   1 +
 src/include/parser/parse_utilcmd.h            |   2 +
 .../statistics/extended_stats_internal.h      |  13 +-
 25 files changed, 1191 insertions(+), 100 deletions(-)

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index fb608cf5cd..a8415463af 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -29,6 +29,8 @@
 #include "commands/comment.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/optimizer.h"
 #include "statistics/statistics.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -42,6 +44,7 @@
 static char *ChooseExtendedStatisticName(const char *name1, const char *name2,
 										 const char *label, Oid namespaceid);
 static char *ChooseExtendedStatisticNameAddition(List *exprs);
+static bool CheckMutability(Expr *expr);
 
 
 /* qsort comparator for the attnums in CreateStatistics */
@@ -62,6 +65,7 @@ ObjectAddress
 CreateStatistics(CreateStatsStmt *stmt)
 {
 	int16		attnums[STATS_MAX_DIMENSIONS];
+	int			nattnums = 0;
 	int			numcols = 0;
 	char	   *namestr;
 	NameData	stxname;
@@ -74,6 +78,8 @@ CreateStatistics(CreateStatsStmt *stmt)
 	Datum		datavalues[Natts_pg_statistic_ext_data];
 	bool		datanulls[Natts_pg_statistic_ext_data];
 	int2vector *stxkeys;
+	List	   *stxexprs = NIL;
+	Datum		exprsDatum;
 	Relation	statrel;
 	Relation	datarel;
 	Relation	rel = NULL;
@@ -192,56 +198,95 @@ CreateStatistics(CreateStatsStmt *stmt)
 	foreach(cell, stmt->exprs)
 	{
 		Node	   *expr = (Node *) lfirst(cell);
-		ColumnRef  *cref;
-		char	   *attname;
+		StatsElem  *selem;
 		HeapTuple	atttuple;
 		Form_pg_attribute attForm;
 		TypeCacheEntry *type;
 
-		if (!IsA(expr, ColumnRef))
+		if (!IsA(expr, StatsElem))
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("only simple column references are allowed in CREATE STATISTICS")));
-		cref = (ColumnRef *) expr;
+		selem = (StatsElem *) expr;
 
-		if (list_length(cref->fields) != 1)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("only simple column references are allowed in CREATE STATISTICS")));
-		attname = strVal((Value *) linitial(cref->fields));
-
-		atttuple = SearchSysCacheAttName(relid, attname);
-		if (!HeapTupleIsValid(atttuple))
-			ereport(ERROR,
-					(errcode(ERRCODE_UNDEFINED_COLUMN),
-					 errmsg("column \"%s\" does not exist",
-							attname)));
-		attForm = (Form_pg_attribute) GETSTRUCT(atttuple);
-
-		/* Disallow use of system attributes in extended stats */
-		if (attForm->attnum <= 0)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("statistics creation on system columns is not supported")));
-
-		/* Disallow data types without a less-than operator */
-		type = lookup_type_cache(attForm->atttypid, TYPECACHE_LT_OPR);
-		if (type->lt_opr == InvalidOid)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("column \"%s\" cannot be used in statistics because its type %s has no default btree operator class",
-							attname, format_type_be(attForm->atttypid))));
-
-		/* Make sure no more than STATS_MAX_DIMENSIONS columns are used */
-		if (numcols >= STATS_MAX_DIMENSIONS)
-			ereport(ERROR,
-					(errcode(ERRCODE_TOO_MANY_COLUMNS),
-					 errmsg("cannot have more than %d columns in statistics",
-							STATS_MAX_DIMENSIONS)));
-
-		attnums[numcols] = attForm->attnum;
-		numcols++;
-		ReleaseSysCache(atttuple);
+		if (selem->name)	/* column reference */
+		{
+			char	   *attname;
+			attname = selem->name;
+
+			atttuple = SearchSysCacheAttName(relid, attname);
+			if (!HeapTupleIsValid(atttuple))
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_COLUMN),
+						 errmsg("column \"%s\" does not exist",
+								attname)));
+			attForm = (Form_pg_attribute) GETSTRUCT(atttuple);
+
+			/* Disallow use of system attributes in extended stats */
+			if (attForm->attnum <= 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("statistics creation on system columns is not supported")));
+
+			/* Disallow data types without a less-than operator */
+			type = lookup_type_cache(attForm->atttypid, TYPECACHE_LT_OPR);
+			if (type->lt_opr == InvalidOid)
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("column \"%s\" cannot be used in statistics because its type %s has no default btree operator class",
+								attname, format_type_be(attForm->atttypid))));
+
+			/* Make sure no more than STATS_MAX_DIMENSIONS columns are used */
+			if (numcols >= STATS_MAX_DIMENSIONS)
+				ereport(ERROR,
+						(errcode(ERRCODE_TOO_MANY_COLUMNS),
+						 errmsg("cannot have more than %d columns in statistics",
+								STATS_MAX_DIMENSIONS)));
+
+			attnums[nattnums] = attForm->attnum;
+			nattnums++;
+			numcols++;
+			ReleaseSysCache(atttuple);
+		}
+		else	/* expression */
+		{
+			Node	   *expr = selem->expr;
+			TypeCacheEntry *type;
+			Oid			atttype;
+
+			Assert(expr != NULL);
+
+			/*
+			 * An expression using mutable functions is probably wrong,
+			 * since if you aren't going to get the same result for the
+			 * same data every time, it's not clear what the index entries
+			 * mean at all.
+			 */
+			if (CheckMutability((Expr *) expr))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+						 errmsg("functions in statistics expression must be marked IMMUTABLE")));
+
+			/* Disallow data types without a less-than operator */
+			atttype = exprType(expr);
+			type = lookup_type_cache(atttype, TYPECACHE_LT_OPR);
+			if (type->lt_opr == InvalidOid)
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("expression cannot be used in statistics because its type %s has no default btree operator class",
+								format_type_be(atttype))));
+
+			/* Make sure no more than STATS_MAX_DIMENSIONS columns are used */
+			if (numcols >= STATS_MAX_DIMENSIONS)
+				ereport(ERROR,
+						(errcode(ERRCODE_TOO_MANY_COLUMNS),
+						 errmsg("cannot have more than %d columns in statistics",
+								STATS_MAX_DIMENSIONS)));
+
+			numcols++;
+
+			stxexprs = lappend(stxexprs, expr);
+		}
 	}
 
 	/*
@@ -258,13 +303,13 @@ CreateStatistics(CreateStatsStmt *stmt)
 	 * it does not hurt (it does not affect the efficiency, unlike for
 	 * indexes, for example).
 	 */
-	qsort(attnums, numcols, sizeof(int16), compare_int16);
+	qsort(attnums, nattnums, sizeof(int16), compare_int16);
 
 	/*
 	 * Check for duplicates in the list of columns. The attnums are sorted so
 	 * just check consecutive elements.
 	 */
-	for (i = 1; i < numcols; i++)
+	for (i = 1; i < nattnums; i++)
 	{
 		if (attnums[i] == attnums[i - 1])
 			ereport(ERROR,
@@ -273,7 +318,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 	}
 
 	/* Form an int2vector representation of the sorted column list */
-	stxkeys = buildint2vector(attnums, numcols);
+	stxkeys = buildint2vector(attnums, nattnums);
 
 	/*
 	 * Parse the statistics kinds.
@@ -325,6 +370,18 @@ CreateStatistics(CreateStatsStmt *stmt)
 	Assert(ntypes > 0 && ntypes <= lengthof(types));
 	stxkind = construct_array(types, ntypes, CHAROID, 1, true, 'c');
 
+	/* convert the expressions (if any) to a text datum */
+	if (stxexprs != NIL)
+	{
+		char	   *exprsString;
+
+		exprsString = nodeToString(stxexprs);
+		exprsDatum = CStringGetTextDatum(exprsString);
+		pfree(exprsString);
+	}
+	else
+		exprsDatum = (Datum) 0;
+
 	statrel = table_open(StatisticExtRelationId, RowExclusiveLock);
 
 	/*
@@ -344,6 +401,15 @@ CreateStatistics(CreateStatsStmt *stmt)
 	values[Anum_pg_statistic_ext_stxkeys - 1] = PointerGetDatum(stxkeys);
 	values[Anum_pg_statistic_ext_stxkind - 1] = PointerGetDatum(stxkind);
 
+	values[Anum_pg_statistic_ext_stxexprs - 1] = exprsDatum;
+	if (exprsDatum == (Datum) 0)
+		nulls[Anum_pg_statistic_ext_stxexprs - 1] = true;
+
+	/*
+	 * FIXME add dependencies on anything mentioned in the expressions,
+	 * see recordDependencyOnSingleRelExpr in index_create
+	 */
+
 	/* insert it into pg_statistic_ext */
 	htup = heap_form_tuple(statrel->rd_att, values, nulls);
 	CatalogTupleInsert(statrel, htup);
@@ -387,7 +453,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 	 */
 	ObjectAddressSet(myself, StatisticExtRelationId, statoid);
 
-	for (i = 0; i < numcols; i++)
+	for (i = 0; i < nattnums; i++)
 	{
 		ObjectAddressSubSet(parentobject, RelationRelationId, relid, attnums[i]);
 		recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
@@ -722,14 +788,14 @@ ChooseExtendedStatisticNameAddition(List *exprs)
 	buf[0] = '\0';
 	foreach(lc, exprs)
 	{
-		ColumnRef  *cref = (ColumnRef *) lfirst(lc);
+		StatsElem  *selem = (StatsElem *) lfirst(lc);
 		const char *name;
 
 		/* It should be one of these, but just skip if it happens not to be */
-		if (!IsA(cref, ColumnRef))
+		if (!IsA(selem, StatsElem))
 			continue;
 
-		name = strVal((Value *) linitial(cref->fields));
+		name = selem->name;
 
 		if (buflen > 0)
 			buf[buflen++] = '_';	/* insert _ between names */
@@ -745,3 +811,29 @@ ChooseExtendedStatisticNameAddition(List *exprs)
 	}
 	return pstrdup(buf);
 }
+
+/*
+ * CheckMutability
+ *		Test whether given expression is mutable
+ */
+static bool
+CheckMutability(Expr *expr)
+{
+	/*
+	 * First run the expression through the planner.  This has a couple of
+	 * important consequences.  First, function default arguments will get
+	 * inserted, which may affect volatility (consider "default now()").
+	 * Second, inline-able functions will get inlined, which may allow us to
+	 * conclude that the function is really less volatile than it's marked. As
+	 * an example, polymorphic functions must be marked with the most volatile
+	 * behavior that they have for any input type, but once we inline the
+	 * function we may be able to conclude that it's not so volatile for the
+	 * particular input type we're dealing with.
+	 *
+	 * We assume here that expression_planner() won't scribble on its input.
+	 */
+	expr = expression_planner(expr);
+
+	/* Now we can search for non-immutable functions */
+	return contain_mutable_functions((Node *) expr);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 54ad62bb7f..477f670862 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2882,6 +2882,17 @@ _copyIndexElem(const IndexElem *from)
 	return newnode;
 }
 
+static StatsElem *
+_copyStatsElem(const StatsElem *from)
+{
+	StatsElem  *newnode = makeNode(StatsElem);
+
+	COPY_STRING_FIELD(name);
+	COPY_NODE_FIELD(expr);
+
+	return newnode;
+}
+
 static ColumnDef *
 _copyColumnDef(const ColumnDef *from)
 {
@@ -5565,6 +5576,9 @@ copyObjectImpl(const void *from)
 		case T_IndexElem:
 			retval = _copyIndexElem(from);
 			break;
+		case T_StatsElem:
+			retval = _copyStatsElem(from);
+			break;
 		case T_ColumnDef:
 			retval = _copyColumnDef(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 5b1ba143b1..956420cce9 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2569,6 +2569,16 @@ _equalIndexElem(const IndexElem *a, const IndexElem *b)
 	return true;
 }
 
+
+static bool
+_equalStatsElem(const StatsElem *a, const StatsElem *b)
+{
+	COMPARE_STRING_FIELD(name);
+	COMPARE_NODE_FIELD(expr);
+
+	return true;
+}
+
 static bool
 _equalColumnDef(const ColumnDef *a, const ColumnDef *b)
 {
@@ -3662,6 +3672,9 @@ equal(const void *a, const void *b)
 		case T_IndexElem:
 			retval = _equalIndexElem(a, b);
 			break;
+		case T_StatsElem:
+			retval = _equalStatsElem(a, b);
+			break;
 		case T_ColumnDef:
 			retval = _equalColumnDef(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..a333e95692 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2870,6 +2870,15 @@ _outIndexElem(StringInfo str, const IndexElem *node)
 	WRITE_ENUM_FIELD(nulls_ordering, SortByNulls);
 }
 
+static void
+_outStatsElem(StringInfo str, const StatsElem *node)
+{
+	WRITE_NODE_TYPE("STATSELEM");
+
+	WRITE_STRING_FIELD(name);
+	WRITE_NODE_FIELD(expr);
+}
+
 static void
 _outQuery(StringInfo str, const Query *node)
 {
@@ -4176,6 +4185,9 @@ outNode(StringInfo str, const void *obj)
 			case T_IndexElem:
 				_outIndexElem(str, obj);
 				break;
+			case T_StatsElem:
+				_outStatsElem(str, obj);
+				break;
 			case T_Query:
 				_outQuery(str, obj);
 				break;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index d82fc5ab8b..01130c5779 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -34,6 +34,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
@@ -1304,6 +1305,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 		HeapTuple	dtup;
 		Bitmapset  *keys = NULL;
 		int			i;
+		List	   *exprs = NIL;
 
 		htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
 		if (!HeapTupleIsValid(htup))
@@ -1322,6 +1324,41 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 		for (i = 0; i < staForm->stxkeys.dim1; i++)
 			keys = bms_add_member(keys, staForm->stxkeys.values[i]);
 
+		/*
+		 * preprocess expression (if any)
+		 *
+		 * FIXME we probably need to cache the result somewhere
+		 */
+		{
+			bool		isnull;
+			Datum		datum;
+
+			/* decode expression (if any) */
+			datum = SysCacheGetAttr(STATEXTOID, htup,
+									Anum_pg_statistic_ext_stxexprs, &isnull);
+
+			if (!isnull)
+			{
+				char *exprsString;
+
+				exprsString = TextDatumGetCString(datum);
+				exprs = (List *) stringToNode(exprsString);
+				pfree(exprsString);
+
+				/*
+				 * Run the expressions through eval_const_expressions. This is not just an
+				 * optimization, but is necessary, because the planner will be comparing
+				 * them to similarly-processed qual clauses, and may fail to detect valid
+				 * matches without this.  We must not use canonicalize_qual, however,
+				 * since these aren't qual expressions.
+				 */
+				exprs = (List *) eval_const_expressions(NULL, (Node *) exprs);
+
+				/* May as well fix opfuncids too */
+				fix_opfuncids((Node *) exprs);
+			}
+		}
+
 		/* add one StatisticExtInfo for each kind built */
 		if (statext_is_kind_built(dtup, STATS_EXT_NDISTINCT))
 		{
@@ -1331,6 +1368,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 			info->rel = rel;
 			info->kind = STATS_EXT_NDISTINCT;
 			info->keys = bms_copy(keys);
+			info->exprs = exprs;
 
 			stainfos = lappend(stainfos, info);
 		}
@@ -1343,6 +1381,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 			info->rel = rel;
 			info->kind = STATS_EXT_DEPENDENCIES;
 			info->keys = bms_copy(keys);
+			info->exprs = exprs;
 
 			stainfos = lappend(stainfos, info);
 		}
@@ -1355,6 +1394,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 			info->rel = rel;
 			info->kind = STATS_EXT_MCV;
 			info->keys = bms_copy(keys);
+			info->exprs = exprs;
 
 			stainfos = lappend(stainfos, info);
 		}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3806687ae3..da87c60dc3 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -225,6 +225,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	WindowDef			*windef;
 	JoinExpr			*jexpr;
 	IndexElem			*ielem;
+	StatsElem			*selem;
 	Alias				*alias;
 	RangeVar			*range;
 	IntoClause			*into;
@@ -386,7 +387,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				old_aggr_definition old_aggr_list
 				oper_argtypes RuleActionList RuleActionMulti
 				opt_column_list columnList opt_name_list
-				sort_clause opt_sort_clause sortby_list index_params
+				sort_clause opt_sort_clause sortby_list index_params stats_params
 				opt_include opt_c_include index_including_params
 				name_list role_list from_clause from_list opt_array_bounds
 				qualified_name_list any_name any_name_list type_name_list
@@ -494,6 +495,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>	func_alias_clause
 %type <sortby>	sortby
 %type <ielem>	index_elem
+%type <selem>	stats_param
 %type <node>	table_ref
 %type <jexpr>	joined_table
 %type <range>	relation_expr
@@ -3965,7 +3967,7 @@ ExistingIndex:   USING INDEX index_name				{ $$ = $3; }
 
 CreateStatsStmt:
 			CREATE STATISTICS any_name
-			opt_name_list ON expr_list FROM from_list
+			opt_name_list ON stats_params FROM from_list
 				{
 					CreateStatsStmt *n = makeNode(CreateStatsStmt);
 					n->defnames = $3;
@@ -3977,7 +3979,7 @@ CreateStatsStmt:
 					$$ = (Node *)n;
 				}
 			| CREATE STATISTICS IF_P NOT EXISTS any_name
-			opt_name_list ON expr_list FROM from_list
+			opt_name_list ON stats_params FROM from_list
 				{
 					CreateStatsStmt *n = makeNode(CreateStatsStmt);
 					n->defnames = $6;
@@ -3990,6 +3992,29 @@ CreateStatsStmt:
 				}
 			;
 
+stats_params:	stats_param							{ $$ = list_make1($1); }
+			| stats_params ',' stats_param			{ $$ = lappend($1, $3); }
+		;
+
+stats_param:	ColId
+				{
+					$$ = makeNode(StatsElem);
+					$$->name = $1;
+					$$->expr = NULL;
+				}
+			| func_expr_windowless
+				{
+					$$ = makeNode(StatsElem);
+					$$->name = NULL;
+					$$->expr = $1;
+				}
+			| '(' a_expr ')'
+				{
+					$$ = makeNode(StatsElem);
+					$$->name = NULL;
+					$$->expr = $2;
+				}
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index f1cc5479e4..169a31bf37 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -484,6 +484,13 @@ check_agglevels_and_constraints(ParseState *pstate, Node *expr)
 			else
 				err = _("grouping operations are not allowed in index predicates");
 
+			break;
+		case EXPR_KIND_STATS_EXPRESSION:
+			if (isAgg)
+				err = _("aggregate functions are not allowed in statistics expressions");
+			else
+				err = _("grouping operations are not allowed in statistics expressions");
+
 			break;
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 			if (isAgg)
@@ -906,6 +913,9 @@ transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
 		case EXPR_KIND_INDEX_EXPRESSION:
 			err = _("window functions are not allowed in index expressions");
 			break;
+		case EXPR_KIND_STATS_EXPRESSION:
+			err = _("window functions are not allowed in stats expressions");
+			break;
 		case EXPR_KIND_INDEX_PREDICATE:
 			err = _("window functions are not allowed in index predicates");
 			break;
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 831db4af95..6ddd839654 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -564,6 +564,7 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 		case EXPR_KIND_FUNCTION_DEFAULT:
 		case EXPR_KIND_INDEX_EXPRESSION:
 		case EXPR_KIND_INDEX_PREDICATE:
+		case EXPR_KIND_STATS_EXPRESSION:
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 		case EXPR_KIND_EXECUTE_PARAMETER:
 		case EXPR_KIND_TRIGGER_WHEN:
@@ -1913,6 +1914,9 @@ transformSubLink(ParseState *pstate, SubLink *sublink)
 		case EXPR_KIND_INDEX_PREDICATE:
 			err = _("cannot use subquery in index predicate");
 			break;
+		case EXPR_KIND_STATS_EXPRESSION:
+			err = _("cannot use subquery in statistics expression");
+			break;
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 			err = _("cannot use subquery in transform expression");
 			break;
@@ -3543,6 +3547,8 @@ ParseExprKindName(ParseExprKind exprKind)
 			return "index expression";
 		case EXPR_KIND_INDEX_PREDICATE:
 			return "index predicate";
+		case EXPR_KIND_STATS_EXPRESSION:
+			return "statistics expression";
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 			return "USING";
 		case EXPR_KIND_EXECUTE_PARAMETER:
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9c3b6ad916..cffc276de0 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -2495,6 +2495,9 @@ check_srf_call_placement(ParseState *pstate, Node *last_srf, int location)
 		case EXPR_KIND_INDEX_PREDICATE:
 			err = _("set-returning functions are not allowed in index predicates");
 			break;
+		case EXPR_KIND_STATS_EXPRESSION:
+			err = _("set-returning functions are not allowed in stats expressions");
+			break;
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 			err = _("set-returning functions are not allowed in transform expressions");
 			break;
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 42095ab830..aeada0b396 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1736,14 +1736,15 @@ generateClonedExtStatsStmt(RangeVar *heapRel, Oid heapRelid,
 	/* Determine which columns the statistics are on */
 	for (i = 0; i < statsrec->stxkeys.dim1; i++)
 	{
-		ColumnRef  *cref = makeNode(ColumnRef);
+		StatsElem  *selem = makeNode(StatsElem);
 		AttrNumber	attnum = statsrec->stxkeys.values[i];
 
-		cref->fields = list_make1(makeString(get_attname(heapRelid,
-														 attnum, false)));
-		cref->location = -1;
+		selem->name = get_attname(heapRelid, attnum, false);
+		selem->expr = NULL;
 
-		def_names = lappend(def_names, cref);
+		/* FIXME handle expressions properly */
+
+		def_names = lappend(def_names, selem);
 	}
 
 	/* finally, build the output node */
@@ -2688,6 +2689,84 @@ transformIndexStmt(Oid relid, IndexStmt *stmt, const char *queryString)
 	return stmt;
 }
 
+/*
+ * transformStatsStmt - parse analysis for CREATE STATISTICS
+ *
+ * To avoid race conditions, it's important that this function rely only on
+ * the passed-in relid (and not on stmt->relation) to determine the target
+ * relation.
+ */
+CreateStatsStmt *
+transformStatsStmt(Oid relid, CreateStatsStmt *stmt, const char *queryString)
+{
+	ParseState *pstate;
+	RangeTblEntry *rte;
+	ListCell   *l;
+	Relation	rel;
+
+	/* Nothing to do if statement already transformed. */
+	if (stmt->transformed)
+		return stmt;
+
+	/*
+	 * We must not scribble on the passed-in CreateStatsStmt, so copy it.  (This is
+	 * overkill, but easy.)
+	 */
+	stmt = copyObject(stmt);
+
+	/* Set up pstate */
+	pstate = make_parsestate(NULL);
+	pstate->p_sourcetext = queryString;
+
+	/*
+	 * Put the parent table into the rtable so that the expressions can refer
+	 * to its fields without qualification.  Caller is responsible for locking
+	 * relation, but we still need to open it.
+	 */
+	rel = relation_open(relid, NoLock);
+	rte = addRangeTableEntryForRelation(pstate, rel,
+										AccessShareLock,
+										NULL, false, true);
+
+	/* no to join list, yes to namespaces */
+	addRTEtoQuery(pstate, rte, false, true, true);
+
+	/* take care of any expressions */
+	foreach(l, stmt->exprs)
+	{
+		StatsElem  *selem = (StatsElem *) lfirst(l);
+
+		if (selem->expr)
+		{
+			/* Now do parse transformation of the expression */
+			selem->expr = transformExpr(pstate, selem->expr,
+										EXPR_KIND_STATS_EXPRESSION);
+
+			/* We have to fix its collations too */
+			assign_expr_collations(pstate, selem->expr);
+		}
+	}
+
+	/*
+	 * Check that only the base rel is mentioned.  (This should be dead code
+	 * now that add_missing_from is history.)
+	 */
+	if (list_length(pstate->p_rtable) != 1)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
+				 errmsg("index expressions and predicates can refer only to the table being indexed")));
+
+	free_parsestate(pstate);
+
+	/* Close relation */
+	table_close(rel, NoLock);
+
+	/* Mark statement as successfully transformed */
+	stmt->transformed = true;
+
+	return stmt;
+}
+
 
 /*
  * transformRuleStmt -
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index e2f6c5bb97..76afb0ea2a 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -69,8 +69,10 @@ static void generate_dependencies(DependencyGenerator state);
 static DependencyGenerator DependencyGenerator_init(int n, int k);
 static void DependencyGenerator_free(DependencyGenerator state);
 static AttrNumber *DependencyGenerator_next(DependencyGenerator state);
-static double dependency_degree(int numrows, HeapTuple *rows, int k,
-								AttrNumber *dependency, VacAttrStats **stats, Bitmapset *attrs);
+static double dependency_degree(int numrows, HeapTuple *rows,
+								Datum *exprvals, bool *exprnulls, int nexprs, int k,
+								AttrNumber *dependency, VacAttrStats **stats,
+								Bitmapset *attrs);
 static bool dependency_is_fully_matched(MVDependency *dependency,
 										Bitmapset *attnums);
 static bool dependency_implies_attribute(MVDependency *dependency,
@@ -213,8 +215,8 @@ DependencyGenerator_next(DependencyGenerator state)
  * the last one.
  */
 static double
-dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
-				  VacAttrStats **stats, Bitmapset *attrs)
+dependency_degree(int numrows, HeapTuple *rows, Datum *exprvals, bool *exprnulls,
+				  int nexprs, int k, AttrNumber *dependency, VacAttrStats **stats, Bitmapset *attrs)
 {
 	int			i,
 				nitems;
@@ -283,8 +285,8 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
 	 * descriptor.  For now that assumption holds, but it might change in the
 	 * future for example if we support statistics on multiple tables.
 	 */
-	items = build_sorted_items(numrows, &nitems, rows, stats[0]->tupDesc,
-							   mss, k, attnums_dep);
+	items = build_sorted_items(numrows, &nitems, rows, exprvals, exprnulls,
+							   nexprs, stats[0]->tupDesc, mss, k, attnums_dep);
 
 	/*
 	 * Walk through the sorted array, split it into rows according to the
@@ -354,7 +356,9 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
  *	   (c) -> b
  */
 MVDependencies *
-statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
+statext_dependencies_build(int numrows, HeapTuple *rows,
+						   Datum *exprvals, bool *exprnulls,
+						   Bitmapset *attrs, List *exprs,
 						   VacAttrStats **stats)
 {
 	int			i,
@@ -365,6 +369,15 @@ statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	/* result */
 	MVDependencies *dependencies = NULL;
 
+	/*
+	 * Copy the bitmapset and add fake attnums representing expressions,
+	 * starting above MaxHeapAttributeNumber.
+	 */
+	attrs = bms_copy(attrs);
+
+	for (i = 1; i <= list_length(exprs); i++)
+		attrs = bms_add_member(attrs, MaxHeapAttributeNumber + i);
+
 	/*
 	 * Transform the bms into an array, to make accessing i-th member easier.
 	 */
@@ -392,7 +405,9 @@ statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 			MVDependency *d;
 
 			/* compute how valid the dependency seems */
-			degree = dependency_degree(numrows, rows, k, dependency, stats, attrs);
+			degree = dependency_degree(numrows, rows, exprvals, exprnulls,
+									   list_length(exprs), k, dependency,
+									   stats, attrs);
 
 			/*
 			 * if the dependency seems entirely invalid, don't store it
@@ -435,6 +450,8 @@ statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 		DependencyGenerator_free(DependencyGenerator);
 	}
 
+	pfree(attrs);
+
 	return dependencies;
 }
 
@@ -914,6 +931,128 @@ find_strongest_dependency(MVDependencies **dependencies, int ndependencies,
 	return strongest;
 }
 
+/*
+ * Similar to dependency_is_compatible_clause, but don't enforce that the
+ * expression is a simple Var.
+ */
+static bool
+dependency_clause_matches_expression(Node *clause, Index relid, List *statlist)
+{
+	List	   *vars;
+	ListCell   *lc, *lc2;
+
+	RestrictInfo *rinfo = (RestrictInfo *) clause;
+	Node		   *clause_expr;
+
+	if (!IsA(rinfo, RestrictInfo))
+		return false;
+
+	/* Pseudoconstants are not interesting (they couldn't contain a Var) */
+	if (rinfo->pseudoconstant)
+		return false;
+
+	/* Clauses referencing multiple, or no, varnos are incompatible */
+	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
+		return false;
+
+	if (is_opclause(rinfo->clause))
+	{
+		/* If it's an opclause, check for Var = Const or Const = Var. */
+		OpExpr	   *expr = (OpExpr *) rinfo->clause;
+
+		/* Only expressions with two arguments are candidates. */
+		if (list_length(expr->args) != 2)
+			return false;
+
+		/* Make sure non-selected argument is a pseudoconstant. */
+		if (is_pseudo_constant_clause(lsecond(expr->args)))
+			clause_expr = linitial(expr->args);
+		else if (is_pseudo_constant_clause(linitial(expr->args)))
+			clause_expr = lsecond(expr->args);
+		else
+			return false;
+
+		/*
+		 * If it's not an "=" operator, just ignore the clause, as it's not
+		 * compatible with functional dependencies.
+		 *
+		 * This uses the function for estimating selectivity, not the operator
+		 * directly (a bit awkward, but well ...).
+		 *
+		 * XXX this is pretty dubious; probably it'd be better to check btree
+		 * or hash opclass membership, so as not to be fooled by custom
+		 * selectivity functions, and to be more consistent with decisions
+		 * elsewhere in the planner.
+		 */
+		if (get_oprrest(expr->opno) != F_EQSEL)
+			return false;
+
+		/* OK to proceed with checking "var" */
+	}
+	else if (is_notclause(rinfo->clause))
+	{
+		/*
+		 * "NOT x" can be interpreted as "x = false", so get the argument and
+		 * proceed with seeing if it's a suitable Var.
+		 */
+		clause_expr = (Node *) get_notclausearg(rinfo->clause);
+	}
+	else
+	{
+		/*
+		 * A boolean expression "x" can be interpreted as "x = true", so
+		 * proceed with seeing if it's a suitable Var.
+		 */
+		clause_expr = (Node *) rinfo->clause;
+	}
+
+	/*
+	 * We may ignore any RelabelType node above the operand.  (There won't be
+	 * more than one, since eval_const_expressions has been applied already.)
+	 */
+	if (IsA(clause_expr, RelabelType))
+		clause_expr = (Node *) ((RelabelType *) clause_expr)->arg;
+
+	vars = pull_var_clause(clause_expr, 0);
+
+	elog(WARNING, "nvars = %d", list_length(vars));
+
+	foreach (lc, vars)
+	{
+		Var *var = (Var *) lfirst(lc);
+
+		/* Ensure Var is from the correct relation */
+		if (var->varno != relid)
+			return false;
+
+		/* We also better ensure the Var is from the current level */
+		if (var->varlevelsup != 0)
+			return false;
+
+		/* Also ignore system attributes (we don't allow stats on those) */
+		if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+			return false;
+	}
+
+	foreach (lc, statlist)
+	{
+		StatisticExtInfo *info = (StatisticExtInfo *) lfirst(lc);
+
+		foreach (lc2, info->exprs)
+		{
+			Node *expr = (Node *) lfirst(lc2);
+
+			if (equal(clause_expr, expr))
+			{
+				elog(WARNING, "match");
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
 /*
  * dependencies_clauselist_selectivity
  *		Return the estimated selectivity of (a subset of) the given clauses
@@ -982,8 +1121,10 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		AttrNumber	attnum;
 
+		dependency_clause_matches_expression(clause, rel->relid, rel->statlist);
+
 		if (!bms_is_member(listidx, *estimatedclauses) &&
-			dependency_is_compatible_clause(clause, rel->relid, &attnum))
+			 dependency_is_compatible_clause(clause, rel->relid, &attnum))
 		{
 			list_attnums[listidx] = bms_make_singleton(attnum);
 			clauses_attnums = bms_add_member(clauses_attnums, attnum);
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index d9e854228c..d9936ed684 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -24,6 +24,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_statistic_ext.h"
 #include "catalog/pg_statistic_ext_data.h"
+#include "executor/executor.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
@@ -63,11 +64,12 @@ typedef struct StatExtEntry
 	Bitmapset  *columns;		/* attribute numbers covered by the object */
 	List	   *types;			/* 'char' list of enabled statistic kinds */
 	int			stattarget;		/* statistics target (-1 for default) */
+	List	   *exprs;			/* expressions */
 } StatExtEntry;
 
 
 static List *fetch_statentries_for_relation(Relation pg_statext, Oid relid);
-static VacAttrStats **lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
+static VacAttrStats **lookup_var_attr_stats(Relation rel, Bitmapset *attrs, List *exprs,
 											int nvacatts, VacAttrStats **vacatts);
 static void statext_store(Oid relid,
 						  MVNDistinct *ndistinct, MVDependencies *dependencies,
@@ -111,11 +113,15 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 		ListCell   *lc2;
 		int			stattarget;
 
+		/* evaluated expressions */
+		Datum	   *exprvals = NULL;
+		bool	   *exprnulls = NULL;
+
 		/*
 		 * Check if we can build these stats based on the column analyzed. If
 		 * not, report this fact (except in autovacuum) and move on.
 		 */
-		stats = lookup_var_attr_stats(onerel, stat->columns,
+		stats = lookup_var_attr_stats(onerel, stat->columns, stat->exprs,
 									  natts, vacattrstats);
 		if (!stats)
 		{
@@ -131,8 +137,8 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 		}
 
 		/* check allowed number of dimensions */
-		Assert(bms_num_members(stat->columns) >= 2 &&
-			   bms_num_members(stat->columns) <= STATS_MAX_DIMENSIONS);
+		Assert(bms_num_members(stat->columns) + list_length(stat->exprs) >= 2 &&
+			   bms_num_members(stat->columns) + list_length(stat->exprs) <= STATS_MAX_DIMENSIONS);
 
 		/* compute statistics target for this statistics */
 		stattarget = statext_compute_stattarget(stat->stattarget,
@@ -147,6 +153,78 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 		if (stattarget == 0)
 			continue;
 
+		if (stat->exprs)
+		{
+			int			i;
+			int			idx;
+			TupleTableSlot *slot;
+			EState	   *estate;
+			ExprContext *econtext;
+			List	   *exprstates = NIL;
+
+			/*
+			 * Need an EState for evaluation of index expressions and
+			 * partial-index predicates.  Create it in the per-index context to be
+			 * sure it gets cleaned up at the bottom of the loop.
+			 */
+			estate = CreateExecutorState();
+			econtext = GetPerTupleExprContext(estate);
+			/* Need a slot to hold the current heap tuple, too */
+			slot = MakeSingleTupleTableSlot(RelationGetDescr(onerel),
+											&TTSOpsHeapTuple);
+
+			/* Arrange for econtext's scan tuple to be the tuple under test */
+			econtext->ecxt_scantuple = slot;
+
+			/* Compute and save index expression values */
+			exprvals = (Datum *) palloc(numrows * list_length(stat->exprs) * sizeof(Datum));
+			exprnulls = (bool *) palloc(numrows * list_length(stat->exprs) * sizeof(bool));
+
+			/* Set up expression evaluation state */
+			exprstates = ExecPrepareExprList(stat->exprs, estate);
+
+			idx = 0;
+			for (i = 0; i < numrows; i++)
+			{
+				/*
+				 * Reset the per-tuple context each time, to reclaim any cruft
+				 * left behind by evaluating the predicate or index expressions.
+				 */
+				ResetExprContext(econtext);
+
+				/* Set up for predicate or expression evaluation */
+				ExecStoreHeapTuple(rows[i], slot, false);
+
+				foreach (lc2, exprstates)
+				{
+					Datum	datum;
+					bool	isnull;
+					ExprState *exprstate = (ExprState *) lfirst(lc2);
+
+					datum = ExecEvalExprSwitchContext(exprstate,
+											   GetPerTupleExprContext(estate),
+											   &isnull);
+					if (isnull)
+					{
+						exprvals[idx] = (Datum) 0;
+						exprnulls[idx] = true;
+					}
+					else
+					{
+						exprvals[idx] = (Datum) datum;
+						exprnulls[idx] = false;
+					}
+
+					idx++;
+				}
+			}
+
+			ExecDropSingleTupleTableSlot(slot);
+			FreeExecutorState(estate);
+
+			elog(WARNING, "idx = %d", idx);
+		}
+
 		/* compute statistic of each requested type */
 		foreach(lc2, stat->types)
 		{
@@ -154,13 +232,19 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 
 			if (t == STATS_EXT_NDISTINCT)
 				ndistinct = statext_ndistinct_build(totalrows, numrows, rows,
-													stat->columns, stats);
+													exprvals, exprnulls,
+													stat->columns, stat->exprs,
+													stats);
 			else if (t == STATS_EXT_DEPENDENCIES)
 				dependencies = statext_dependencies_build(numrows, rows,
-														  stat->columns, stats);
+														  exprvals, exprnulls,
+														  stat->columns,
+														  stat->exprs, stats);
 			else if (t == STATS_EXT_MCV)
-				mcv = statext_mcv_build(numrows, rows, stat->columns, stats,
-										totalrows, stattarget);
+				mcv = statext_mcv_build(numrows, rows,
+										exprvals, exprnulls,
+										stat->columns, stat->exprs,
+										stats, totalrows, stattarget);
 		}
 
 		/* store the statistics in the catalog */
@@ -217,7 +301,7 @@ ComputeExtStatisticsRows(Relation onerel,
 		 * analyzed. If not, ignore it (don't report anything, we'll do that
 		 * during the actual build BuildRelationExtStatistics).
 		 */
-		stats = lookup_var_attr_stats(onerel, stat->columns,
+		stats = lookup_var_attr_stats(onerel, stat->columns, stat->exprs,
 									  natts, vacattrstats);
 
 		if (!stats)
@@ -364,6 +448,7 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
 		ArrayType  *arr;
 		char	   *enabled;
 		Form_pg_statistic_ext staForm;
+		List	   *exprs = NIL;
 
 		entry = palloc0(sizeof(StatExtEntry));
 		staForm = (Form_pg_statistic_ext) GETSTRUCT(htup);
@@ -395,6 +480,34 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
 			entry->types = lappend_int(entry->types, (int) enabled[i]);
 		}
 
+		/* decode expression (if any) */
+		datum = SysCacheGetAttr(STATEXTOID, htup,
+								Anum_pg_statistic_ext_stxexprs, &isnull);
+
+		if (!isnull)
+		{
+			char *exprsString;
+
+			exprsString = TextDatumGetCString(datum);
+			exprs = (List *) stringToNode(exprsString);
+
+			pfree(exprsString);
+
+			/*
+			 * Run the expressions through eval_const_expressions. This is not just an
+			 * optimization, but is necessary, because the planner will be comparing
+			 * them to similarly-processed qual clauses, and may fail to detect valid
+			 * matches without this.  We must not use canonicalize_qual, however,
+			 * since these aren't qual expressions.
+			 */
+			exprs = (List *) eval_const_expressions(NULL, (Node *) exprs);
+
+			/* May as well fix opfuncids too */
+			fix_opfuncids((Node *) exprs);
+		}
+
+		entry->exprs = exprs;
+
 		result = lappend(result, entry);
 	}
 
@@ -403,6 +516,89 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
 	return result;
 }
 
+
+/*
+ * examine_attribute -- pre-analysis of a single column
+ *
+ * Determine whether the column is analyzable; if so, create and initialize
+ * a VacAttrStats struct for it.  If not, return NULL.
+ *
+ * If index_expr isn't NULL, then we're trying to analyze an expression index,
+ * and index_expr is the expression tree representing the column's data.
+ */
+static VacAttrStats *
+examine_attribute(Node *expr)
+{
+	HeapTuple	typtuple;
+	VacAttrStats *stats;
+	int			i;
+	bool		ok;
+
+	/*
+	 * Create the VacAttrStats struct.  Note that we only have a copy of the
+	 * fixed fields of the pg_attribute tuple.
+	 */
+	stats = (VacAttrStats *) palloc0(sizeof(VacAttrStats));
+
+	/* fake the attribute */
+	stats->attr = (Form_pg_attribute) palloc0(ATTRIBUTE_FIXED_PART_SIZE);
+	stats->attr->attstattarget = -1;
+
+	/*
+	 * When analyzing an expression index, believe the expression tree's type
+	 * not the column datatype --- the latter might be the opckeytype storage
+	 * type of the opclass, which is not interesting for our purposes.  (Note:
+	 * if we did anything with non-expression index columns, we'd need to
+	 * figure out where to get the correct type info from, but for now that's
+	 * not a problem.)	It's not clear whether anyone will care about the
+	 * typmod, but we store that too just in case.
+	 */
+	stats->attrtypid = exprType(expr);
+	stats->attrtypmod = exprTypmod(expr);
+	stats->attrcollid = exprCollation(expr);
+
+	typtuple = SearchSysCacheCopy1(TYPEOID,
+								   ObjectIdGetDatum(stats->attrtypid));
+	if (!HeapTupleIsValid(typtuple))
+		elog(ERROR, "cache lookup failed for type %u", stats->attrtypid);
+	stats->attrtype = (Form_pg_type) GETSTRUCT(typtuple);
+	// stats->anl_context = anl_context;
+	stats->tupattnum = InvalidAttrNumber;
+
+	/*
+	 * The fields describing the stats->stavalues[n] element types default to
+	 * the type of the data being analyzed, but the type-specific typanalyze
+	 * function can change them if it wants to store something else.
+	 */
+	for (i = 0; i < STATISTIC_NUM_SLOTS; i++)
+	{
+		stats->statypid[i] = stats->attrtypid;
+		stats->statyplen[i] = stats->attrtype->typlen;
+		stats->statypbyval[i] = stats->attrtype->typbyval;
+		stats->statypalign[i] = stats->attrtype->typalign;
+	}
+
+	/*
+	 * Call the type-specific typanalyze function.  If none is specified, use
+	 * std_typanalyze().
+	 */
+	if (OidIsValid(stats->attrtype->typanalyze))
+		ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
+										   PointerGetDatum(stats)));
+	else
+		ok = std_typanalyze(stats);
+
+	if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)
+	{
+		heap_freetuple(typtuple);
+		pfree(stats->attr);
+		pfree(stats);
+		return NULL;
+	}
+
+	return stats;
+}
+
 /*
  * Using 'vacatts' of size 'nvacatts' as input data, return a newly built
  * VacAttrStats array which includes only the items corresponding to
@@ -411,15 +607,18 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
  * to the caller that the stats should not be built.
  */
 static VacAttrStats **
-lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
+lookup_var_attr_stats(Relation rel, Bitmapset *attrs, List *exprs,
 					  int nvacatts, VacAttrStats **vacatts)
 {
 	int			i = 0;
 	int			x = -1;
+	int			natts;
 	VacAttrStats **stats;
+	ListCell   *lc;
 
-	stats = (VacAttrStats **)
-		palloc(bms_num_members(attrs) * sizeof(VacAttrStats *));
+	natts = bms_num_members(attrs) + list_length(exprs);
+
+	stats = (VacAttrStats **) palloc(natts * sizeof(VacAttrStats *));
 
 	/* lookup VacAttrStats info for the requested columns (same attnum) */
 	while ((x = bms_next_member(attrs, x)) >= 0)
@@ -453,6 +652,19 @@ lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
 		 */
 		Assert(!stats[i]->attr->attisdropped);
 
+		elog(WARNING, "A: %d => %p", i, stats[i]);
+
+		i++;
+	}
+
+	foreach (lc, exprs)
+	{
+		Node *expr = (Node *) lfirst(lc);
+
+		stats[i] = examine_attribute(expr);
+
+		elog(WARNING, "B: %d => %p (%s)", i, stats[i], nodeToString(expr));
+
 		i++;
 	}
 
@@ -717,8 +929,10 @@ build_attnums_array(Bitmapset *attrs, int *numattrs)
  * can simply pfree the return value to release all of it.
  */
 SortItem *
-build_sorted_items(int numrows, int *nitems, HeapTuple *rows, TupleDesc tdesc,
-				   MultiSortSupport mss, int numattrs, AttrNumber *attnums)
+build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
+				   Datum *exprvals, bool *exprnulls, int nexprs,
+				   TupleDesc tdesc, MultiSortSupport mss,
+				   int numattrs, AttrNumber *attnums)
 {
 	int			i,
 				j,
@@ -766,7 +980,16 @@ build_sorted_items(int numrows, int *nitems, HeapTuple *rows, TupleDesc tdesc,
 			Datum		value;
 			bool		isnull;
 
-			value = heap_getattr(rows[i], attnums[j], tdesc, &isnull);
+			if (attnums[j] <= MaxHeapAttributeNumber)
+				value = heap_getattr(rows[i], attnums[j], tdesc, &isnull);
+			else
+			{
+				int	expridx = (attnums[j] - MaxHeapAttributeNumber - 1);
+				int	idx = i * nexprs + expridx;
+
+				value = exprvals[idx];
+				isnull = exprnulls[idx];
+			}
 
 			/*
 			 * If this is a varlena value, check if it's too wide and if yes
@@ -1080,6 +1303,168 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 	return false;
 }
 
+
+
+/*
+ * statext_extract_clause_internal
+ *		Determines if the clause is compatible with MCV lists.
+ *
+ * Does the heavy lifting of actually inspecting the clauses for
+ * statext_is_compatible_clause. It needs to be split like this because
+ * of recursion.  The attnums bitmap is an input/output parameter collecting
+ * attribute numbers from all compatible clauses (recursively).
+ */
+static List *
+statext_extract_clause_internal(PlannerInfo *root, Node *clause, Index relid)
+{
+	List   *result = NIL;
+
+	/* Look inside any binary-compatible relabeling (as in examine_variable) */
+	if (IsA(clause, RelabelType))
+		clause = (Node *) ((RelabelType *) clause)->arg;
+
+	/* plain Var references (boolean Vars or recursive checks) */
+	if (IsA(clause, Var))
+	{
+		Var		   *var = (Var *) clause;
+
+		/* Ensure var is from the correct relation */
+		if (var->varno != relid)
+			return NIL;
+
+		/* we also better ensure the Var is from the current level */
+		if (var->varlevelsup > 0)
+			return NIL;
+
+		/* Also skip system attributes (we don't allow stats on those). */
+		if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+			return NIL;
+
+		// *attnums = bms_add_member(*attnums, var->varattno);
+
+		result = lappend(result, clause);
+
+		return result;
+	}
+
+	/* (Var op Const) or (Const op Var) */
+	if (is_opclause(clause))
+	{
+		RangeTblEntry *rte = root->simple_rte_array[relid];
+		OpExpr	   *expr = (OpExpr *) clause;
+		Var		   *var;
+		Var		   *var2 = NULL;
+
+		/* Only expressions with two arguments are considered compatible. */
+		if (list_length(expr->args) != 2)
+			return NIL;
+
+		/* Check if the expression the right shape (one Var, one Const) */
+		if ((!examine_opclause_expression(expr, &var, NULL, NULL)) &&
+			(!examine_opclause_expression2(expr, &var, &var2)))
+			return NIL;
+
+		/*
+		 * If it's not one of the supported operators ("=", "<", ">", etc.),
+		 * just ignore the clause, as it's not compatible with MCV lists.
+		 *
+		 * This uses the function for estimating selectivity, not the operator
+		 * directly (a bit awkward, but well ...).
+		 */
+		switch (get_oprrest(expr->opno))
+		{
+			case F_EQSEL:
+			case F_NEQSEL:
+			case F_SCALARLTSEL:
+			case F_SCALARLESEL:
+			case F_SCALARGTSEL:
+			case F_SCALARGESEL:
+				/* supported, will continue with inspection of the Var */
+				break;
+
+			default:
+				/* other estimators are considered unknown/unsupported */
+				return NIL;
+		}
+
+		/*
+		 * If there are any securityQuals on the RTE from security barrier
+		 * views or RLS policies, then the user may not have access to all the
+		 * table's data, and we must check that the operator is leak-proof.
+		 *
+		 * If the operator is leaky, then we must ignore this clause for the
+		 * purposes of estimating with MCV lists, otherwise the operator might
+		 * reveal values from the MCV list that the user doesn't have
+		 * permission to see.
+		 */
+		if (rte->securityQuals != NIL &&
+			!get_func_leakproof(get_opcode(expr->opno)))
+			return NIL;
+
+		result = lappend(result, var);
+
+		if (var2)
+			result = lappend(result, var2);
+
+		return result;
+	}
+
+	/* AND/OR/NOT clause */
+	if (is_andclause(clause) ||
+		is_orclause(clause) ||
+		is_notclause(clause))
+	{
+		/*
+		 * AND/OR/NOT-clauses are supported if all sub-clauses are supported
+		 *
+		 * Perhaps we could improve this by handling mixed cases, when some of
+		 * the clauses are supported and some are not. Selectivity for the
+		 * supported subclauses would be computed using extended statistics,
+		 * and the remaining clauses would be estimated using the traditional
+		 * algorithm (product of selectivities).
+		 *
+		 * It however seems overly complex, and in a way we already do that
+		 * because if we reject the whole clause as unsupported here, it will
+		 * be eventually passed to clauselist_selectivity() which does exactly
+		 * this (split into supported/unsupported clauses etc).
+		 */
+		BoolExpr   *expr = (BoolExpr *) clause;
+		ListCell   *lc;
+
+		foreach(lc, expr->args)
+		{
+			/*
+			 * Had we found incompatible clause in the arguments, treat the
+			 * whole clause as incompatible.
+			 */
+			if (!statext_extract_clause_internal(root,
+												 (Node *) lfirst(lc),
+												 relid))
+				return NIL;
+		}
+
+		return result;
+	}
+
+	/* Var IS NULL */
+	if (IsA(clause, NullTest))
+	{
+		NullTest   *nt = (NullTest *) clause;
+
+		/*
+		 * Only simple (Var IS NULL) expressions supported for now. Maybe we
+		 * could use examine_variable to fix this?
+		 */
+		if (!IsA(nt->arg, Var))
+			return false;
+
+		return statext_extract_clause_internal(root, (Node *) (nt->arg),
+											   relid);
+	}
+
+	return false;
+}
+
 /*
  * statext_is_compatible_clause
  *		Determines if the clause is compatible with MCV lists.
@@ -1154,6 +1539,51 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 	return true;
 }
 
+/*
+ * statext_extract_clause
+ *		Determines if the clause is compatible with MCV lists.
+ *
+ * Currently, we only support three types of clauses:
+ *
+ * (a) OpExprs of the form (Var op Const), or (Const op Var), where the op
+ * is one of ("=", "<", ">", ">=", "<=")
+ *
+ * (b) (Var IS [NOT] NULL)
+ *
+ * (c) combinations using AND/OR/NOT
+ *
+ * In the future, the range of supported clauses may be expanded to more
+ * complex cases, for example (Var op Var).
+ */
+static List *
+statext_extract_clause(PlannerInfo *root, Node *clause, Index relid)
+{
+	RestrictInfo *rinfo = (RestrictInfo *) clause;
+	List		 *exprs;
+
+	if (!IsA(rinfo, RestrictInfo))
+		return false;
+
+	/* Pseudoconstants are not really interesting here. */
+	if (rinfo->pseudoconstant)
+		return false;
+
+	/* clauses referencing multiple varnos are incompatible */
+	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
+		return false;
+
+	/* Check the clause and determine what attributes it references. */
+	exprs = statext_extract_clause_internal(root, (Node *) rinfo->clause, relid);
+
+	if (!exprs)
+		return NULL;
+
+	/* FIXME do the same ACL check as in statext_is_compatible_clause */
+
+	/* If we reach here, the clause is OK */
+	return exprs;
+}
+
 /*
  * statext_mcv_clauselist_selectivity
  *		Estimate clauses using the best multi-column statistics.
@@ -1216,7 +1646,8 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 								   bool is_or)
 {
 	ListCell   *l;
-	Bitmapset **list_attnums;
+	Bitmapset **list_attnums;	/* attnums extracted from the clause */
+	bool	   *exact_clauses;	/* covered as-is by at least one statistic */
 	int			listidx;
 	Selectivity	sel = 1.0;
 
@@ -1227,6 +1658,8 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
 
+	exact_clauses = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
 	/*
 	 * Pre-process the clauses list to extract the attnums seen in each item.
 	 * We need to determine if there's any clauses which will be useful for
@@ -1244,11 +1677,76 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		Node	   *clause = (Node *) lfirst(l);
 		Bitmapset  *attnums = NULL;
 
+		/* the clause is considered incompatible by default */
+		list_attnums[listidx] = NULL;
+
+		/* and it's also not covered exactly by the statistic */
+		exact_clauses[listidx] = false;
+
+		/*
+		 * First see if the clause is simple enough to be covered directly
+		 * by the attributes. If not, see if there's at least one statistic
+		 * object using the expression as-is.
+		 */
 		if (!bms_is_member(listidx, *estimatedclauses) &&
 			statext_is_compatible_clause(root, clause, rel->relid, &attnums))
+			/* simple expression, covered through attnum(s) */
 			list_attnums[listidx] = attnums;
 		else
-			list_attnums[listidx] = NULL;
+		{
+			ListCell   *lc;
+
+			List *exprs = statext_extract_clause(root, clause, rel->relid);
+
+			/* complex expression, search for statistic */
+			foreach(lc, rel->statlist)
+			{
+				ListCell		   *lc2;
+				StatisticExtInfo   *info = (StatisticExtInfo *) lfirst(lc);
+				bool				all_found = true;
+
+				/* have we already found all expressions in a statistic? */
+				Assert(!exact_clauses[listidx]);
+
+				/* no expressions */
+				if (!info->exprs)
+					continue;
+
+				foreach (lc2, exprs)
+				{
+					Node   *expr = (Node *) lfirst(lc2);
+
+					/*
+					 * Walk the expressions, see if all expressions extracted from
+					 * the clause are covered by the extended statistic object.
+					 */
+					foreach (lc2, info->exprs)
+					{
+						Node   *stat_expr = (Node *) lfirst(lc2);
+						bool	expr_found = false;
+
+						if (equal(expr, stat_expr))
+						{
+							expr_found = true;
+							break;
+						}
+
+						if (!expr_found)
+						{
+							all_found = false;
+							break;
+						}
+					}
+				}
+
+				/* stop looking for another statistic */
+				if (all_found)
+				{
+					exact_clauses[listidx] = true;
+					break;
+				}
+			}
+		}
 
 		listidx++;
 	}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 4b51af287e..c3c3ede7c5 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -180,7 +180,9 @@ get_mincount_for_mcv_list(int samplerows, double totalrows)
  *
  */
 MCVList *
-statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
+statext_mcv_build(int numrows, HeapTuple *rows,
+				  Datum *exprvals, bool *exprnulls,
+				  Bitmapset *attrs, List *exprs,
 				  VacAttrStats **stats, double totalrows, int stattarget)
 {
 	int			i,
@@ -194,13 +196,23 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	MCVList    *mcvlist = NULL;
 	MultiSortSupport mss;
 
+	/*
+	 * Copy the bitmapset and add fake attnums representing expressions,
+	 * starting above MaxHeapAttributeNumber.
+	 */
+	attrs = bms_copy(attrs);
+
+	for (i = 1; i <= list_length(exprs); i++)
+		attrs = bms_add_member(attrs, MaxHeapAttributeNumber + i);
+
 	attnums = build_attnums_array(attrs, &numattrs);
 
 	/* comparator for all the columns */
 	mss = build_mss(stats, numattrs);
 
 	/* sort the rows */
-	items = build_sorted_items(numrows, &nitems, rows, stats[0]->tupDesc,
+	items = build_sorted_items(numrows, &nitems, rows, exprvals, exprnulls,
+							   list_length(exprs), stats[0]->tupDesc,
 							   mss, numattrs, attnums);
 
 	if (!items)
@@ -337,6 +349,7 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 
 	pfree(items);
 	pfree(groups);
+	pfree(attrs);
 
 	return mcvlist;
 }
diff --git a/src/backend/statistics/mvdistinct.c b/src/backend/statistics/mvdistinct.c
index 977d6f3e2e..dd874c7a04 100644
--- a/src/backend/statistics/mvdistinct.c
+++ b/src/backend/statistics/mvdistinct.c
@@ -37,8 +37,10 @@
 #include "utils/typcache.h"
 
 static double ndistinct_for_combination(double totalrows, int numrows,
-										HeapTuple *rows, VacAttrStats **stats,
-										int k, int *combination);
+										HeapTuple *rows, Datum *exprvals,
+										bool *exprnulls, int nexprs,
+										VacAttrStats **stats, int k,
+										int *combination);
 static double estimate_ndistinct(double totalrows, int numrows, int d, int f1);
 static int	n_choose_k(int n, int k);
 static int	num_combinations(int n);
@@ -84,14 +86,26 @@ static void generate_combinations(CombinationGenerator *state);
  */
 MVNDistinct *
 statext_ndistinct_build(double totalrows, int numrows, HeapTuple *rows,
-						Bitmapset *attrs, VacAttrStats **stats)
+						Datum *exprvals, bool *exprnulls,
+						Bitmapset *attrs, List *exprs,
+						VacAttrStats **stats)
 {
 	MVNDistinct *result;
+	int			i;
 	int			k;
 	int			itemcnt;
-	int			numattrs = bms_num_members(attrs);
+	int			numattrs = bms_num_members(attrs) + list_length(exprs);
 	int			numcombs = num_combinations(numattrs);
 
+	/*
+	 * Copy the bitmapset and add fake attnums representing expressions,
+	 * starting above MaxHeapAttributeNumber.
+	 */
+	attrs = bms_copy(attrs);
+
+	for (i = 1; i <= list_length(exprs); i++)
+		attrs = bms_add_member(attrs, MaxHeapAttributeNumber + i);
+
 	result = palloc(offsetof(MVNDistinct, items) +
 					numcombs * sizeof(MVNDistinctItem));
 	result->magic = STATS_NDISTINCT_MAGIC;
@@ -114,10 +128,18 @@ statext_ndistinct_build(double totalrows, int numrows, HeapTuple *rows,
 
 			item->attrs = NULL;
 			for (j = 0; j < k; j++)
-				item->attrs = bms_add_member(item->attrs,
-											 stats[combination[j]]->attr->attnum);
+			{
+				if (combination[j] <= MaxHeapAttributeNumber)
+					item->attrs = bms_add_member(item->attrs,
+												 stats[combination[j]]->attr->attnum);
+				else
+					item->attrs = bms_add_member(item->attrs, combination[j]);
+			}
+
 			item->ndistinct =
 				ndistinct_for_combination(totalrows, numrows, rows,
+										  exprvals, exprnulls,
+										  list_length(exprs),
 										  stats, k, combination);
 
 			itemcnt++;
@@ -428,6 +450,7 @@ pg_ndistinct_send(PG_FUNCTION_ARGS)
  */
 static double
 ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
+						  Datum *exprvals, bool *exprnulls, int nexprs,
 						  VacAttrStats **stats, int k, int *combination)
 {
 	int			i,
@@ -481,11 +504,17 @@ ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
 		/* accumulate all the data for this dimension into the arrays */
 		for (j = 0; j < numrows; j++)
 		{
-			items[j].values[i] =
-				heap_getattr(rows[j],
-							 colstat->attr->attnum,
-							 colstat->tupDesc,
-							 &items[j].isnull[i]);
+			if (combination[i] <= MaxHeapAttributeNumber)
+				items[j].values[i] =
+					heap_getattr(rows[j],
+								 colstat->attr->attnum,
+								 colstat->tupDesc,
+								 &items[j].isnull[i]);
+			else
+			{
+				items[j].values[i] = exprvals[j * nexprs + combination[i] - MaxHeapAttributeNumber - 1];
+				items[j].isnull[i] = exprnulls[j * nexprs + combination[i] - MaxHeapAttributeNumber - 1];
+			}
 		}
 	}
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index b2c58bf862..701cff1693 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1678,7 +1678,21 @@ ProcessUtilitySlow(ParseState *pstate,
 				break;
 
 			case T_CreateStatsStmt:
-				address = CreateStatistics((CreateStatsStmt *) parsetree);
+				{
+					Oid			relid;
+					CreateStatsStmt *stmt = (CreateStatsStmt *) parsetree;
+					RangeVar   *rel = (RangeVar *) linitial(stmt->relations);
+
+					relid = RangeVarGetRelidExtended(rel, ShareLock,
+												 0,
+												 RangeVarCallbackOwnsRelation,
+												 NULL);
+
+					/* Run parse analysis ... */
+					stmt = transformStatsStmt(relid, stmt, queryString);
+
+					address = CreateStatistics(stmt);
+				}
 				break;
 
 			case T_AlterStatsStmt:
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 116e00bce4..7e44afed16 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -1524,6 +1524,9 @@ pg_get_statisticsobj_worker(Oid statextid, bool missing_ok)
 	bool		dependencies_enabled;
 	bool		mcv_enabled;
 	int			i;
+	List	   *context;
+	ListCell   *lc;
+	List	   *exprs = NIL;
 
 	statexttup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statextid));
 
@@ -1616,6 +1619,62 @@ pg_get_statisticsobj_worker(Oid statextid, bool missing_ok)
 		appendStringInfoString(&buf, quote_identifier(attname));
 	}
 
+	/* deparse expressions */
+
+	{
+			bool		isnull;
+			Datum		datum;
+
+			/* decode expression (if any) */
+			datum = SysCacheGetAttr(STATEXTOID, statexttup,
+									Anum_pg_statistic_ext_stxexprs, &isnull);
+
+			if (!isnull)
+			{
+				char *exprsString;
+
+				exprsString = TextDatumGetCString(datum);
+				exprs = (List *) stringToNode(exprsString);
+				pfree(exprsString);
+
+				/*
+				 * Run the expressions through eval_const_expressions. This is not just an
+				 * optimization, but is necessary, because the planner will be comparing
+				 * them to similarly-processed qual clauses, and may fail to detect valid
+				 * matches without this.  We must not use canonicalize_qual, however,
+				 * since these aren't qual expressions.
+				 */
+				exprs = (List *) eval_const_expressions(NULL, (Node *) exprs);
+
+				/* May as well fix opfuncids too */
+				fix_opfuncids((Node *) exprs);
+			}
+	}
+
+	context = deparse_context_for(get_relation_name(statextrec->stxrelid),
+								  statextrec->stxrelid);
+
+	foreach (lc, exprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+		char	   *str;
+		int			prettyFlags = PRETTYFLAG_INDENT;
+
+		str = deparse_expression_pretty(expr, context, false, false,
+										prettyFlags, 0);
+
+		if (colno > 0)
+			appendStringInfoString(&buf, ", ");
+
+		/* Need parens if it's not a bare function call */
+		if (looks_like_function(expr))
+			appendStringInfoString(&buf, str);
+		else
+			appendStringInfo(&buf, "(%s)", str);
+
+		colno++;
+	}
+
 	appendStringInfo(&buf, " FROM %s",
 					 generate_relation_name(statextrec->stxrelid, NIL));
 
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 18d77ac0b7..bfbe92a543 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3082,6 +3082,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		double		this_srf_multiplier;
 		VariableStatData vardata;
 		List	   *varshere;
+		Relids		varnos;
 		ListCell   *l2;
 
 		/* is expression in this grouping set? */
@@ -3149,6 +3150,16 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			continue;
 		}
 
+		/*
+		 * Are all the variables from the same relation? If yes, search for
+		 * an extended statistic matching this expression exactly.
+		 */
+		varnos = pull_varnos((Node *) varshere);
+		if (bms_membership(varnos) == BMS_SINGLETON)
+		{
+			// FIXME try to match it to expressions in mvdistinct stats
+		}
+
 		/*
 		 * Else add variables to varinfos list
 		 */
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index f3c7eb96fa..92c2deb1ba 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2671,6 +2671,7 @@ describeOneTableDetails(const char *schemaname,
 		/* print any extended statistics */
 		if (pset.sversion >= 100000)
 		{
+			/* FIXME improve this with printing expressions the statistics is defined on */
 			printfPQExpBuffer(&buf,
 							  "SELECT oid, "
 							  "stxrelid::pg_catalog.regclass, "
diff --git a/src/include/catalog/pg_statistic_ext.h b/src/include/catalog/pg_statistic_ext.h
index e9491a0a87..dd0f41cd14 100644
--- a/src/include/catalog/pg_statistic_ext.h
+++ b/src/include/catalog/pg_statistic_ext.h
@@ -52,6 +52,9 @@ CATALOG(pg_statistic_ext,3381,StatisticExtRelationId)
 #ifdef CATALOG_VARLEN
 	char		stxkind[1] BKI_FORCE_NOT_NULL;	/* statistics kinds requested
 												 * to build */
+	pg_node_tree stxexprs;		/* expression trees for stats attributes that
+								 * are not simple column references; one for
+								 * each zero entry in stxkeys[] */
 #endif
 
 } FormData_pg_statistic_ext;
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index baced7eec0..72f6534ceb 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -448,6 +448,7 @@ typedef enum NodeTag
 	T_TypeName,
 	T_ColumnDef,
 	T_IndexElem,
+	T_StatsElem,
 	T_Constraint,
 	T_DefElem,
 	T_RangeTblEntry,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index cdfa0568f7..def7e4fe3f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2808,8 +2808,24 @@ typedef struct CreateStatsStmt
 	List	   *relations;		/* rels to build stats on (list of RangeVar) */
 	char	   *stxcomment;		/* comment to apply to stats, or NULL */
 	bool		if_not_exists;	/* do nothing if stats name already exists */
+	bool		transformed;	/* true when transformIndexStmt is finished */
 } CreateStatsStmt;
 
+/*
+ * StatsElem - statistics parameters (used in CREATE STATISTICS)
+ *
+ * For a plain attribute, 'name' is the name of the referenced table column
+ * and 'expr' is NULL.  For an expression, 'name' is NULL and 'expr' is the
+ * expression tree.
+ */
+typedef struct StatsElem
+{
+	NodeTag		type;
+	char	   *name;			/* name of attribute to index, or NULL */
+	Node	   *expr;			/* expression to index, or NULL */
+} StatsElem;
+
+
 /* ----------------------
  *		Alter Statistics Statement
  * ----------------------
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be197e0..f3ca603570 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -885,6 +885,7 @@ typedef struct StatisticExtInfo
 	RelOptInfo *rel;			/* back-link to statistic's table */
 	char		kind;			/* statistic kind of this entry */
 	Bitmapset  *keys;			/* attnums of the columns covered */
+	List	   *exprs;			/* expressions */
 } StatisticExtInfo;
 
 /*
diff --git a/src/include/parser/parse_node.h b/src/include/parser/parse_node.h
index d25819aa28..82e5190964 100644
--- a/src/include/parser/parse_node.h
+++ b/src/include/parser/parse_node.h
@@ -69,6 +69,7 @@ typedef enum ParseExprKind
 	EXPR_KIND_FUNCTION_DEFAULT, /* default parameter value for function */
 	EXPR_KIND_INDEX_EXPRESSION, /* index expression */
 	EXPR_KIND_INDEX_PREDICATE,	/* index predicate */
+	EXPR_KIND_STATS_EXPRESSION, /* extended statistics expression */
 	EXPR_KIND_ALTER_COL_TRANSFORM,	/* transform expr in ALTER COLUMN TYPE */
 	EXPR_KIND_EXECUTE_PARAMETER,	/* parameter value in EXECUTE */
 	EXPR_KIND_TRIGGER_WHEN,		/* WHEN condition in CREATE TRIGGER */
diff --git a/src/include/parser/parse_utilcmd.h b/src/include/parser/parse_utilcmd.h
index eb73acdbd3..ca94cbd542 100644
--- a/src/include/parser/parse_utilcmd.h
+++ b/src/include/parser/parse_utilcmd.h
@@ -24,6 +24,8 @@ extern List *transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
 									 const char *queryString);
 extern IndexStmt *transformIndexStmt(Oid relid, IndexStmt *stmt,
 									 const char *queryString);
+extern CreateStatsStmt *transformStatsStmt(Oid relid, CreateStatsStmt *stmt,
+									 const char *queryString);
 extern void transformRuleStmt(RuleStmt *stmt, const char *queryString,
 							  List **actions, Node **whereClause);
 extern List *transformCreateSchemaStmt(CreateSchemaStmt *stmt);
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 23217497bb..96a54f8487 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -59,17 +59,23 @@ typedef struct SortItem
 
 extern MVNDistinct *statext_ndistinct_build(double totalrows,
 											int numrows, HeapTuple *rows,
-											Bitmapset *attrs, VacAttrStats **stats);
+											Datum *exprvals, bool *exprnulls,
+											Bitmapset *attrs, List *exprs,
+											VacAttrStats **stats);
 extern bytea *statext_ndistinct_serialize(MVNDistinct *ndistinct);
 extern MVNDistinct *statext_ndistinct_deserialize(bytea *data);
 
 extern MVDependencies *statext_dependencies_build(int numrows, HeapTuple *rows,
-												  Bitmapset *attrs, VacAttrStats **stats);
+												  Datum *exprvals, bool *exprnulls,
+												  Bitmapset *attrs, List *exprs,
+												  VacAttrStats **stats);
 extern bytea *statext_dependencies_serialize(MVDependencies *dependencies);
 extern MVDependencies *statext_dependencies_deserialize(bytea *data);
 
 extern MCVList *statext_mcv_build(int numrows, HeapTuple *rows,
-								  Bitmapset *attrs, VacAttrStats **stats,
+								  Datum *exprvals, bool *exprnulls,
+								  Bitmapset *attrs, List *exprs,
+								  VacAttrStats **stats,
 								  double totalrows, int stattarget);
 extern bytea *statext_mcv_serialize(MCVList *mcv, VacAttrStats **stats);
 extern MCVList *statext_mcv_deserialize(bytea *data);
@@ -93,6 +99,7 @@ extern void *bsearch_arg(const void *key, const void *base,
 extern AttrNumber *build_attnums_array(Bitmapset *attrs, int *numattrs);
 
 extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
+									Datum *exprvals, bool *exprnulls, int nexprs,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-- 
2.21.0

0001-Support-using-extended-stats-for-parts-of-O-20200113.patchtext/plain; charset=us-asciiDownload

From e8714d7edbfbafd3203623680e290d00ec3f1f8c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 2 Dec 2019 23:02:17 +0100
Subject: [PATCH 1/3] Support using extended stats for parts of OR clauses

---
 src/backend/optimizer/path/clausesel.c        | 88 +++++++++++++++----
 src/backend/statistics/extended_stats.c       | 56 +++++++++---
 src/backend/statistics/mcv.c                  |  5 +-
 .../statistics/extended_stats_internal.h      |  3 +-
 src/include/statistics/statistics.h           |  3 +-
 5 files changed, 120 insertions(+), 35 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..8ff756bb31 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,68 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	ListCell   *lc;
+	Selectivity	s1 = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+	int			idx;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to s1 directly.
+		 */
+		s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+											jointype, sjinfo, rel,
+											&estimatedclauses, true);
+	}
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets.
+	 *
+	 * XXX is this too conservative?
+	 */
+	idx = 0;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		if (bms_is_member(idx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s1 = s1 + s2 - s1 * s2;
+	}
+
+	return s1;
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -735,24 +797,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with
+		 * the clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index d17b8d9b1f..ccf9565c75 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1202,7 +1202,8 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
@@ -1289,13 +1290,36 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		}
 
 		/*
-		 * First compute "simple" selectivity, i.e. without the extended
-		 * statistics, and essentially assuming independence of the
-		 * columns/clauses. We'll then use the various selectivities computed from
-		 * MCV list to improve it.
+		 * First compute "simple" selectivity, i.e. without the extended stats,
+		 * and essentially assuming independence of the columns/clauses. We'll
+		 * then use the selectivities computed from MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												jointype, sjinfo, NULL);
+		if (is_or)
+		{
+			ListCell   *lc;
+			Selectivity	s1 = 0.0,
+						s2;
+
+			/*
+			 * Selectivities of OR clauses are computed s1+s2 - s1*s2 to account
+			 * for the probable overlap of selected tuple sets.
+			 */
+			foreach(lc, stat_clauses)
+			{
+				s2 = clause_selectivity(root,
+										(Node *) lfirst(lc),
+										varRelid,
+										jointype,
+										sjinfo);
+
+				s1 = s1 + s2 - s1 * s2;
+			}
+
+			simple_sel = s1;
+		}
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with the
@@ -1303,7 +1327,8 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel,
+											 is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1331,13 +1356,14 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
@@ -1351,10 +1377,14 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 	 * For example, MCV list can give us an exact selectivity for values in
 	 * two columns, while functional dependencies can only provide information
 	 * about the overall strength of the dependency.
+	 *
+	 * Functional dependencies only work for clauses connected by AND, so skip
+	 * this for OR clauses.
 	 */
-	sel *= dependencies_clauselist_selectivity(root, clauses, varRelid,
-											   jointype, sjinfo, rel,
-											   estimatedclauses);
+	if (!is_or)
+		sel *= dependencies_clauselist_selectivity(root, clauses, varRelid,
+												   jointype, sjinfo, rel,
+												   estimatedclauses);
 
 	return sel;
 }
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 87e232fdd4..3f42713aa2 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1795,7 +1795,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1808,7 +1809,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index b512ee908a..5171895bba 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index f5d9b6c73a..e18c9a6539 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
-- 
2.21.0

Pavel Stehule

pavel.stehule@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#1)

Re: Additional improvements to extended statistics

út 14. 1. 2020 v 0:00 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
napsal:

Hi,

Now that I've committed [1] which allows us to use multiple extended
statistics per table, I'd like to start a thread discussing a couple of
additional improvements for extended statistics. I've considered
starting a separate patch for each, but that would be messy as those
changes will touch roughly the same places. So I've organized it into a
single patch series, with the simpler parts at the beginning.

There are three main improvements:

1) improve estimates of OR clauses

Until now, OR clauses pretty much ignored extended statistics, based on
the experience that they're less vulnerable to misestimates. But it's a
bit weird that AND clauses are handled while OR clauses are not, so this
extends the logic to OR clauses.

Status: I think this is fairly OK.

2) support estimating clauses (Var op Var)

Currently, we only support clauses with a single Var, i.e. clauses like

- Var op Const
- Var IS [NOT] NULL
- [NOT] Var
- ...

and AND/OR clauses built from those simple ones. This patch adds support
for clauses of the form (Var op Var), of course assuming both Vars come
from the same relation.

Status: This works, but it feels a bit hackish. Needs more work.

3) support extended statistics on expressions

Currently we only allow simple references to columns in extended stats,
so we can do

CREATE STATISTICS s ON a, b, c FROM t;

but not

CREATE STATISTICS s ON (a+b), (c + 1) FROM t;

+1 for expression's statisctics - it can be great feature.

Pavel

Show quoted text

This patch aims to allow this. At the moment it's a WIP - it does most
of the catalog changes and stats building, but with some hacks/bugs. And
it does not even try to use those statistics during estimation.

The first question is how to extend the current pg_statistic_ext catalog
to support expressions. I've been planning to do it the way we support
expressions for indexes, i.e. have two catalog fields - one for keys,
one for expressions.

One difference is that for statistics we don't care about order of the
keys, so that we don't need to bother with storing 0 keys in place for
expressions - we can simply assume keys are first, then expressions.

And this is what the patch does now.

I'm however wondering whether to keep this split - why not to just treat
everything as expressions, and be done with it? A key just represents a
Var expression, after all. And it would massively simplify a lot of code
that now has to care about both keys and expressions.

Of course, expressions are a bit more expensive, but I wonder how
noticeable that would be.

Opinions?

ragards

[1] https://commitfest.postgresql.org/26/2320/

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#1)

3 attachment(s)

Re: Additional improvements to extended statistics

Hi,

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses, and
added a bunch of regression tests to exercise this code. It's not quite
there yet, but I think it's feasible to get this committed for PG13.

The last part (extended stats on expressions) is far from complete, and
it's not feasible to get it into PG13. There's too much missing stuff.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Support-using-extended-stats-for-parts-of-O-20200306.patchtext/plain; charset=us-asciiDownload

From d7f639b6150fe9fd179066af2a536465d877842a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 2 Dec 2019 23:02:17 +0100
Subject: [PATCH 1/3] Support using extended stats for parts of OR clauses

---
 src/backend/optimizer/path/clausesel.c        | 109 +++++++++++++++---
 src/backend/statistics/extended_stats.c       |  45 +++++++-
 src/backend/statistics/mcv.c                  |   5 +-
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       |   3 +-
 src/test/regress/sql/stats_ext.sql            |   1 -
 7 files changed, 138 insertions(+), 31 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..8c1a404ce2 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,89 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	ListCell   *lc;
+	Selectivity	s1 = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+	int			listidx;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to s1 directly.
+		 */
+		s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+											jointype, sjinfo, rel,
+											&estimatedclauses, true);
+	}
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s1 = s1 + s2 - s1 * s2;
+	}
+
+	return s1;
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -735,24 +818,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 03e69d057f..24ece6f99c 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1225,7 +1225,8 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
@@ -1317,8 +1318,32 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed from
 		 * MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												jointype, sjinfo, NULL);
+		if (is_or)
+		{
+			ListCell   *lc;
+			Selectivity	s1 = 0.0,
+						s2;
+
+			/*
+			 * Selectivities of OR clauses are computed s1+s2 - s1*s2 to account
+			 * for the probable overlap of selected tuple sets.
+			 */
+			foreach(lc, stat_clauses)
+			{
+				s2 = clause_selectivity(root,
+										(Node *) lfirst(lc),
+										varRelid,
+										jointype,
+										sjinfo);
+
+				s1 = s1 + s2 - s1 * s2;
+			}
+
+			simple_sel = s1;
+		}
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with the
@@ -1326,7 +1351,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1354,13 +1379,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 87e232fdd4..3f42713aa2 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1795,7 +1795,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1808,7 +1809,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index b512ee908a..5171895bba 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index f5d9b6c73a..e18c9a6539 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 61237dfb11..5344b70cf4 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -648,11 +648,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
  estimated | actual 
 -----------+--------
-       343 |    200
+       200 |    200
 (1 row)
 
 -- check change of unrelated column type does not reset the MCV statistics
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 84f13e8814..fa989fccb0 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -400,7 +400,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
 -- check change of unrelated column type does not reset the MCV statistics
-- 
2.21.1

0002-Support-clauses-of-the-form-Var-op-Var-20200306.patchtext/plain; charset=us-asciiDownload

From af5921a73a71a8c6adf454c35e2b8e911c94cee7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Mon, 11 Nov 2019 01:34:11 +0100
Subject: [PATCH 2/3] Support clauses of the form Var op Var

---
 src/backend/statistics/extended_stats.c       | 63 ++++++++++++----
 src/backend/statistics/mcv.c                  | 75 ++++++++++++++++++-
 .../statistics/extended_stats_internal.h      |  2 +-
 src/test/regress/expected/stats_ext.out       | 72 ++++++++++++++++++
 src/test/regress/sql/stats_ext.sql            | 22 ++++++
 5 files changed, 217 insertions(+), 17 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 24ece6f99c..1872cd4529 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -986,14 +986,18 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 	{
 		RangeTblEntry *rte = root->simple_rte_array[relid];
 		OpExpr	   *expr = (OpExpr *) clause;
-		Var		   *var;
+		Var		   *var,
+				   *var2;
 
 		/* Only expressions with two arguments are considered compatible. */
 		if (list_length(expr->args) != 2)
 			return false;
 
-		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_opclause_expression(expr, &var, NULL, NULL))
+		/*
+		 * Check if the expression the right shape (one Var and one Const,
+		 * or two Vars).
+		 */
+		if (!examine_opclause_expression(expr, &var, &var2, NULL, NULL))
 			return false;
 
 		/*
@@ -1033,7 +1037,20 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			!get_func_leakproof(get_opcode(expr->opno)))
 			return false;
 
-		return statext_is_compatible_clause_internal(root, (Node *) var,
+		/*
+		 * Check compatibility of the first Var - we get this one for both
+		 * types of supported expressions (Var op Const) and (Var op Var).
+		 */
+		if (!statext_is_compatible_clause_internal(root, (Node *) var,
+												   relid, attnums))
+			return false;
+
+		/* For (Var op Const) we don't get the second Var, and we're done. */
+		if (!var2)
+			return true;
+
+		/* For (Var op Var) check compatibility of the second Var. */
+		return statext_is_compatible_clause_internal(root, (Node *) var2,
 													 relid, attnums);
 	}
 
@@ -1419,19 +1436,21 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
  * examine_opclause_expression
  *		Split expression into Var and Const parts.
  *
- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.
  *
  * Optionally returns pointers to the extracted Var/Const nodes, when passed
  * non-null pointers (varp, cstp and varonleftp). The varonleftp flag specifies
  * on which side of the operator we found the Var node.
  */
 bool
-examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonleftp)
+examine_opclause_expression(OpExpr *expr, Var **var1p, Var **var2p,
+							Const **cstp, bool *varonleftp)
 {
-	Var	   *var;
-	Const  *cst;
+	Var	   *var1 = NULL;
+	Var	   *var2 = NULL;
+	Const  *cst = NULL;
 	bool	varonleft;
 	Node   *leftop,
 		   *rightop;
@@ -1451,22 +1470,38 @@ examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonl
 
 	if (IsA(leftop, Var) && IsA(rightop, Const))
 	{
-		var = (Var *) leftop;
+		var1 = (Var *) leftop;
 		cst = (Const *) rightop;
 		varonleft = true;
 	}
 	else if (IsA(leftop, Const) && IsA(rightop, Var))
 	{
-		var = (Var *) rightop;
+		var1 = (Var *) rightop;
 		cst = (Const *) leftop;
 		varonleft = false;
 	}
+	else if (IsA(leftop, Var) && IsA(rightop, Var))
+	{
+		var1 = (Var *) leftop;
+		var2 = (Var *) rightop;
+		varonleft = false;
+
+		/*
+		 * Both variables have to be for the same relation (otherwise it's
+		 * a join clause, and we don't deal with those yet.
+		 */
+		if (var1->varno != var2->varno)
+			return false;
+	}
 	else
 		return false;
 
 	/* return pointers to the extracted parts if requested */
-	if (varp)
-		*varp = var;
+	if (var1p)
+		*var1p = var1;
+
+	if (var2p)
+		*var2p = var2;
 
 	if (cstp)
 		*cstp = cst;
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3f42713aa2..97d3083451 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1581,16 +1581,25 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 			/* valid only after examine_opclause_expression returns true */
 			Var		   *var;
+			Var		   *var2;
 			Const	   *cst;
 			bool		varonleft;
 
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
-			/* extract the var and const from the expression */
-			if (examine_opclause_expression(expr, &var, &cst, &varonleft))
+			/* extract the vars and const from the expression */
+			if (!examine_opclause_expression(expr, &var, &var2, &cst, &varonleft))
+				continue;	/* XXX Can this actually happen? */
+
+			/* We should always get at least one Var. */
+			Assert(var);
+
+			if (cst)
 			{
 				int			idx;
 
+				Assert(!var2);
+
 				/* match the attribute to a dimension of the statistic */
 				idx = bms_member_index(keys, var->varattno);
 
@@ -1651,6 +1660,68 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					matches[i] = RESULT_MERGE(matches[i], is_or, match);
 				}
 			}
+			else
+			{
+				int			idx;
+				int			idx2;
+
+				Assert(var2);
+
+				/* match the attribute to a dimension of the statistic */
+				idx = bms_member_index(keys, var->varattno);
+				idx2 = bms_member_index(keys, var2->varattno);
+
+				/*
+				 * Walk through the MCV items and evaluate the current clause.
+				 * We can skip items that were already ruled out, and
+				 * terminate if there are no remaining MCV items that might
+				 * possibly match.
+				 */
+				for (i = 0; i < mcvlist->nitems; i++)
+				{
+					bool		match = true;
+					MCVItem    *item = &mcvlist->items[i];
+
+					/*
+					 * When either of the MCV items is NULL we can treat this
+					 * as a mismatch. We must not call the operator because
+					 * of strictness.
+					 */
+					if (item->isnull[idx] || item->isnull[idx2])
+					{
+						matches[i] = RESULT_MERGE(matches[i], is_or, false);
+						continue;
+					}
+
+					/*
+					 * Skip MCV items that can't change result in the bitmap.
+					 * Once the value gets false for AND-lists, or true for
+					 * OR-lists, we don't need to look at more clauses.
+					 */
+					if (RESULT_IS_FINAL(matches[i], is_or))
+						continue;
+
+					/*
+					 * First check whether the constant is below the lower
+					 * boundary (in that case we can skip the bucket, because
+					 * there's no overlap).
+					 *
+					 * We don't store collations used to build the statistics,
+					 * but we can use the collation for the attribute itself,
+					 * as stored in varcollid. We do reset the statistics after
+					 * a type change (including collation change), so this is
+					 * OK. We may need to relax this after allowing extended
+					 * statistics on expressions.
+					 */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   var->varcollid,
+														   item->values[idx],
+														   item->values[idx2]));
+
+					/* update the match bitmap with the result */
+					matches[i] = RESULT_MERGE(matches[i], is_or, match);
+				}
+			}
 		}
 		else if (IsA(clause, NullTest))
 		{
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 5171895bba..804089bc57 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -96,7 +96,7 @@ extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-extern bool examine_opclause_expression(OpExpr *expr, Var **varp,
+extern bool examine_opclause_expression(OpExpr *expr, Var **var1p, Var **var2p,
 										Const **cstp, bool *varonleftp);
 
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 5344b70cf4..4c078ae61f 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -603,6 +603,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        343 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      1667 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+      1667 |      0
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -654,6 +666,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      3750 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -749,6 +773,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d
       3750 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+        25 |   2500
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON b, d FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -758,6 +788,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d
       2500 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- mcv with arrays
 CREATE TABLE mcv_lists_arrays (
     a TEXT[],
@@ -808,6 +844,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
       1094 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      9950 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+        50 |   2500
+(1 row)
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 ANALYZE mcv_lists_bool;
@@ -835,6 +883,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -869,6 +929,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AN
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+         1 |   5000
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -891,6 +957,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AN
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+      5000 |   5000
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index fa989fccb0..b7519b275b 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -381,6 +381,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 
@@ -402,6 +406,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -473,6 +481,8 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON b, d FROM mcv_lists;
 
@@ -480,6 +490,8 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- mcv with arrays
 CREATE TABLE mcv_lists_arrays (
     a TEXT[],
@@ -521,6 +533,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 
@@ -534,6 +550,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -556,6 +576,7 @@ ANALYZE mcv_lists_multi;
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -566,6 +587,7 @@ ANALYZE mcv_lists_multi;
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

0003-Support-for-extended-statistics-on-expressi-20200306.patchtext/plain; charset=us-asciiDownload

From 7957f0cc7bd96981fcbc34b4f5eb5948538769ae Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Thu, 5 Mar 2020 22:36:03 +0100
Subject: [PATCH 3/3] Support for extended statistics on expressions

---
 src/backend/commands/statscmds.c              | 190 +++++--
 src/backend/nodes/copyfuncs.c                 |  14 +
 src/backend/nodes/equalfuncs.c                |  13 +
 src/backend/nodes/outfuncs.c                  |  12 +
 src/backend/optimizer/util/plancat.c          |  40 ++
 src/backend/parser/gram.y                     |  31 +-
 src/backend/parser/parse_agg.c                |  10 +
 src/backend/parser/parse_expr.c               |   6 +
 src/backend/parser/parse_func.c               |   3 +
 src/backend/parser/parse_utilcmd.c            |  89 ++-
 src/backend/statistics/dependencies.c         | 159 +++++-
 src/backend/statistics/extended_stats.c       | 532 +++++++++++++++++-
 src/backend/statistics/mcv.c                  |  17 +-
 src/backend/statistics/mvdistinct.c           |  51 +-
 src/backend/tcop/utility.c                    |  16 +-
 src/backend/utils/adt/ruleutils.c             |  59 ++
 src/backend/utils/adt/selfuncs.c              |  11 +
 src/bin/psql/describe.c                       |   1 +
 src/include/catalog/pg_statistic_ext.h        |   3 +
 src/include/nodes/nodes.h                     |   1 +
 src/include/nodes/parsenodes.h                |  16 +
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/parser/parse_node.h               |   1 +
 src/include/parser/parse_utilcmd.h            |   2 +
 .../statistics/extended_stats_internal.h      |  13 +-
 25 files changed, 1191 insertions(+), 100 deletions(-)

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 988cdba6f5..56559a1e91 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -29,6 +29,8 @@
 #include "commands/comment.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/optimizer.h"
 #include "statistics/statistics.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -42,6 +44,7 @@
 static char *ChooseExtendedStatisticName(const char *name1, const char *name2,
 										 const char *label, Oid namespaceid);
 static char *ChooseExtendedStatisticNameAddition(List *exprs);
+static bool CheckMutability(Expr *expr);
 
 
 /* qsort comparator for the attnums in CreateStatistics */
@@ -62,6 +65,7 @@ ObjectAddress
 CreateStatistics(CreateStatsStmt *stmt)
 {
 	int16		attnums[STATS_MAX_DIMENSIONS];
+	int			nattnums = 0;
 	int			numcols = 0;
 	char	   *namestr;
 	NameData	stxname;
@@ -74,6 +78,8 @@ CreateStatistics(CreateStatsStmt *stmt)
 	Datum		datavalues[Natts_pg_statistic_ext_data];
 	bool		datanulls[Natts_pg_statistic_ext_data];
 	int2vector *stxkeys;
+	List	   *stxexprs = NIL;
+	Datum		exprsDatum;
 	Relation	statrel;
 	Relation	datarel;
 	Relation	rel = NULL;
@@ -192,56 +198,95 @@ CreateStatistics(CreateStatsStmt *stmt)
 	foreach(cell, stmt->exprs)
 	{
 		Node	   *expr = (Node *) lfirst(cell);
-		ColumnRef  *cref;
-		char	   *attname;
+		StatsElem  *selem;
 		HeapTuple	atttuple;
 		Form_pg_attribute attForm;
 		TypeCacheEntry *type;
 
-		if (!IsA(expr, ColumnRef))
+		if (!IsA(expr, StatsElem))
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("only simple column references are allowed in CREATE STATISTICS")));
-		cref = (ColumnRef *) expr;
+		selem = (StatsElem *) expr;
 
-		if (list_length(cref->fields) != 1)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("only simple column references are allowed in CREATE STATISTICS")));
-		attname = strVal((Value *) linitial(cref->fields));
-
-		atttuple = SearchSysCacheAttName(relid, attname);
-		if (!HeapTupleIsValid(atttuple))
-			ereport(ERROR,
-					(errcode(ERRCODE_UNDEFINED_COLUMN),
-					 errmsg("column \"%s\" does not exist",
-							attname)));
-		attForm = (Form_pg_attribute) GETSTRUCT(atttuple);
-
-		/* Disallow use of system attributes in extended stats */
-		if (attForm->attnum <= 0)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("statistics creation on system columns is not supported")));
-
-		/* Disallow data types without a less-than operator */
-		type = lookup_type_cache(attForm->atttypid, TYPECACHE_LT_OPR);
-		if (type->lt_opr == InvalidOid)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("column \"%s\" cannot be used in statistics because its type %s has no default btree operator class",
-							attname, format_type_be(attForm->atttypid))));
-
-		/* Make sure no more than STATS_MAX_DIMENSIONS columns are used */
-		if (numcols >= STATS_MAX_DIMENSIONS)
-			ereport(ERROR,
-					(errcode(ERRCODE_TOO_MANY_COLUMNS),
-					 errmsg("cannot have more than %d columns in statistics",
-							STATS_MAX_DIMENSIONS)));
-
-		attnums[numcols] = attForm->attnum;
-		numcols++;
-		ReleaseSysCache(atttuple);
+		if (selem->name)	/* column reference */
+		{
+			char	   *attname;
+			attname = selem->name;
+
+			atttuple = SearchSysCacheAttName(relid, attname);
+			if (!HeapTupleIsValid(atttuple))
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_COLUMN),
+						 errmsg("column \"%s\" does not exist",
+								attname)));
+			attForm = (Form_pg_attribute) GETSTRUCT(atttuple);
+
+			/* Disallow use of system attributes in extended stats */
+			if (attForm->attnum <= 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("statistics creation on system columns is not supported")));
+
+			/* Disallow data types without a less-than operator */
+			type = lookup_type_cache(attForm->atttypid, TYPECACHE_LT_OPR);
+			if (type->lt_opr == InvalidOid)
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("column \"%s\" cannot be used in statistics because its type %s has no default btree operator class",
+								attname, format_type_be(attForm->atttypid))));
+
+			/* Make sure no more than STATS_MAX_DIMENSIONS columns are used */
+			if (numcols >= STATS_MAX_DIMENSIONS)
+				ereport(ERROR,
+						(errcode(ERRCODE_TOO_MANY_COLUMNS),
+						 errmsg("cannot have more than %d columns in statistics",
+								STATS_MAX_DIMENSIONS)));
+
+			attnums[nattnums] = attForm->attnum;
+			nattnums++;
+			numcols++;
+			ReleaseSysCache(atttuple);
+		}
+		else	/* expression */
+		{
+			Node	   *expr = selem->expr;
+			TypeCacheEntry *type;
+			Oid			atttype;
+
+			Assert(expr != NULL);
+
+			/*
+			 * An expression using mutable functions is probably wrong,
+			 * since if you aren't going to get the same result for the
+			 * same data every time, it's not clear what the index entries
+			 * mean at all.
+			 */
+			if (CheckMutability((Expr *) expr))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+						 errmsg("functions in statistics expression must be marked IMMUTABLE")));
+
+			/* Disallow data types without a less-than operator */
+			atttype = exprType(expr);
+			type = lookup_type_cache(atttype, TYPECACHE_LT_OPR);
+			if (type->lt_opr == InvalidOid)
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("expression cannot be used in statistics because its type %s has no default btree operator class",
+								format_type_be(atttype))));
+
+			/* Make sure no more than STATS_MAX_DIMENSIONS columns are used */
+			if (numcols >= STATS_MAX_DIMENSIONS)
+				ereport(ERROR,
+						(errcode(ERRCODE_TOO_MANY_COLUMNS),
+						 errmsg("cannot have more than %d columns in statistics",
+								STATS_MAX_DIMENSIONS)));
+
+			numcols++;
+
+			stxexprs = lappend(stxexprs, expr);
+		}
 	}
 
 	/*
@@ -258,13 +303,13 @@ CreateStatistics(CreateStatsStmt *stmt)
 	 * it does not hurt (it does not affect the efficiency, unlike for
 	 * indexes, for example).
 	 */
-	qsort(attnums, numcols, sizeof(int16), compare_int16);
+	qsort(attnums, nattnums, sizeof(int16), compare_int16);
 
 	/*
 	 * Check for duplicates in the list of columns. The attnums are sorted so
 	 * just check consecutive elements.
 	 */
-	for (i = 1; i < numcols; i++)
+	for (i = 1; i < nattnums; i++)
 	{
 		if (attnums[i] == attnums[i - 1])
 			ereport(ERROR,
@@ -273,7 +318,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 	}
 
 	/* Form an int2vector representation of the sorted column list */
-	stxkeys = buildint2vector(attnums, numcols);
+	stxkeys = buildint2vector(attnums, nattnums);
 
 	/*
 	 * Parse the statistics kinds.
@@ -325,6 +370,18 @@ CreateStatistics(CreateStatsStmt *stmt)
 	Assert(ntypes > 0 && ntypes <= lengthof(types));
 	stxkind = construct_array(types, ntypes, CHAROID, 1, true, TYPALIGN_CHAR);
 
+	/* convert the expressions (if any) to a text datum */
+	if (stxexprs != NIL)
+	{
+		char	   *exprsString;
+
+		exprsString = nodeToString(stxexprs);
+		exprsDatum = CStringGetTextDatum(exprsString);
+		pfree(exprsString);
+	}
+	else
+		exprsDatum = (Datum) 0;
+
 	statrel = table_open(StatisticExtRelationId, RowExclusiveLock);
 
 	/*
@@ -344,6 +401,15 @@ CreateStatistics(CreateStatsStmt *stmt)
 	values[Anum_pg_statistic_ext_stxkeys - 1] = PointerGetDatum(stxkeys);
 	values[Anum_pg_statistic_ext_stxkind - 1] = PointerGetDatum(stxkind);
 
+	values[Anum_pg_statistic_ext_stxexprs - 1] = exprsDatum;
+	if (exprsDatum == (Datum) 0)
+		nulls[Anum_pg_statistic_ext_stxexprs - 1] = true;
+
+	/*
+	 * FIXME add dependencies on anything mentioned in the expressions,
+	 * see recordDependencyOnSingleRelExpr in index_create
+	 */
+
 	/* insert it into pg_statistic_ext */
 	htup = heap_form_tuple(statrel->rd_att, values, nulls);
 	CatalogTupleInsert(statrel, htup);
@@ -387,7 +453,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 	 */
 	ObjectAddressSet(myself, StatisticExtRelationId, statoid);
 
-	for (i = 0; i < numcols; i++)
+	for (i = 0; i < nattnums; i++)
 	{
 		ObjectAddressSubSet(parentobject, RelationRelationId, relid, attnums[i]);
 		recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
@@ -722,14 +788,14 @@ ChooseExtendedStatisticNameAddition(List *exprs)
 	buf[0] = '\0';
 	foreach(lc, exprs)
 	{
-		ColumnRef  *cref = (ColumnRef *) lfirst(lc);
+		StatsElem  *selem = (StatsElem *) lfirst(lc);
 		const char *name;
 
 		/* It should be one of these, but just skip if it happens not to be */
-		if (!IsA(cref, ColumnRef))
+		if (!IsA(selem, StatsElem))
 			continue;
 
-		name = strVal((Value *) linitial(cref->fields));
+		name = selem->name;
 
 		if (buflen > 0)
 			buf[buflen++] = '_';	/* insert _ between names */
@@ -745,3 +811,29 @@ ChooseExtendedStatisticNameAddition(List *exprs)
 	}
 	return pstrdup(buf);
 }
+
+/*
+ * CheckMutability
+ *		Test whether given expression is mutable
+ */
+static bool
+CheckMutability(Expr *expr)
+{
+	/*
+	 * First run the expression through the planner.  This has a couple of
+	 * important consequences.  First, function default arguments will get
+	 * inserted, which may affect volatility (consider "default now()").
+	 * Second, inline-able functions will get inlined, which may allow us to
+	 * conclude that the function is really less volatile than it's marked. As
+	 * an example, polymorphic functions must be marked with the most volatile
+	 * behavior that they have for any input type, but once we inline the
+	 * function we may be able to conclude that it's not so volatile for the
+	 * particular input type we're dealing with.
+	 *
+	 * We assume here that expression_planner() won't scribble on its input.
+	 */
+	expr = expression_planner(expr);
+
+	/* Now we can search for non-immutable functions */
+	return contain_mutable_functions((Node *) expr);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e04c33e4ad..fee5d3b086 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2883,6 +2883,17 @@ _copyIndexElem(const IndexElem *from)
 	return newnode;
 }
 
+static StatsElem *
+_copyStatsElem(const StatsElem *from)
+{
+	StatsElem  *newnode = makeNode(StatsElem);
+
+	COPY_STRING_FIELD(name);
+	COPY_NODE_FIELD(expr);
+
+	return newnode;
+}
+
 static ColumnDef *
 _copyColumnDef(const ColumnDef *from)
 {
@@ -5566,6 +5577,9 @@ copyObjectImpl(const void *from)
 		case T_IndexElem:
 			retval = _copyIndexElem(from);
 			break;
+		case T_StatsElem:
+			retval = _copyStatsElem(from);
+			break;
 		case T_ColumnDef:
 			retval = _copyColumnDef(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 5b1ba143b1..956420cce9 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2569,6 +2569,16 @@ _equalIndexElem(const IndexElem *a, const IndexElem *b)
 	return true;
 }
 
+
+static bool
+_equalStatsElem(const StatsElem *a, const StatsElem *b)
+{
+	COMPARE_STRING_FIELD(name);
+	COMPARE_NODE_FIELD(expr);
+
+	return true;
+}
+
 static bool
 _equalColumnDef(const ColumnDef *a, const ColumnDef *b)
 {
@@ -3662,6 +3672,9 @@ equal(const void *a, const void *b)
 		case T_IndexElem:
 			retval = _equalIndexElem(a, b);
 			break;
+		case T_StatsElem:
+			retval = _equalStatsElem(a, b);
+			break;
 		case T_ColumnDef:
 			retval = _equalColumnDef(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..dabf62ed55 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2873,6 +2873,15 @@ _outIndexElem(StringInfo str, const IndexElem *node)
 	WRITE_ENUM_FIELD(nulls_ordering, SortByNulls);
 }
 
+static void
+_outStatsElem(StringInfo str, const StatsElem *node)
+{
+	WRITE_NODE_TYPE("STATSELEM");
+
+	WRITE_STRING_FIELD(name);
+	WRITE_NODE_FIELD(expr);
+}
+
 static void
 _outQuery(StringInfo str, const Query *node)
 {
@@ -4179,6 +4188,9 @@ outNode(StringInfo str, const void *obj)
 			case T_IndexElem:
 				_outIndexElem(str, obj);
 				break;
+			case T_StatsElem:
+				_outStatsElem(str, obj);
+				break;
 			case T_Query:
 				_outQuery(str, obj);
 				break;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index d82fc5ab8b..01130c5779 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -34,6 +34,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
@@ -1304,6 +1305,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 		HeapTuple	dtup;
 		Bitmapset  *keys = NULL;
 		int			i;
+		List	   *exprs = NIL;
 
 		htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
 		if (!HeapTupleIsValid(htup))
@@ -1322,6 +1324,41 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 		for (i = 0; i < staForm->stxkeys.dim1; i++)
 			keys = bms_add_member(keys, staForm->stxkeys.values[i]);
 
+		/*
+		 * preprocess expression (if any)
+		 *
+		 * FIXME we probably need to cache the result somewhere
+		 */
+		{
+			bool		isnull;
+			Datum		datum;
+
+			/* decode expression (if any) */
+			datum = SysCacheGetAttr(STATEXTOID, htup,
+									Anum_pg_statistic_ext_stxexprs, &isnull);
+
+			if (!isnull)
+			{
+				char *exprsString;
+
+				exprsString = TextDatumGetCString(datum);
+				exprs = (List *) stringToNode(exprsString);
+				pfree(exprsString);
+
+				/*
+				 * Run the expressions through eval_const_expressions. This is not just an
+				 * optimization, but is necessary, because the planner will be comparing
+				 * them to similarly-processed qual clauses, and may fail to detect valid
+				 * matches without this.  We must not use canonicalize_qual, however,
+				 * since these aren't qual expressions.
+				 */
+				exprs = (List *) eval_const_expressions(NULL, (Node *) exprs);
+
+				/* May as well fix opfuncids too */
+				fix_opfuncids((Node *) exprs);
+			}
+		}
+
 		/* add one StatisticExtInfo for each kind built */
 		if (statext_is_kind_built(dtup, STATS_EXT_NDISTINCT))
 		{
@@ -1331,6 +1368,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 			info->rel = rel;
 			info->kind = STATS_EXT_NDISTINCT;
 			info->keys = bms_copy(keys);
+			info->exprs = exprs;
 
 			stainfos = lappend(stainfos, info);
 		}
@@ -1343,6 +1381,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 			info->rel = rel;
 			info->kind = STATS_EXT_DEPENDENCIES;
 			info->keys = bms_copy(keys);
+			info->exprs = exprs;
 
 			stainfos = lappend(stainfos, info);
 		}
@@ -1355,6 +1394,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 			info->rel = rel;
 			info->kind = STATS_EXT_MCV;
 			info->keys = bms_copy(keys);
+			info->exprs = exprs;
 
 			stainfos = lappend(stainfos, info);
 		}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 96e7fdbcfe..90204b5768 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -225,6 +225,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	WindowDef			*windef;
 	JoinExpr			*jexpr;
 	IndexElem			*ielem;
+	StatsElem			*selem;
 	Alias				*alias;
 	RangeVar			*range;
 	IntoClause			*into;
@@ -386,7 +387,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				old_aggr_definition old_aggr_list
 				oper_argtypes RuleActionList RuleActionMulti
 				opt_column_list columnList opt_name_list
-				sort_clause opt_sort_clause sortby_list index_params
+				sort_clause opt_sort_clause sortby_list index_params stats_params
 				opt_include opt_c_include index_including_params
 				name_list role_list from_clause from_list opt_array_bounds
 				qualified_name_list any_name any_name_list type_name_list
@@ -494,6 +495,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>	func_alias_clause
 %type <sortby>	sortby
 %type <ielem>	index_elem
+%type <selem>	stats_param
 %type <node>	table_ref
 %type <jexpr>	joined_table
 %type <range>	relation_expr
@@ -3982,7 +3984,7 @@ ExistingIndex:   USING INDEX index_name				{ $$ = $3; }
 
 CreateStatsStmt:
 			CREATE STATISTICS any_name
-			opt_name_list ON expr_list FROM from_list
+			opt_name_list ON stats_params FROM from_list
 				{
 					CreateStatsStmt *n = makeNode(CreateStatsStmt);
 					n->defnames = $3;
@@ -3994,7 +3996,7 @@ CreateStatsStmt:
 					$$ = (Node *)n;
 				}
 			| CREATE STATISTICS IF_P NOT EXISTS any_name
-			opt_name_list ON expr_list FROM from_list
+			opt_name_list ON stats_params FROM from_list
 				{
 					CreateStatsStmt *n = makeNode(CreateStatsStmt);
 					n->defnames = $6;
@@ -4007,6 +4009,29 @@ CreateStatsStmt:
 				}
 			;
 
+stats_params:	stats_param							{ $$ = list_make1($1); }
+			| stats_params ',' stats_param			{ $$ = lappend($1, $3); }
+		;
+
+stats_param:	ColId
+				{
+					$$ = makeNode(StatsElem);
+					$$->name = $1;
+					$$->expr = NULL;
+				}
+			| func_expr_windowless
+				{
+					$$ = makeNode(StatsElem);
+					$$->name = NULL;
+					$$->expr = $1;
+				}
+			| '(' a_expr ')'
+				{
+					$$ = makeNode(StatsElem);
+					$$->name = NULL;
+					$$->expr = $2;
+				}
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index f1cc5479e4..169a31bf37 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -484,6 +484,13 @@ check_agglevels_and_constraints(ParseState *pstate, Node *expr)
 			else
 				err = _("grouping operations are not allowed in index predicates");
 
+			break;
+		case EXPR_KIND_STATS_EXPRESSION:
+			if (isAgg)
+				err = _("aggregate functions are not allowed in statistics expressions");
+			else
+				err = _("grouping operations are not allowed in statistics expressions");
+
 			break;
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 			if (isAgg)
@@ -906,6 +913,9 @@ transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
 		case EXPR_KIND_INDEX_EXPRESSION:
 			err = _("window functions are not allowed in index expressions");
 			break;
+		case EXPR_KIND_STATS_EXPRESSION:
+			err = _("window functions are not allowed in stats expressions");
+			break;
 		case EXPR_KIND_INDEX_PREDICATE:
 			err = _("window functions are not allowed in index predicates");
 			break;
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 831db4af95..6ddd839654 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -564,6 +564,7 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 		case EXPR_KIND_FUNCTION_DEFAULT:
 		case EXPR_KIND_INDEX_EXPRESSION:
 		case EXPR_KIND_INDEX_PREDICATE:
+		case EXPR_KIND_STATS_EXPRESSION:
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 		case EXPR_KIND_EXECUTE_PARAMETER:
 		case EXPR_KIND_TRIGGER_WHEN:
@@ -1913,6 +1914,9 @@ transformSubLink(ParseState *pstate, SubLink *sublink)
 		case EXPR_KIND_INDEX_PREDICATE:
 			err = _("cannot use subquery in index predicate");
 			break;
+		case EXPR_KIND_STATS_EXPRESSION:
+			err = _("cannot use subquery in statistics expression");
+			break;
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 			err = _("cannot use subquery in transform expression");
 			break;
@@ -3543,6 +3547,8 @@ ParseExprKindName(ParseExprKind exprKind)
 			return "index expression";
 		case EXPR_KIND_INDEX_PREDICATE:
 			return "index predicate";
+		case EXPR_KIND_STATS_EXPRESSION:
+			return "statistics expression";
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 			return "USING";
 		case EXPR_KIND_EXECUTE_PARAMETER:
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9c3b6ad916..cffc276de0 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -2495,6 +2495,9 @@ check_srf_call_placement(ParseState *pstate, Node *last_srf, int location)
 		case EXPR_KIND_INDEX_PREDICATE:
 			err = _("set-returning functions are not allowed in index predicates");
 			break;
+		case EXPR_KIND_STATS_EXPRESSION:
+			err = _("set-returning functions are not allowed in stats expressions");
+			break;
 		case EXPR_KIND_ALTER_COL_TRANSFORM:
 			err = _("set-returning functions are not allowed in transform expressions");
 			break;
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index af77f1890f..f63068e5fc 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1747,14 +1747,15 @@ generateClonedExtStatsStmt(RangeVar *heapRel, Oid heapRelid,
 	/* Determine which columns the statistics are on */
 	for (i = 0; i < statsrec->stxkeys.dim1; i++)
 	{
-		ColumnRef  *cref = makeNode(ColumnRef);
+		StatsElem  *selem = makeNode(StatsElem);
 		AttrNumber	attnum = statsrec->stxkeys.values[i];
 
-		cref->fields = list_make1(makeString(get_attname(heapRelid,
-														 attnum, false)));
-		cref->location = -1;
+		selem->name = get_attname(heapRelid, attnum, false);
+		selem->expr = NULL;
 
-		def_names = lappend(def_names, cref);
+		/* FIXME handle expressions properly */
+
+		def_names = lappend(def_names, selem);
 	}
 
 	/* finally, build the output node */
@@ -2699,6 +2700,84 @@ transformIndexStmt(Oid relid, IndexStmt *stmt, const char *queryString)
 	return stmt;
 }
 
+/*
+ * transformStatsStmt - parse analysis for CREATE STATISTICS
+ *
+ * To avoid race conditions, it's important that this function rely only on
+ * the passed-in relid (and not on stmt->relation) to determine the target
+ * relation.
+ */
+CreateStatsStmt *
+transformStatsStmt(Oid relid, CreateStatsStmt *stmt, const char *queryString)
+{
+	ParseState *pstate;
+	RangeTblEntry *rte;
+	ListCell   *l;
+	Relation	rel;
+
+	/* Nothing to do if statement already transformed. */
+	if (stmt->transformed)
+		return stmt;
+
+	/*
+	 * We must not scribble on the passed-in CreateStatsStmt, so copy it.  (This is
+	 * overkill, but easy.)
+	 */
+	stmt = copyObject(stmt);
+
+	/* Set up pstate */
+	pstate = make_parsestate(NULL);
+	pstate->p_sourcetext = queryString;
+
+	/*
+	 * Put the parent table into the rtable so that the expressions can refer
+	 * to its fields without qualification.  Caller is responsible for locking
+	 * relation, but we still need to open it.
+	 */
+	rel = relation_open(relid, NoLock);
+	rte = addRangeTableEntryForRelation(pstate, rel,
+										AccessShareLock,
+										NULL, false, true);
+
+	/* no to join list, yes to namespaces */
+	addRTEtoQuery(pstate, rte, false, true, true);
+
+	/* take care of any expressions */
+	foreach(l, stmt->exprs)
+	{
+		StatsElem  *selem = (StatsElem *) lfirst(l);
+
+		if (selem->expr)
+		{
+			/* Now do parse transformation of the expression */
+			selem->expr = transformExpr(pstate, selem->expr,
+										EXPR_KIND_STATS_EXPRESSION);
+
+			/* We have to fix its collations too */
+			assign_expr_collations(pstate, selem->expr);
+		}
+	}
+
+	/*
+	 * Check that only the base rel is mentioned.  (This should be dead code
+	 * now that add_missing_from is history.)
+	 */
+	if (list_length(pstate->p_rtable) != 1)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
+				 errmsg("index expressions and predicates can refer only to the table being indexed")));
+
+	free_parsestate(pstate);
+
+	/* Close relation */
+	table_close(rel, NoLock);
+
+	/* Mark statement as successfully transformed */
+	stmt->transformed = true;
+
+	return stmt;
+}
+
 
 /*
  * transformRuleStmt -
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index e2f6c5bb97..76afb0ea2a 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -69,8 +69,10 @@ static void generate_dependencies(DependencyGenerator state);
 static DependencyGenerator DependencyGenerator_init(int n, int k);
 static void DependencyGenerator_free(DependencyGenerator state);
 static AttrNumber *DependencyGenerator_next(DependencyGenerator state);
-static double dependency_degree(int numrows, HeapTuple *rows, int k,
-								AttrNumber *dependency, VacAttrStats **stats, Bitmapset *attrs);
+static double dependency_degree(int numrows, HeapTuple *rows,
+								Datum *exprvals, bool *exprnulls, int nexprs, int k,
+								AttrNumber *dependency, VacAttrStats **stats,
+								Bitmapset *attrs);
 static bool dependency_is_fully_matched(MVDependency *dependency,
 										Bitmapset *attnums);
 static bool dependency_implies_attribute(MVDependency *dependency,
@@ -213,8 +215,8 @@ DependencyGenerator_next(DependencyGenerator state)
  * the last one.
  */
 static double
-dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
-				  VacAttrStats **stats, Bitmapset *attrs)
+dependency_degree(int numrows, HeapTuple *rows, Datum *exprvals, bool *exprnulls,
+				  int nexprs, int k, AttrNumber *dependency, VacAttrStats **stats, Bitmapset *attrs)
 {
 	int			i,
 				nitems;
@@ -283,8 +285,8 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
 	 * descriptor.  For now that assumption holds, but it might change in the
 	 * future for example if we support statistics on multiple tables.
 	 */
-	items = build_sorted_items(numrows, &nitems, rows, stats[0]->tupDesc,
-							   mss, k, attnums_dep);
+	items = build_sorted_items(numrows, &nitems, rows, exprvals, exprnulls,
+							   nexprs, stats[0]->tupDesc, mss, k, attnums_dep);
 
 	/*
 	 * Walk through the sorted array, split it into rows according to the
@@ -354,7 +356,9 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
  *	   (c) -> b
  */
 MVDependencies *
-statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
+statext_dependencies_build(int numrows, HeapTuple *rows,
+						   Datum *exprvals, bool *exprnulls,
+						   Bitmapset *attrs, List *exprs,
 						   VacAttrStats **stats)
 {
 	int			i,
@@ -365,6 +369,15 @@ statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	/* result */
 	MVDependencies *dependencies = NULL;
 
+	/*
+	 * Copy the bitmapset and add fake attnums representing expressions,
+	 * starting above MaxHeapAttributeNumber.
+	 */
+	attrs = bms_copy(attrs);
+
+	for (i = 1; i <= list_length(exprs); i++)
+		attrs = bms_add_member(attrs, MaxHeapAttributeNumber + i);
+
 	/*
 	 * Transform the bms into an array, to make accessing i-th member easier.
 	 */
@@ -392,7 +405,9 @@ statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 			MVDependency *d;
 
 			/* compute how valid the dependency seems */
-			degree = dependency_degree(numrows, rows, k, dependency, stats, attrs);
+			degree = dependency_degree(numrows, rows, exprvals, exprnulls,
+									   list_length(exprs), k, dependency,
+									   stats, attrs);
 
 			/*
 			 * if the dependency seems entirely invalid, don't store it
@@ -435,6 +450,8 @@ statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 		DependencyGenerator_free(DependencyGenerator);
 	}
 
+	pfree(attrs);
+
 	return dependencies;
 }
 
@@ -914,6 +931,128 @@ find_strongest_dependency(MVDependencies **dependencies, int ndependencies,
 	return strongest;
 }
 
+/*
+ * Similar to dependency_is_compatible_clause, but don't enforce that the
+ * expression is a simple Var.
+ */
+static bool
+dependency_clause_matches_expression(Node *clause, Index relid, List *statlist)
+{
+	List	   *vars;
+	ListCell   *lc, *lc2;
+
+	RestrictInfo *rinfo = (RestrictInfo *) clause;
+	Node		   *clause_expr;
+
+	if (!IsA(rinfo, RestrictInfo))
+		return false;
+
+	/* Pseudoconstants are not interesting (they couldn't contain a Var) */
+	if (rinfo->pseudoconstant)
+		return false;
+
+	/* Clauses referencing multiple, or no, varnos are incompatible */
+	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
+		return false;
+
+	if (is_opclause(rinfo->clause))
+	{
+		/* If it's an opclause, check for Var = Const or Const = Var. */
+		OpExpr	   *expr = (OpExpr *) rinfo->clause;
+
+		/* Only expressions with two arguments are candidates. */
+		if (list_length(expr->args) != 2)
+			return false;
+
+		/* Make sure non-selected argument is a pseudoconstant. */
+		if (is_pseudo_constant_clause(lsecond(expr->args)))
+			clause_expr = linitial(expr->args);
+		else if (is_pseudo_constant_clause(linitial(expr->args)))
+			clause_expr = lsecond(expr->args);
+		else
+			return false;
+
+		/*
+		 * If it's not an "=" operator, just ignore the clause, as it's not
+		 * compatible with functional dependencies.
+		 *
+		 * This uses the function for estimating selectivity, not the operator
+		 * directly (a bit awkward, but well ...).
+		 *
+		 * XXX this is pretty dubious; probably it'd be better to check btree
+		 * or hash opclass membership, so as not to be fooled by custom
+		 * selectivity functions, and to be more consistent with decisions
+		 * elsewhere in the planner.
+		 */
+		if (get_oprrest(expr->opno) != F_EQSEL)
+			return false;
+
+		/* OK to proceed with checking "var" */
+	}
+	else if (is_notclause(rinfo->clause))
+	{
+		/*
+		 * "NOT x" can be interpreted as "x = false", so get the argument and
+		 * proceed with seeing if it's a suitable Var.
+		 */
+		clause_expr = (Node *) get_notclausearg(rinfo->clause);
+	}
+	else
+	{
+		/*
+		 * A boolean expression "x" can be interpreted as "x = true", so
+		 * proceed with seeing if it's a suitable Var.
+		 */
+		clause_expr = (Node *) rinfo->clause;
+	}
+
+	/*
+	 * We may ignore any RelabelType node above the operand.  (There won't be
+	 * more than one, since eval_const_expressions has been applied already.)
+	 */
+	if (IsA(clause_expr, RelabelType))
+		clause_expr = (Node *) ((RelabelType *) clause_expr)->arg;
+
+	vars = pull_var_clause(clause_expr, 0);
+
+	elog(WARNING, "nvars = %d", list_length(vars));
+
+	foreach (lc, vars)
+	{
+		Var *var = (Var *) lfirst(lc);
+
+		/* Ensure Var is from the correct relation */
+		if (var->varno != relid)
+			return false;
+
+		/* We also better ensure the Var is from the current level */
+		if (var->varlevelsup != 0)
+			return false;
+
+		/* Also ignore system attributes (we don't allow stats on those) */
+		if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+			return false;
+	}
+
+	foreach (lc, statlist)
+	{
+		StatisticExtInfo *info = (StatisticExtInfo *) lfirst(lc);
+
+		foreach (lc2, info->exprs)
+		{
+			Node *expr = (Node *) lfirst(lc2);
+
+			if (equal(clause_expr, expr))
+			{
+				elog(WARNING, "match");
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
 /*
  * dependencies_clauselist_selectivity
  *		Return the estimated selectivity of (a subset of) the given clauses
@@ -982,8 +1121,10 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		AttrNumber	attnum;
 
+		dependency_clause_matches_expression(clause, rel->relid, rel->statlist);
+
 		if (!bms_is_member(listidx, *estimatedclauses) &&
-			dependency_is_compatible_clause(clause, rel->relid, &attnum))
+			 dependency_is_compatible_clause(clause, rel->relid, &attnum))
 		{
 			list_attnums[listidx] = bms_make_singleton(attnum);
 			clauses_attnums = bms_add_member(clauses_attnums, attnum);
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 1872cd4529..9f70db7377 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -24,6 +24,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_statistic_ext.h"
 #include "catalog/pg_statistic_ext_data.h"
+#include "executor/executor.h"
 #include "commands/progress.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -65,11 +66,12 @@ typedef struct StatExtEntry
 	Bitmapset  *columns;		/* attribute numbers covered by the object */
 	List	   *types;			/* 'char' list of enabled statistic kinds */
 	int			stattarget;		/* statistics target (-1 for default) */
+	List	   *exprs;			/* expressions */
 } StatExtEntry;
 
 
 static List *fetch_statentries_for_relation(Relation pg_statext, Oid relid);
-static VacAttrStats **lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
+static VacAttrStats **lookup_var_attr_stats(Relation rel, Bitmapset *attrs, List *exprs,
 											int nvacatts, VacAttrStats **vacatts);
 static void statext_store(Oid relid,
 						  MVNDistinct *ndistinct, MVDependencies *dependencies,
@@ -130,11 +132,15 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 		ListCell   *lc2;
 		int			stattarget;
 
+		/* evaluated expressions */
+		Datum	   *exprvals = NULL;
+		bool	   *exprnulls = NULL;
+
 		/*
 		 * Check if we can build these stats based on the column analyzed. If
 		 * not, report this fact (except in autovacuum) and move on.
 		 */
-		stats = lookup_var_attr_stats(onerel, stat->columns,
+		stats = lookup_var_attr_stats(onerel, stat->columns, stat->exprs,
 									  natts, vacattrstats);
 		if (!stats)
 		{
@@ -150,8 +156,8 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 		}
 
 		/* check allowed number of dimensions */
-		Assert(bms_num_members(stat->columns) >= 2 &&
-			   bms_num_members(stat->columns) <= STATS_MAX_DIMENSIONS);
+		Assert(bms_num_members(stat->columns) + list_length(stat->exprs) >= 2 &&
+			   bms_num_members(stat->columns) + list_length(stat->exprs) <= STATS_MAX_DIMENSIONS);
 
 		/* compute statistics target for this statistics */
 		stattarget = statext_compute_stattarget(stat->stattarget,
@@ -166,6 +172,78 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 		if (stattarget == 0)
 			continue;
 
+		if (stat->exprs)
+		{
+			int			i;
+			int			idx;
+			TupleTableSlot *slot;
+			EState	   *estate;
+			ExprContext *econtext;
+			List	   *exprstates = NIL;
+
+			/*
+			 * Need an EState for evaluation of index expressions and
+			 * partial-index predicates.  Create it in the per-index context to be
+			 * sure it gets cleaned up at the bottom of the loop.
+			 */
+			estate = CreateExecutorState();
+			econtext = GetPerTupleExprContext(estate);
+			/* Need a slot to hold the current heap tuple, too */
+			slot = MakeSingleTupleTableSlot(RelationGetDescr(onerel),
+											&TTSOpsHeapTuple);
+
+			/* Arrange for econtext's scan tuple to be the tuple under test */
+			econtext->ecxt_scantuple = slot;
+
+			/* Compute and save index expression values */
+			exprvals = (Datum *) palloc(numrows * list_length(stat->exprs) * sizeof(Datum));
+			exprnulls = (bool *) palloc(numrows * list_length(stat->exprs) * sizeof(bool));
+
+			/* Set up expression evaluation state */
+			exprstates = ExecPrepareExprList(stat->exprs, estate);
+
+			idx = 0;
+			for (i = 0; i < numrows; i++)
+			{
+				/*
+				 * Reset the per-tuple context each time, to reclaim any cruft
+				 * left behind by evaluating the predicate or index expressions.
+				 */
+				ResetExprContext(econtext);
+
+				/* Set up for predicate or expression evaluation */
+				ExecStoreHeapTuple(rows[i], slot, false);
+
+				foreach (lc2, exprstates)
+				{
+					Datum	datum;
+					bool	isnull;
+					ExprState *exprstate = (ExprState *) lfirst(lc2);
+
+					datum = ExecEvalExprSwitchContext(exprstate,
+											   GetPerTupleExprContext(estate),
+											   &isnull);
+					if (isnull)
+					{
+						exprvals[idx] = (Datum) 0;
+						exprnulls[idx] = true;
+					}
+					else
+					{
+						exprvals[idx] = (Datum) datum;
+						exprnulls[idx] = false;
+					}
+
+					idx++;
+				}
+			}
+
+			ExecDropSingleTupleTableSlot(slot);
+			FreeExecutorState(estate);
+
+			elog(WARNING, "idx = %d", idx);
+		}
+
 		/* compute statistic of each requested type */
 		foreach(lc2, stat->types)
 		{
@@ -173,13 +251,19 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 
 			if (t == STATS_EXT_NDISTINCT)
 				ndistinct = statext_ndistinct_build(totalrows, numrows, rows,
-													stat->columns, stats);
+													exprvals, exprnulls,
+													stat->columns, stat->exprs,
+													stats);
 			else if (t == STATS_EXT_DEPENDENCIES)
 				dependencies = statext_dependencies_build(numrows, rows,
-														  stat->columns, stats);
+														  exprvals, exprnulls,
+														  stat->columns,
+														  stat->exprs, stats);
 			else if (t == STATS_EXT_MCV)
-				mcv = statext_mcv_build(numrows, rows, stat->columns, stats,
-										totalrows, stattarget);
+				mcv = statext_mcv_build(numrows, rows,
+										exprvals, exprnulls,
+										stat->columns, stat->exprs,
+										stats, totalrows, stattarget);
 		}
 
 		/* store the statistics in the catalog */
@@ -240,7 +324,7 @@ ComputeExtStatisticsRows(Relation onerel,
 		 * analyzed. If not, ignore it (don't report anything, we'll do that
 		 * during the actual build BuildRelationExtStatistics).
 		 */
-		stats = lookup_var_attr_stats(onerel, stat->columns,
+		stats = lookup_var_attr_stats(onerel, stat->columns, stat->exprs,
 									  natts, vacattrstats);
 
 		if (!stats)
@@ -387,6 +471,7 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
 		ArrayType  *arr;
 		char	   *enabled;
 		Form_pg_statistic_ext staForm;
+		List	   *exprs = NIL;
 
 		entry = palloc0(sizeof(StatExtEntry));
 		staForm = (Form_pg_statistic_ext) GETSTRUCT(htup);
@@ -418,6 +503,34 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
 			entry->types = lappend_int(entry->types, (int) enabled[i]);
 		}
 
+		/* decode expression (if any) */
+		datum = SysCacheGetAttr(STATEXTOID, htup,
+								Anum_pg_statistic_ext_stxexprs, &isnull);
+
+		if (!isnull)
+		{
+			char *exprsString;
+
+			exprsString = TextDatumGetCString(datum);
+			exprs = (List *) stringToNode(exprsString);
+
+			pfree(exprsString);
+
+			/*
+			 * Run the expressions through eval_const_expressions. This is not just an
+			 * optimization, but is necessary, because the planner will be comparing
+			 * them to similarly-processed qual clauses, and may fail to detect valid
+			 * matches without this.  We must not use canonicalize_qual, however,
+			 * since these aren't qual expressions.
+			 */
+			exprs = (List *) eval_const_expressions(NULL, (Node *) exprs);
+
+			/* May as well fix opfuncids too */
+			fix_opfuncids((Node *) exprs);
+		}
+
+		entry->exprs = exprs;
+
 		result = lappend(result, entry);
 	}
 
@@ -426,6 +539,89 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
 	return result;
 }
 
+
+/*
+ * examine_attribute -- pre-analysis of a single column
+ *
+ * Determine whether the column is analyzable; if so, create and initialize
+ * a VacAttrStats struct for it.  If not, return NULL.
+ *
+ * If index_expr isn't NULL, then we're trying to analyze an expression index,
+ * and index_expr is the expression tree representing the column's data.
+ */
+static VacAttrStats *
+examine_attribute(Node *expr)
+{
+	HeapTuple	typtuple;
+	VacAttrStats *stats;
+	int			i;
+	bool		ok;
+
+	/*
+	 * Create the VacAttrStats struct.  Note that we only have a copy of the
+	 * fixed fields of the pg_attribute tuple.
+	 */
+	stats = (VacAttrStats *) palloc0(sizeof(VacAttrStats));
+
+	/* fake the attribute */
+	stats->attr = (Form_pg_attribute) palloc0(ATTRIBUTE_FIXED_PART_SIZE);
+	stats->attr->attstattarget = -1;
+
+	/*
+	 * When analyzing an expression index, believe the expression tree's type
+	 * not the column datatype --- the latter might be the opckeytype storage
+	 * type of the opclass, which is not interesting for our purposes.  (Note:
+	 * if we did anything with non-expression index columns, we'd need to
+	 * figure out where to get the correct type info from, but for now that's
+	 * not a problem.)	It's not clear whether anyone will care about the
+	 * typmod, but we store that too just in case.
+	 */
+	stats->attrtypid = exprType(expr);
+	stats->attrtypmod = exprTypmod(expr);
+	stats->attrcollid = exprCollation(expr);
+
+	typtuple = SearchSysCacheCopy1(TYPEOID,
+								   ObjectIdGetDatum(stats->attrtypid));
+	if (!HeapTupleIsValid(typtuple))
+		elog(ERROR, "cache lookup failed for type %u", stats->attrtypid);
+	stats->attrtype = (Form_pg_type) GETSTRUCT(typtuple);
+	// stats->anl_context = anl_context;
+	stats->tupattnum = InvalidAttrNumber;
+
+	/*
+	 * The fields describing the stats->stavalues[n] element types default to
+	 * the type of the data being analyzed, but the type-specific typanalyze
+	 * function can change them if it wants to store something else.
+	 */
+	for (i = 0; i < STATISTIC_NUM_SLOTS; i++)
+	{
+		stats->statypid[i] = stats->attrtypid;
+		stats->statyplen[i] = stats->attrtype->typlen;
+		stats->statypbyval[i] = stats->attrtype->typbyval;
+		stats->statypalign[i] = stats->attrtype->typalign;
+	}
+
+	/*
+	 * Call the type-specific typanalyze function.  If none is specified, use
+	 * std_typanalyze().
+	 */
+	if (OidIsValid(stats->attrtype->typanalyze))
+		ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
+										   PointerGetDatum(stats)));
+	else
+		ok = std_typanalyze(stats);
+
+	if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)
+	{
+		heap_freetuple(typtuple);
+		pfree(stats->attr);
+		pfree(stats);
+		return NULL;
+	}
+
+	return stats;
+}
+
 /*
  * Using 'vacatts' of size 'nvacatts' as input data, return a newly built
  * VacAttrStats array which includes only the items corresponding to
@@ -434,15 +630,18 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
  * to the caller that the stats should not be built.
  */
 static VacAttrStats **
-lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
+lookup_var_attr_stats(Relation rel, Bitmapset *attrs, List *exprs,
 					  int nvacatts, VacAttrStats **vacatts)
 {
 	int			i = 0;
 	int			x = -1;
+	int			natts;
 	VacAttrStats **stats;
+	ListCell   *lc;
 
-	stats = (VacAttrStats **)
-		palloc(bms_num_members(attrs) * sizeof(VacAttrStats *));
+	natts = bms_num_members(attrs) + list_length(exprs);
+
+	stats = (VacAttrStats **) palloc(natts * sizeof(VacAttrStats *));
 
 	/* lookup VacAttrStats info for the requested columns (same attnum) */
 	while ((x = bms_next_member(attrs, x)) >= 0)
@@ -476,6 +675,19 @@ lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
 		 */
 		Assert(!stats[i]->attr->attisdropped);
 
+		elog(WARNING, "A: %d => %p", i, stats[i]);
+
+		i++;
+	}
+
+	foreach (lc, exprs)
+	{
+		Node *expr = (Node *) lfirst(lc);
+
+		stats[i] = examine_attribute(expr);
+
+		elog(WARNING, "B: %d => %p (%s)", i, stats[i], nodeToString(expr));
+
 		i++;
 	}
 
@@ -740,8 +952,10 @@ build_attnums_array(Bitmapset *attrs, int *numattrs)
  * can simply pfree the return value to release all of it.
  */
 SortItem *
-build_sorted_items(int numrows, int *nitems, HeapTuple *rows, TupleDesc tdesc,
-				   MultiSortSupport mss, int numattrs, AttrNumber *attnums)
+build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
+				   Datum *exprvals, bool *exprnulls, int nexprs,
+				   TupleDesc tdesc, MultiSortSupport mss,
+				   int numattrs, AttrNumber *attnums)
 {
 	int			i,
 				j,
@@ -789,7 +1003,16 @@ build_sorted_items(int numrows, int *nitems, HeapTuple *rows, TupleDesc tdesc,
 			Datum		value;
 			bool		isnull;
 
-			value = heap_getattr(rows[i], attnums[j], tdesc, &isnull);
+			if (attnums[j] <= MaxHeapAttributeNumber)
+				value = heap_getattr(rows[i], attnums[j], tdesc, &isnull);
+			else
+			{
+				int	expridx = (attnums[j] - MaxHeapAttributeNumber - 1);
+				int	idx = i * nexprs + expridx;
+
+				value = exprvals[idx];
+				isnull = exprnulls[idx];
+			}
 
 			/*
 			 * If this is a varlena value, check if it's too wide and if yes
@@ -1110,6 +1333,168 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 	return false;
 }
 
+
+
+/*
+ * statext_extract_clause_internal
+ *		Determines if the clause is compatible with MCV lists.
+ *
+ * Does the heavy lifting of actually inspecting the clauses for
+ * statext_is_compatible_clause. It needs to be split like this because
+ * of recursion.  The attnums bitmap is an input/output parameter collecting
+ * attribute numbers from all compatible clauses (recursively).
+ */
+static List *
+statext_extract_clause_internal(PlannerInfo *root, Node *clause, Index relid)
+{
+	List   *result = NIL;
+
+	/* Look inside any binary-compatible relabeling (as in examine_variable) */
+	if (IsA(clause, RelabelType))
+		clause = (Node *) ((RelabelType *) clause)->arg;
+
+	/* plain Var references (boolean Vars or recursive checks) */
+	if (IsA(clause, Var))
+	{
+		Var		   *var = (Var *) clause;
+
+		/* Ensure var is from the correct relation */
+		if (var->varno != relid)
+			return NIL;
+
+		/* we also better ensure the Var is from the current level */
+		if (var->varlevelsup > 0)
+			return NIL;
+
+		/* Also skip system attributes (we don't allow stats on those). */
+		if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+			return NIL;
+
+		// *attnums = bms_add_member(*attnums, var->varattno);
+
+		result = lappend(result, clause);
+
+		return result;
+	}
+
+	/* (Var op Const) or (Const op Var) */
+	if (is_opclause(clause))
+	{
+		RangeTblEntry *rte = root->simple_rte_array[relid];
+		OpExpr	   *expr = (OpExpr *) clause;
+		Var		   *var;
+		Var		   *var2 = NULL;
+
+		/* Only expressions with two arguments are considered compatible. */
+		if (list_length(expr->args) != 2)
+			return NIL;
+
+		/* Check if the expression the right shape (one Var, one Const) */
+		if ((!examine_opclause_expression(expr, &var, NULL, NULL)) &&
+			(!examine_opclause_expression2(expr, &var, &var2)))
+			return NIL;
+
+		/*
+		 * If it's not one of the supported operators ("=", "<", ">", etc.),
+		 * just ignore the clause, as it's not compatible with MCV lists.
+		 *
+		 * This uses the function for estimating selectivity, not the operator
+		 * directly (a bit awkward, but well ...).
+		 */
+		switch (get_oprrest(expr->opno))
+		{
+			case F_EQSEL:
+			case F_NEQSEL:
+			case F_SCALARLTSEL:
+			case F_SCALARLESEL:
+			case F_SCALARGTSEL:
+			case F_SCALARGESEL:
+				/* supported, will continue with inspection of the Var */
+				break;
+
+			default:
+				/* other estimators are considered unknown/unsupported */
+				return NIL;
+		}
+
+		/*
+		 * If there are any securityQuals on the RTE from security barrier
+		 * views or RLS policies, then the user may not have access to all the
+		 * table's data, and we must check that the operator is leak-proof.
+		 *
+		 * If the operator is leaky, then we must ignore this clause for the
+		 * purposes of estimating with MCV lists, otherwise the operator might
+		 * reveal values from the MCV list that the user doesn't have
+		 * permission to see.
+		 */
+		if (rte->securityQuals != NIL &&
+			!get_func_leakproof(get_opcode(expr->opno)))
+			return NIL;
+
+		result = lappend(result, var);
+
+		if (var2)
+			result = lappend(result, var2);
+
+		return result;
+	}
+
+	/* AND/OR/NOT clause */
+	if (is_andclause(clause) ||
+		is_orclause(clause) ||
+		is_notclause(clause))
+	{
+		/*
+		 * AND/OR/NOT-clauses are supported if all sub-clauses are supported
+		 *
+		 * Perhaps we could improve this by handling mixed cases, when some of
+		 * the clauses are supported and some are not. Selectivity for the
+		 * supported subclauses would be computed using extended statistics,
+		 * and the remaining clauses would be estimated using the traditional
+		 * algorithm (product of selectivities).
+		 *
+		 * It however seems overly complex, and in a way we already do that
+		 * because if we reject the whole clause as unsupported here, it will
+		 * be eventually passed to clauselist_selectivity() which does exactly
+		 * this (split into supported/unsupported clauses etc).
+		 */
+		BoolExpr   *expr = (BoolExpr *) clause;
+		ListCell   *lc;
+
+		foreach(lc, expr->args)
+		{
+			/*
+			 * Had we found incompatible clause in the arguments, treat the
+			 * whole clause as incompatible.
+			 */
+			if (!statext_extract_clause_internal(root,
+												 (Node *) lfirst(lc),
+												 relid))
+				return NIL;
+		}
+
+		return result;
+	}
+
+	/* Var IS NULL */
+	if (IsA(clause, NullTest))
+	{
+		NullTest   *nt = (NullTest *) clause;
+
+		/*
+		 * Only simple (Var IS NULL) expressions supported for now. Maybe we
+		 * could use examine_variable to fix this?
+		 */
+		if (!IsA(nt->arg, Var))
+			return false;
+
+		return statext_extract_clause_internal(root, (Node *) (nt->arg),
+											   relid);
+	}
+
+	return false;
+}
+
 /*
  * statext_is_compatible_clause
  *		Determines if the clause is compatible with MCV lists.
@@ -1184,6 +1569,51 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 	return true;
 }
 
+/*
+ * statext_extract_clause
+ *		Determines if the clause is compatible with MCV lists.
+ *
+ * Currently, we only support three types of clauses:
+ *
+ * (a) OpExprs of the form (Var op Const), or (Const op Var), where the op
+ * is one of ("=", "<", ">", ">=", "<=")
+ *
+ * (b) (Var IS [NOT] NULL)
+ *
+ * (c) combinations using AND/OR/NOT
+ *
+ * In the future, the range of supported clauses may be expanded to more
+ * complex cases, for example (Var op Var).
+ */
+static List *
+statext_extract_clause(PlannerInfo *root, Node *clause, Index relid)
+{
+	RestrictInfo *rinfo = (RestrictInfo *) clause;
+	List		 *exprs;
+
+	if (!IsA(rinfo, RestrictInfo))
+		return false;
+
+	/* Pseudoconstants are not really interesting here. */
+	if (rinfo->pseudoconstant)
+		return false;
+
+	/* clauses referencing multiple varnos are incompatible */
+	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
+		return false;
+
+	/* Check the clause and determine what attributes it references. */
+	exprs = statext_extract_clause_internal(root, (Node *) rinfo->clause, relid);
+
+	if (!exprs)
+		return NULL;
+
+	/* FIXME do the same ACL check as in statext_is_compatible_clause */
+
+	/* If we reach here, the clause is OK */
+	return exprs;
+}
+
 /*
  * statext_mcv_clauselist_selectivity
  *		Estimate clauses using the best multi-column statistics.
@@ -1246,7 +1676,8 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 								   bool is_or)
 {
 	ListCell   *l;
-	Bitmapset **list_attnums;
+	Bitmapset **list_attnums;	/* attnums extracted from the clause */
+	bool	   *exact_clauses;	/* covered as-is by at least one statistic */
 	int			listidx;
 	Selectivity	sel = 1.0;
 
@@ -1257,6 +1688,8 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
 
+	exact_clauses = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
 	/*
 	 * Pre-process the clauses list to extract the attnums seen in each item.
 	 * We need to determine if there's any clauses which will be useful for
@@ -1274,11 +1707,76 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		Node	   *clause = (Node *) lfirst(l);
 		Bitmapset  *attnums = NULL;
 
+		/* the clause is considered incompatible by default */
+		list_attnums[listidx] = NULL;
+
+		/* and it's also not covered exactly by the statistic */
+		exact_clauses[listidx] = false;
+
+		/*
+		 * First see if the clause is simple enough to be covered directly
+		 * by the attributes. If not, see if there's at least one statistic
+		 * object using the expression as-is.
+		 */
 		if (!bms_is_member(listidx, *estimatedclauses) &&
 			statext_is_compatible_clause(root, clause, rel->relid, &attnums))
+			/* simple expression, covered through attnum(s) */
 			list_attnums[listidx] = attnums;
 		else
-			list_attnums[listidx] = NULL;
+		{
+			ListCell   *lc;
+
+			List *exprs = statext_extract_clause(root, clause, rel->relid);
+
+			/* complex expression, search for statistic */
+			foreach(lc, rel->statlist)
+			{
+				ListCell		   *lc2;
+				StatisticExtInfo   *info = (StatisticExtInfo *) lfirst(lc);
+				bool				all_found = true;
+
+				/* have we already found all expressions in a statistic? */
+				Assert(!exact_clauses[listidx]);
+
+				/* no expressions */
+				if (!info->exprs)
+					continue;
+
+				foreach (lc2, exprs)
+				{
+					Node   *expr = (Node *) lfirst(lc2);
+
+					/*
+					 * Walk the expressions, see if all expressions extracted from
+					 * the clause are covered by the extended statistic object.
+					 */
+					foreach (lc2, info->exprs)
+					{
+						Node   *stat_expr = (Node *) lfirst(lc2);
+						bool	expr_found = false;
+
+						if (equal(expr, stat_expr))
+						{
+							expr_found = true;
+							break;
+						}
+
+						if (!expr_found)
+						{
+							all_found = false;
+							break;
+						}
+					}
+				}
+
+				/* stop looking for another statistic */
+				if (all_found)
+				{
+					exact_clauses[listidx] = true;
+					break;
+				}
+			}
+		}
 
 		listidx++;
 	}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 97d3083451..9334504714 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -180,7 +180,9 @@ get_mincount_for_mcv_list(int samplerows, double totalrows)
  *
  */
 MCVList *
-statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
+statext_mcv_build(int numrows, HeapTuple *rows,
+				  Datum *exprvals, bool *exprnulls,
+				  Bitmapset *attrs, List *exprs,
 				  VacAttrStats **stats, double totalrows, int stattarget)
 {
 	int			i,
@@ -194,13 +196,23 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	MCVList    *mcvlist = NULL;
 	MultiSortSupport mss;
 
+	/*
+	 * Copy the bitmapset and add fake attnums representing expressions,
+	 * starting above MaxHeapAttributeNumber.
+	 */
+	attrs = bms_copy(attrs);
+
+	for (i = 1; i <= list_length(exprs); i++)
+		attrs = bms_add_member(attrs, MaxHeapAttributeNumber + i);
+
 	attnums = build_attnums_array(attrs, &numattrs);
 
 	/* comparator for all the columns */
 	mss = build_mss(stats, numattrs);
 
 	/* sort the rows */
-	items = build_sorted_items(numrows, &nitems, rows, stats[0]->tupDesc,
+	items = build_sorted_items(numrows, &nitems, rows, exprvals, exprnulls,
+							   list_length(exprs), stats[0]->tupDesc,
 							   mss, numattrs, attnums);
 
 	if (!items)
@@ -337,6 +349,7 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 
 	pfree(items);
 	pfree(groups);
+	pfree(attrs);
 
 	return mcvlist;
 }
diff --git a/src/backend/statistics/mvdistinct.c b/src/backend/statistics/mvdistinct.c
index 977d6f3e2e..dd874c7a04 100644
--- a/src/backend/statistics/mvdistinct.c
+++ b/src/backend/statistics/mvdistinct.c
@@ -37,8 +37,10 @@
 #include "utils/typcache.h"
 
 static double ndistinct_for_combination(double totalrows, int numrows,
-										HeapTuple *rows, VacAttrStats **stats,
-										int k, int *combination);
+										HeapTuple *rows, Datum *exprvals,
+										bool *exprnulls, int nexprs,
+										VacAttrStats **stats, int k,
+										int *combination);
 static double estimate_ndistinct(double totalrows, int numrows, int d, int f1);
 static int	n_choose_k(int n, int k);
 static int	num_combinations(int n);
@@ -84,14 +86,26 @@ static void generate_combinations(CombinationGenerator *state);
  */
 MVNDistinct *
 statext_ndistinct_build(double totalrows, int numrows, HeapTuple *rows,
-						Bitmapset *attrs, VacAttrStats **stats)
+						Datum *exprvals, bool *exprnulls,
+						Bitmapset *attrs, List *exprs,
+						VacAttrStats **stats)
 {
 	MVNDistinct *result;
+	int			i;
 	int			k;
 	int			itemcnt;
-	int			numattrs = bms_num_members(attrs);
+	int			numattrs = bms_num_members(attrs) + list_length(exprs);
 	int			numcombs = num_combinations(numattrs);
 
+	/*
+	 * Copy the bitmapset and add fake attnums representing expressions,
+	 * starting above MaxHeapAttributeNumber.
+	 */
+	attrs = bms_copy(attrs);
+
+	for (i = 1; i <= list_length(exprs); i++)
+		attrs = bms_add_member(attrs, MaxHeapAttributeNumber + i);
+
 	result = palloc(offsetof(MVNDistinct, items) +
 					numcombs * sizeof(MVNDistinctItem));
 	result->magic = STATS_NDISTINCT_MAGIC;
@@ -114,10 +128,18 @@ statext_ndistinct_build(double totalrows, int numrows, HeapTuple *rows,
 
 			item->attrs = NULL;
 			for (j = 0; j < k; j++)
-				item->attrs = bms_add_member(item->attrs,
-											 stats[combination[j]]->attr->attnum);
+			{
+				if (combination[j] <= MaxHeapAttributeNumber)
+					item->attrs = bms_add_member(item->attrs,
+												 stats[combination[j]]->attr->attnum);
+				else
+					item->attrs = bms_add_member(item->attrs, combination[j]);
+			}
+
 			item->ndistinct =
 				ndistinct_for_combination(totalrows, numrows, rows,
+										  exprvals, exprnulls,
+										  list_length(exprs),
 										  stats, k, combination);
 
 			itemcnt++;
@@ -428,6 +450,7 @@ pg_ndistinct_send(PG_FUNCTION_ARGS)
  */
 static double
 ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
+						  Datum *exprvals, bool *exprnulls, int nexprs,
 						  VacAttrStats **stats, int k, int *combination)
 {
 	int			i,
@@ -481,11 +504,17 @@ ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
 		/* accumulate all the data for this dimension into the arrays */
 		for (j = 0; j < numrows; j++)
 		{
-			items[j].values[i] =
-				heap_getattr(rows[j],
-							 colstat->attr->attnum,
-							 colstat->tupDesc,
-							 &items[j].isnull[i]);
+			if (combination[i] <= MaxHeapAttributeNumber)
+				items[j].values[i] =
+					heap_getattr(rows[j],
+								 colstat->attr->attnum,
+								 colstat->tupDesc,
+								 &items[j].isnull[i]);
+			else
+			{
+				items[j].values[i] = exprvals[j * nexprs + combination[i] - MaxHeapAttributeNumber - 1];
+				items[j].isnull[i] = exprnulls[j * nexprs + combination[i] - MaxHeapAttributeNumber - 1];
+			}
 		}
 	}
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1b460a2612..8c36f516e1 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1781,7 +1781,21 @@ ProcessUtilitySlow(ParseState *pstate,
 				break;
 
 			case T_CreateStatsStmt:
-				address = CreateStatistics((CreateStatsStmt *) parsetree);
+				{
+					Oid			relid;
+					CreateStatsStmt *stmt = (CreateStatsStmt *) parsetree;
+					RangeVar   *rel = (RangeVar *) linitial(stmt->relations);
+
+					relid = RangeVarGetRelidExtended(rel, ShareLock,
+												 0,
+												 RangeVarCallbackOwnsRelation,
+												 NULL);
+
+					/* Run parse analysis ... */
+					stmt = transformStatsStmt(relid, stmt, queryString);
+
+					address = CreateStatistics(stmt);
+				}
 				break;
 
 			case T_AlterStatsStmt:
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5e63238f03..e811a54667 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -1524,6 +1524,9 @@ pg_get_statisticsobj_worker(Oid statextid, bool missing_ok)
 	bool		dependencies_enabled;
 	bool		mcv_enabled;
 	int			i;
+	List	   *context;
+	ListCell   *lc;
+	List	   *exprs = NIL;
 
 	statexttup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statextid));
 
@@ -1616,6 +1619,62 @@ pg_get_statisticsobj_worker(Oid statextid, bool missing_ok)
 		appendStringInfoString(&buf, quote_identifier(attname));
 	}
 
+	/* deparse expressions */
+
+	{
+			bool		isnull;
+			Datum		datum;
+
+			/* decode expression (if any) */
+			datum = SysCacheGetAttr(STATEXTOID, statexttup,
+									Anum_pg_statistic_ext_stxexprs, &isnull);
+
+			if (!isnull)
+			{
+				char *exprsString;
+
+				exprsString = TextDatumGetCString(datum);
+				exprs = (List *) stringToNode(exprsString);
+				pfree(exprsString);
+
+				/*
+				 * Run the expressions through eval_const_expressions. This is not just an
+				 * optimization, but is necessary, because the planner will be comparing
+				 * them to similarly-processed qual clauses, and may fail to detect valid
+				 * matches without this.  We must not use canonicalize_qual, however,
+				 * since these aren't qual expressions.
+				 */
+				exprs = (List *) eval_const_expressions(NULL, (Node *) exprs);
+
+				/* May as well fix opfuncids too */
+				fix_opfuncids((Node *) exprs);
+			}
+	}
+
+	context = deparse_context_for(get_relation_name(statextrec->stxrelid),
+								  statextrec->stxrelid);
+
+	foreach (lc, exprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+		char	   *str;
+		int			prettyFlags = PRETTYFLAG_INDENT;
+
+		str = deparse_expression_pretty(expr, context, false, false,
+										prettyFlags, 0);
+
+		if (colno > 0)
+			appendStringInfoString(&buf, ", ");
+
+		/* Need parens if it's not a bare function call */
+		if (looks_like_function(expr))
+			appendStringInfoString(&buf, str);
+		else
+			appendStringInfo(&buf, "(%s)", str);
+
+		colno++;
+	}
+
 	appendStringInfo(&buf, " FROM %s",
 					 generate_relation_name(statextrec->stxrelid, NIL));
 
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0be26fe037..7574a5395a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3082,6 +3082,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		double		this_srf_multiplier;
 		VariableStatData vardata;
 		List	   *varshere;
+		Relids		varnos;
 		ListCell   *l2;
 
 		/* is expression in this grouping set? */
@@ -3149,6 +3150,16 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			continue;
 		}
 
+		/*
+		 * Are all the variables from the same relation? If yes, search for
+		 * an extended statistic matching this expression exactly.
+		 */
+		varnos = pull_varnos((Node *) varshere);
+		if (bms_membership(varnos) == BMS_SINGLETON)
+		{
+			// FIXME try to match it to expressions in mvdistinct stats
+		}
+
 		/*
 		 * Else add variables to varinfos list
 		 */
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index f3c7eb96fa..92c2deb1ba 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2671,6 +2671,7 @@ describeOneTableDetails(const char *schemaname,
 		/* print any extended statistics */
 		if (pset.sversion >= 100000)
 		{
+			/* FIXME improve this with printing expressions the statistics is defined on */
 			printfPQExpBuffer(&buf,
 							  "SELECT oid, "
 							  "stxrelid::pg_catalog.regclass, "
diff --git a/src/include/catalog/pg_statistic_ext.h b/src/include/catalog/pg_statistic_ext.h
index e9491a0a87..dd0f41cd14 100644
--- a/src/include/catalog/pg_statistic_ext.h
+++ b/src/include/catalog/pg_statistic_ext.h
@@ -52,6 +52,9 @@ CATALOG(pg_statistic_ext,3381,StatisticExtRelationId)
 #ifdef CATALOG_VARLEN
 	char		stxkind[1] BKI_FORCE_NOT_NULL;	/* statistics kinds requested
 												 * to build */
+	pg_node_tree stxexprs;		/* expression trees for stats attributes that
+								 * are not simple column references; one for
+								 * each zero entry in stxkeys[] */
 #endif
 
 } FormData_pg_statistic_ext;
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index baced7eec0..72f6534ceb 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -448,6 +448,7 @@ typedef enum NodeTag
 	T_TypeName,
 	T_ColumnDef,
 	T_IndexElem,
+	T_StatsElem,
 	T_Constraint,
 	T_DefElem,
 	T_RangeTblEntry,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index da0706add5..74e5a855ca 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2807,8 +2807,24 @@ typedef struct CreateStatsStmt
 	List	   *relations;		/* rels to build stats on (list of RangeVar) */
 	char	   *stxcomment;		/* comment to apply to stats, or NULL */
 	bool		if_not_exists;	/* do nothing if stats name already exists */
+	bool		transformed;	/* true when transformIndexStmt is finished */
 } CreateStatsStmt;
 
+/*
+ * StatsElem - statistics parameters (used in CREATE STATISTICS)
+ *
+ * For a plain attribute, 'name' is the name of the referenced table column
+ * and 'expr' is NULL.  For an expression, 'name' is NULL and 'expr' is the
+ * expression tree.
+ */
+typedef struct StatsElem
+{
+	NodeTag		type;
+	char	   *name;			/* name of attribute to index, or NULL */
+	Node	   *expr;			/* expression to index, or NULL */
+} StatsElem;
+
+
 /* ----------------------
  *		Alter Statistics Statement
  * ----------------------
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..7e9aeb409b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -885,6 +885,7 @@ typedef struct StatisticExtInfo
 	RelOptInfo *rel;			/* back-link to statistic's table */
 	char		kind;			/* statistic kind of this entry */
 	Bitmapset  *keys;			/* attnums of the columns covered */
+	List	   *exprs;			/* expressions */
 } StatisticExtInfo;
 
 /*
diff --git a/src/include/parser/parse_node.h b/src/include/parser/parse_node.h
index d25819aa28..82e5190964 100644
--- a/src/include/parser/parse_node.h
+++ b/src/include/parser/parse_node.h
@@ -69,6 +69,7 @@ typedef enum ParseExprKind
 	EXPR_KIND_FUNCTION_DEFAULT, /* default parameter value for function */
 	EXPR_KIND_INDEX_EXPRESSION, /* index expression */
 	EXPR_KIND_INDEX_PREDICATE,	/* index predicate */
+	EXPR_KIND_STATS_EXPRESSION, /* extended statistics expression */
 	EXPR_KIND_ALTER_COL_TRANSFORM,	/* transform expr in ALTER COLUMN TYPE */
 	EXPR_KIND_EXECUTE_PARAMETER,	/* parameter value in EXECUTE */
 	EXPR_KIND_TRIGGER_WHEN,		/* WHEN condition in CREATE TRIGGER */
diff --git a/src/include/parser/parse_utilcmd.h b/src/include/parser/parse_utilcmd.h
index 1a5e0b83a7..43247186b0 100644
--- a/src/include/parser/parse_utilcmd.h
+++ b/src/include/parser/parse_utilcmd.h
@@ -26,6 +26,8 @@ extern AlterTableStmt *transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
 											   List **afterStmts);
 extern IndexStmt *transformIndexStmt(Oid relid, IndexStmt *stmt,
 									 const char *queryString);
+extern CreateStatsStmt *transformStatsStmt(Oid relid, CreateStatsStmt *stmt,
+									 const char *queryString);
 extern void transformRuleStmt(RuleStmt *stmt, const char *queryString,
 							  List **actions, Node **whereClause);
 extern List *transformCreateSchemaStmt(CreateSchemaStmt *stmt);
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 804089bc57..b159ea0313 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -59,17 +59,23 @@ typedef struct SortItem
 
 extern MVNDistinct *statext_ndistinct_build(double totalrows,
 											int numrows, HeapTuple *rows,
-											Bitmapset *attrs, VacAttrStats **stats);
+											Datum *exprvals, bool *exprnulls,
+											Bitmapset *attrs, List *exprs,
+											VacAttrStats **stats);
 extern bytea *statext_ndistinct_serialize(MVNDistinct *ndistinct);
 extern MVNDistinct *statext_ndistinct_deserialize(bytea *data);
 
 extern MVDependencies *statext_dependencies_build(int numrows, HeapTuple *rows,
-												  Bitmapset *attrs, VacAttrStats **stats);
+												  Datum *exprvals, bool *exprnulls,
+												  Bitmapset *attrs, List *exprs,
+												  VacAttrStats **stats);
 extern bytea *statext_dependencies_serialize(MVDependencies *dependencies);
 extern MVDependencies *statext_dependencies_deserialize(bytea *data);
 
 extern MCVList *statext_mcv_build(int numrows, HeapTuple *rows,
-								  Bitmapset *attrs, VacAttrStats **stats,
+								  Datum *exprvals, bool *exprnulls,
+								  Bitmapset *attrs, List *exprs,
+								  VacAttrStats **stats,
 								  double totalrows, int stattarget);
 extern bytea *statext_mcv_serialize(MCVList *mcv, VacAttrStats **stats);
 extern MCVList *statext_mcv_deserialize(bytea *data);
@@ -93,6 +99,7 @@ extern void *bsearch_arg(const void *key, const void *base,
 extern AttrNumber *build_attnums_array(Bitmapset *attrs, int *numattrs);
 
 extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
+									Datum *exprvals, bool *exprnulls, int nexprs,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-- 
2.21.1

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#3)

2 attachment(s)

Re: Additional improvements to extended statistics

On Fri, Mar 06, 2020 at 01:15:56AM +0100, Tomas Vondra wrote:

Hi,

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses, and
added a bunch of regression tests to exercise this code. It's not quite
there yet, but I think it's feasible to get this committed for PG13.

The last part (extended stats on expressions) is far from complete, and
it's not feasible to get it into PG13. There's too much missing stuff.

Meh, the last part with stats on expression is not quite right and it
breaks the cputube tester, so here are the first two parts only. I don't
plan to pursue the 0003 part for PG13 anyway, as mentioned.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Support-using-extended-stats-for-parts-of-O-20200306.patchtext/plain; charset=us-asciiDownload

From d7f639b6150fe9fd179066af2a536465d877842a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 2 Dec 2019 23:02:17 +0100
Subject: [PATCH 1/3] Support using extended stats for parts of OR clauses

---
 src/backend/optimizer/path/clausesel.c        | 109 +++++++++++++++---
 src/backend/statistics/extended_stats.c       |  45 +++++++-
 src/backend/statistics/mcv.c                  |   5 +-
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       |   3 +-
 src/test/regress/sql/stats_ext.sql            |   1 -
 7 files changed, 138 insertions(+), 31 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..8c1a404ce2 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,89 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	ListCell   *lc;
+	Selectivity	s1 = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+	int			listidx;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to s1 directly.
+		 */
+		s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+											jointype, sjinfo, rel,
+											&estimatedclauses, true);
+	}
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s1 = s1 + s2 - s1 * s2;
+	}
+
+	return s1;
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -735,24 +818,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 03e69d057f..24ece6f99c 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1225,7 +1225,8 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
@@ -1317,8 +1318,32 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed from
 		 * MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												jointype, sjinfo, NULL);
+		if (is_or)
+		{
+			ListCell   *lc;
+			Selectivity	s1 = 0.0,
+						s2;
+
+			/*
+			 * Selectivities of OR clauses are computed s1+s2 - s1*s2 to account
+			 * for the probable overlap of selected tuple sets.
+			 */
+			foreach(lc, stat_clauses)
+			{
+				s2 = clause_selectivity(root,
+										(Node *) lfirst(lc),
+										varRelid,
+										jointype,
+										sjinfo);
+
+				s1 = s1 + s2 - s1 * s2;
+			}
+
+			simple_sel = s1;
+		}
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with the
@@ -1326,7 +1351,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1354,13 +1379,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 87e232fdd4..3f42713aa2 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1795,7 +1795,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1808,7 +1809,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index b512ee908a..5171895bba 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index f5d9b6c73a..e18c9a6539 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 61237dfb11..5344b70cf4 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -648,11 +648,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
  estimated | actual 
 -----------+--------
-       343 |    200
+       200 |    200
 (1 row)
 
 -- check change of unrelated column type does not reset the MCV statistics
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 84f13e8814..fa989fccb0 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -400,7 +400,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
 -- check change of unrelated column type does not reset the MCV statistics
-- 
2.21.1

0002-Support-clauses-of-the-form-Var-op-Var-20200306.patchtext/plain; charset=us-asciiDownload

From af5921a73a71a8c6adf454c35e2b8e911c94cee7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Mon, 11 Nov 2019 01:34:11 +0100
Subject: [PATCH 2/3] Support clauses of the form Var op Var

---
 src/backend/statistics/extended_stats.c       | 63 ++++++++++++----
 src/backend/statistics/mcv.c                  | 75 ++++++++++++++++++-
 .../statistics/extended_stats_internal.h      |  2 +-
 src/test/regress/expected/stats_ext.out       | 72 ++++++++++++++++++
 src/test/regress/sql/stats_ext.sql            | 22 ++++++
 5 files changed, 217 insertions(+), 17 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 24ece6f99c..1872cd4529 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -986,14 +986,18 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 	{
 		RangeTblEntry *rte = root->simple_rte_array[relid];
 		OpExpr	   *expr = (OpExpr *) clause;
-		Var		   *var;
+		Var		   *var,
+				   *var2;
 
 		/* Only expressions with two arguments are considered compatible. */
 		if (list_length(expr->args) != 2)
 			return false;
 
-		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_opclause_expression(expr, &var, NULL, NULL))
+		/*
+		 * Check if the expression the right shape (one Var and one Const,
+		 * or two Vars).
+		 */
+		if (!examine_opclause_expression(expr, &var, &var2, NULL, NULL))
 			return false;
 
 		/*
@@ -1033,7 +1037,20 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			!get_func_leakproof(get_opcode(expr->opno)))
 			return false;
 
-		return statext_is_compatible_clause_internal(root, (Node *) var,
+		/*
+		 * Check compatibility of the first Var - we get this one for both
+		 * types of supported expressions (Var op Const) and (Var op Var).
+		 */
+		if (!statext_is_compatible_clause_internal(root, (Node *) var,
+												   relid, attnums))
+			return false;
+
+		/* For (Var op Const) we don't get the second Var, and we're done. */
+		if (!var2)
+			return true;
+
+		/* For (Var op Var) check compatibility of the second Var. */
+		return statext_is_compatible_clause_internal(root, (Node *) var2,
 													 relid, attnums);
 	}
 
@@ -1419,19 +1436,21 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
  * examine_opclause_expression
  *		Split expression into Var and Const parts.
  *
- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.
  *
  * Optionally returns pointers to the extracted Var/Const nodes, when passed
  * non-null pointers (varp, cstp and varonleftp). The varonleftp flag specifies
  * on which side of the operator we found the Var node.
  */
 bool
-examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonleftp)
+examine_opclause_expression(OpExpr *expr, Var **var1p, Var **var2p,
+							Const **cstp, bool *varonleftp)
 {
-	Var	   *var;
-	Const  *cst;
+	Var	   *var1 = NULL;
+	Var	   *var2 = NULL;
+	Const  *cst = NULL;
 	bool	varonleft;
 	Node   *leftop,
 		   *rightop;
@@ -1451,22 +1470,38 @@ examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonl
 
 	if (IsA(leftop, Var) && IsA(rightop, Const))
 	{
-		var = (Var *) leftop;
+		var1 = (Var *) leftop;
 		cst = (Const *) rightop;
 		varonleft = true;
 	}
 	else if (IsA(leftop, Const) && IsA(rightop, Var))
 	{
-		var = (Var *) rightop;
+		var1 = (Var *) rightop;
 		cst = (Const *) leftop;
 		varonleft = false;
 	}
+	else if (IsA(leftop, Var) && IsA(rightop, Var))
+	{
+		var1 = (Var *) leftop;
+		var2 = (Var *) rightop;
+		varonleft = false;
+
+		/*
+		 * Both variables have to be for the same relation (otherwise it's
+		 * a join clause, and we don't deal with those yet.
+		 */
+		if (var1->varno != var2->varno)
+			return false;
+	}
 	else
 		return false;
 
 	/* return pointers to the extracted parts if requested */
-	if (varp)
-		*varp = var;
+	if (var1p)
+		*var1p = var1;
+
+	if (var2p)
+		*var2p = var2;
 
 	if (cstp)
 		*cstp = cst;
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3f42713aa2..97d3083451 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1581,16 +1581,25 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 			/* valid only after examine_opclause_expression returns true */
 			Var		   *var;
+			Var		   *var2;
 			Const	   *cst;
 			bool		varonleft;
 
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
-			/* extract the var and const from the expression */
-			if (examine_opclause_expression(expr, &var, &cst, &varonleft))
+			/* extract the vars and const from the expression */
+			if (!examine_opclause_expression(expr, &var, &var2, &cst, &varonleft))
+				continue;	/* XXX Can this actually happen? */
+
+			/* We should always get at least one Var. */
+			Assert(var);
+
+			if (cst)
 			{
 				int			idx;
 
+				Assert(!var2);
+
 				/* match the attribute to a dimension of the statistic */
 				idx = bms_member_index(keys, var->varattno);
 
@@ -1651,6 +1660,68 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					matches[i] = RESULT_MERGE(matches[i], is_or, match);
 				}
 			}
+			else
+			{
+				int			idx;
+				int			idx2;
+
+				Assert(var2);
+
+				/* match the attribute to a dimension of the statistic */
+				idx = bms_member_index(keys, var->varattno);
+				idx2 = bms_member_index(keys, var2->varattno);
+
+				/*
+				 * Walk through the MCV items and evaluate the current clause.
+				 * We can skip items that were already ruled out, and
+				 * terminate if there are no remaining MCV items that might
+				 * possibly match.
+				 */
+				for (i = 0; i < mcvlist->nitems; i++)
+				{
+					bool		match = true;
+					MCVItem    *item = &mcvlist->items[i];
+
+					/*
+					 * When either of the MCV items is NULL we can treat this
+					 * as a mismatch. We must not call the operator because
+					 * of strictness.
+					 */
+					if (item->isnull[idx] || item->isnull[idx2])
+					{
+						matches[i] = RESULT_MERGE(matches[i], is_or, false);
+						continue;
+					}
+
+					/*
+					 * Skip MCV items that can't change result in the bitmap.
+					 * Once the value gets false for AND-lists, or true for
+					 * OR-lists, we don't need to look at more clauses.
+					 */
+					if (RESULT_IS_FINAL(matches[i], is_or))
+						continue;
+
+					/*
+					 * First check whether the constant is below the lower
+					 * boundary (in that case we can skip the bucket, because
+					 * there's no overlap).
+					 *
+					 * We don't store collations used to build the statistics,
+					 * but we can use the collation for the attribute itself,
+					 * as stored in varcollid. We do reset the statistics after
+					 * a type change (including collation change), so this is
+					 * OK. We may need to relax this after allowing extended
+					 * statistics on expressions.
+					 */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   var->varcollid,
+														   item->values[idx],
+														   item->values[idx2]));
+
+					/* update the match bitmap with the result */
+					matches[i] = RESULT_MERGE(matches[i], is_or, match);
+				}
+			}
 		}
 		else if (IsA(clause, NullTest))
 		{
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 5171895bba..804089bc57 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -96,7 +96,7 @@ extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-extern bool examine_opclause_expression(OpExpr *expr, Var **varp,
+extern bool examine_opclause_expression(OpExpr *expr, Var **var1p, Var **var2p,
 										Const **cstp, bool *varonleftp);
 
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 5344b70cf4..4c078ae61f 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -603,6 +603,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        343 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      1667 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+      1667 |      0
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -654,6 +666,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      3750 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -749,6 +773,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d
       3750 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+        25 |   2500
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON b, d FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -758,6 +788,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d
       2500 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- mcv with arrays
 CREATE TABLE mcv_lists_arrays (
     a TEXT[],
@@ -808,6 +844,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
       1094 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      9950 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+        50 |   2500
+(1 row)
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 ANALYZE mcv_lists_bool;
@@ -835,6 +883,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -869,6 +929,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AN
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+         1 |   5000
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -891,6 +957,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AN
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+      5000 |   5000
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index fa989fccb0..b7519b275b 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -381,6 +381,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 
@@ -402,6 +406,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -473,6 +481,8 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON b, d FROM mcv_lists;
 
@@ -480,6 +490,8 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- mcv with arrays
 CREATE TABLE mcv_lists_arrays (
     a TEXT[],
@@ -521,6 +533,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 
@@ -534,6 +550,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -556,6 +576,7 @@ ANALYZE mcv_lists_multi;
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -566,6 +587,7 @@ ANALYZE mcv_lists_multi;
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#4)

Re: Additional improvements to extended statistics

On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses.

Hi,

I've been looking over the first patch (OR list support). It mostly
looks reasonable to me, except there's a problem with the way
statext_mcv_clauselist_selectivity() combines multiple stat_sel values
into the final result -- in the OR case, it needs to start with sel =
0, and then apply the OR formula to factor in each new estimate. I.e.,
this isn't right for an OR list:

/* Factor the estimate from this MCV to the oveall estimate. */
sel *= stat_sel;

(Oh and there's a typo in that comment: s/oveall/overall/).

For example, with the regression test data, this isn't estimated well:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;

Similarly, if no extended stats can be applied it needs to return 0
not 1, for example this query on the test data:

SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;

It might also be worth adding a couple more regression test cases like these.

Regards,
Dean

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#5)

Re: Additional improvements to extended statistics

On Sun, Mar 08, 2020 at 07:17:10PM +0000, Dean Rasheed wrote:

On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses.

Hi,

I've been looking over the first patch (OR list support). It mostly
looks reasonable to me, except there's a problem with the way
statext_mcv_clauselist_selectivity() combines multiple stat_sel values
into the final result -- in the OR case, it needs to start with sel =
0, and then apply the OR formula to factor in each new estimate. I.e.,
this isn't right for an OR list:

/* Factor the estimate from this MCV to the oveall estimate. */
sel *= stat_sel;

(Oh and there's a typo in that comment: s/oveall/overall/).

For example, with the regression test data, this isn't estimated well:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;

Similarly, if no extended stats can be applied it needs to return 0
not 1, for example this query on the test data:

SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;

Ah, right. Thanks for noticing this. Attaches is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

It might also be worth adding a couple more regression test cases like these.

Agreed, 0002 adds a couple of relevant tests.

Incidentally, I've been working on improving test coverage for extended
stats over the past few days (it has ~80% lines covered, which is not
bad nor great). I haven't submitted that to hackers yet, because it's
mostly mechanical and it's would interfere with the two existing threads
about extended stats ...

Speaking of which, would you take a look at [1]/messages/by-id/13902317.Eha0YfKkKy@pierred-pdoc? I think supporting SAOP
is fine, but I wonder if you agree with my conclusion we can't really
support inclusion @> as explained in [2]/messages/by-id/20200202184134.swoqkqlqorqolrqv@development.

[1]: /messages/by-id/13902317.Eha0YfKkKy@pierred-pdoc
[2]: /messages/by-id/20200202184134.swoqkqlqorqolrqv@development

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#6)

4 attachment(s)

Re: Additional improvements to extended statistics

On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:

On Sun, Mar 08, 2020 at 07:17:10PM +0000, Dean Rasheed wrote:

On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses.

Hi,

I've been looking over the first patch (OR list support). It mostly
looks reasonable to me, except there's a problem with the way
statext_mcv_clauselist_selectivity() combines multiple stat_sel values
into the final result -- in the OR case, it needs to start with sel =
0, and then apply the OR formula to factor in each new estimate. I.e.,
this isn't right for an OR list:

/* Factor the estimate from this MCV to the oveall estimate. */
sel *= stat_sel;

(Oh and there's a typo in that comment: s/oveall/overall/).

For example, with the regression test data, this isn't estimated well:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;

Similarly, if no extended stats can be applied it needs to return 0
not 1, for example this query on the test data:

SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;

Ah, right. Thanks for noticing this. Attaches is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

One day I won't forget to actually attach the files ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Support-using-extended-stats-for-parts-of-O-20200309.patchtext/plain; charset=us-asciiDownload

From 4a0807b7e8ec511d72361598c390c158dc76e1a0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 8 Mar 2020 23:26:50 +0100
Subject: [PATCH 1/4] Support using extended stats for parts of OR clauses

---
 src/backend/optimizer/path/clausesel.c        | 109 +++++++++++++++---
 src/backend/statistics/extended_stats.c       |  45 +++++++-
 src/backend/statistics/mcv.c                  |   5 +-
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       |   3 +-
 src/test/regress/sql/stats_ext.sql            |   1 -
 7 files changed, 138 insertions(+), 31 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..8c1a404ce2 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,89 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	ListCell   *lc;
+	Selectivity	s1 = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+	int			listidx;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to s1 directly.
+		 */
+		s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+											jointype, sjinfo, rel,
+											&estimatedclauses, true);
+	}
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s1 = s1 + s2 - s1 * s2;
+	}
+
+	return s1;
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -735,24 +818,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 03e69d057f..24ece6f99c 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1225,7 +1225,8 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
@@ -1317,8 +1318,32 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed from
 		 * MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												jointype, sjinfo, NULL);
+		if (is_or)
+		{
+			ListCell   *lc;
+			Selectivity	s1 = 0.0,
+						s2;
+
+			/*
+			 * Selectivities of OR clauses are computed s1+s2 - s1*s2 to account
+			 * for the probable overlap of selected tuple sets.
+			 */
+			foreach(lc, stat_clauses)
+			{
+				s2 = clause_selectivity(root,
+										(Node *) lfirst(lc),
+										varRelid,
+										jointype,
+										sjinfo);
+
+				s1 = s1 + s2 - s1 * s2;
+			}
+
+			simple_sel = s1;
+		}
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with the
@@ -1326,7 +1351,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1354,13 +1379,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 87e232fdd4..3f42713aa2 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1795,7 +1795,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1808,7 +1809,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index b512ee908a..5171895bba 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index f5d9b6c73a..e18c9a6539 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 61237dfb11..5344b70cf4 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -648,11 +648,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
  estimated | actual 
 -----------+--------
-       343 |    200
+       200 |    200
 (1 row)
 
 -- check change of unrelated column type does not reset the MCV statistics
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 84f13e8814..fa989fccb0 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -400,7 +400,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
 -- check change of unrelated column type does not reset the MCV statistics
-- 
2.21.1

0002-Fix-Add-regression-tests-for-OR-clauses-20200309.patchtext/plain; charset=us-asciiDownload

From e41b0b10f432cafce77928f88d97a8b8d7acef71 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Mon, 9 Mar 2020 00:36:49 +0100
Subject: [PATCH 2/4] Fix: Add regression tests for OR clauses

---
 src/test/regress/expected/stats_ext.out | 48 +++++++++++++++++++++++++
 src/test/regress/sql/stats_ext.sql      |  8 +++++
 2 files changed, 56 insertions(+)

diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 5344b70cf4..383465e8cb 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -863,12 +863,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -885,12 +909,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index fa989fccb0..6fd8b016bd 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -555,7 +555,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -565,7 +569,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

0003-Fix-Calculation-of-OR-selectivities-20200309.patchtext/plain; charset=us-asciiDownload

From 7443356dc556eed04bd2f61679a56001936938ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 8 Mar 2020 23:27:08 +0100
Subject: [PATCH 3/4] Fix: Calculation of OR selectivities

---
 src/backend/statistics/extended_stats.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 24ece6f99c..daf95ff437 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1231,11 +1231,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity	sel = 1.0;
+	Selectivity	sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1365,8 +1365,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		stat_sel = mcv_sel + other_sel;
 		CLAMP_PROBABILITY(stat_sel);
 
-		/* Factor the estimate from this MCV to the oveall estimate. */
-		sel *= stat_sel;
+		/* Factor the estimate from this MCV to the overall estimate. */
+		if (is_or)
+			sel = sel + stat_sel - sel * stat_sel;
+		else
+			sel *= stat_sel;
 	}
 
 	return sel;
-- 
2.21.1

0004-Support-clauses-of-the-form-Var-op-Var-20200309.patchtext/plain; charset=us-asciiDownload

From fdd1b9d55a24e7b7f89a6bb946830d08af7e736b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 8 Mar 2020 23:27:30 +0100
Subject: [PATCH 4/4] Support clauses of the form Var op Var

---
 src/backend/statistics/extended_stats.c       | 63 ++++++++++++----
 src/backend/statistics/mcv.c                  | 75 ++++++++++++++++++-
 .../statistics/extended_stats_internal.h      |  2 +-
 src/test/regress/expected/stats_ext.out       | 72 ++++++++++++++++++
 src/test/regress/sql/stats_ext.sql            | 22 ++++++
 5 files changed, 217 insertions(+), 17 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index daf95ff437..ddcb5d2955 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -986,14 +986,18 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 	{
 		RangeTblEntry *rte = root->simple_rte_array[relid];
 		OpExpr	   *expr = (OpExpr *) clause;
-		Var		   *var;
+		Var		   *var,
+				   *var2;
 
 		/* Only expressions with two arguments are considered compatible. */
 		if (list_length(expr->args) != 2)
 			return false;
 
-		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_opclause_expression(expr, &var, NULL, NULL))
+		/*
+		 * Check if the expression the right shape (one Var and one Const,
+		 * or two Vars).
+		 */
+		if (!examine_opclause_expression(expr, &var, &var2, NULL, NULL))
 			return false;
 
 		/*
@@ -1033,7 +1037,20 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			!get_func_leakproof(get_opcode(expr->opno)))
 			return false;
 
-		return statext_is_compatible_clause_internal(root, (Node *) var,
+		/*
+		 * Check compatibility of the first Var - we get this one for both
+		 * types of supported expressions (Var op Const) and (Var op Var).
+		 */
+		if (!statext_is_compatible_clause_internal(root, (Node *) var,
+												   relid, attnums))
+			return false;
+
+		/* For (Var op Const) we don't get the second Var, and we're done. */
+		if (!var2)
+			return true;
+
+		/* For (Var op Var) check compatibility of the second Var. */
+		return statext_is_compatible_clause_internal(root, (Node *) var2,
 													 relid, attnums);
 	}
 
@@ -1422,19 +1439,21 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
  * examine_opclause_expression
  *		Split expression into Var and Const parts.
  *
- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.
  *
  * Optionally returns pointers to the extracted Var/Const nodes, when passed
  * non-null pointers (varp, cstp and varonleftp). The varonleftp flag specifies
  * on which side of the operator we found the Var node.
  */
 bool
-examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonleftp)
+examine_opclause_expression(OpExpr *expr, Var **var1p, Var **var2p,
+							Const **cstp, bool *varonleftp)
 {
-	Var	   *var;
-	Const  *cst;
+	Var	   *var1 = NULL;
+	Var	   *var2 = NULL;
+	Const  *cst = NULL;
 	bool	varonleft;
 	Node   *leftop,
 		   *rightop;
@@ -1454,22 +1473,38 @@ examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonl
 
 	if (IsA(leftop, Var) && IsA(rightop, Const))
 	{
-		var = (Var *) leftop;
+		var1 = (Var *) leftop;
 		cst = (Const *) rightop;
 		varonleft = true;
 	}
 	else if (IsA(leftop, Const) && IsA(rightop, Var))
 	{
-		var = (Var *) rightop;
+		var1 = (Var *) rightop;
 		cst = (Const *) leftop;
 		varonleft = false;
 	}
+	else if (IsA(leftop, Var) && IsA(rightop, Var))
+	{
+		var1 = (Var *) leftop;
+		var2 = (Var *) rightop;
+		varonleft = false;
+
+		/*
+		 * Both variables have to be for the same relation (otherwise it's
+		 * a join clause, and we don't deal with those yet.
+		 */
+		if (var1->varno != var2->varno)
+			return false;
+	}
 	else
 		return false;
 
 	/* return pointers to the extracted parts if requested */
-	if (varp)
-		*varp = var;
+	if (var1p)
+		*var1p = var1;
+
+	if (var2p)
+		*var2p = var2;
 
 	if (cstp)
 		*cstp = cst;
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3f42713aa2..97d3083451 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1581,16 +1581,25 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 			/* valid only after examine_opclause_expression returns true */
 			Var		   *var;
+			Var		   *var2;
 			Const	   *cst;
 			bool		varonleft;
 
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
-			/* extract the var and const from the expression */
-			if (examine_opclause_expression(expr, &var, &cst, &varonleft))
+			/* extract the vars and const from the expression */
+			if (!examine_opclause_expression(expr, &var, &var2, &cst, &varonleft))
+				continue;	/* XXX Can this actually happen? */
+
+			/* We should always get at least one Var. */
+			Assert(var);
+
+			if (cst)
 			{
 				int			idx;
 
+				Assert(!var2);
+
 				/* match the attribute to a dimension of the statistic */
 				idx = bms_member_index(keys, var->varattno);
 
@@ -1651,6 +1660,68 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					matches[i] = RESULT_MERGE(matches[i], is_or, match);
 				}
 			}
+			else
+			{
+				int			idx;
+				int			idx2;
+
+				Assert(var2);
+
+				/* match the attribute to a dimension of the statistic */
+				idx = bms_member_index(keys, var->varattno);
+				idx2 = bms_member_index(keys, var2->varattno);
+
+				/*
+				 * Walk through the MCV items and evaluate the current clause.
+				 * We can skip items that were already ruled out, and
+				 * terminate if there are no remaining MCV items that might
+				 * possibly match.
+				 */
+				for (i = 0; i < mcvlist->nitems; i++)
+				{
+					bool		match = true;
+					MCVItem    *item = &mcvlist->items[i];
+
+					/*
+					 * When either of the MCV items is NULL we can treat this
+					 * as a mismatch. We must not call the operator because
+					 * of strictness.
+					 */
+					if (item->isnull[idx] || item->isnull[idx2])
+					{
+						matches[i] = RESULT_MERGE(matches[i], is_or, false);
+						continue;
+					}
+
+					/*
+					 * Skip MCV items that can't change result in the bitmap.
+					 * Once the value gets false for AND-lists, or true for
+					 * OR-lists, we don't need to look at more clauses.
+					 */
+					if (RESULT_IS_FINAL(matches[i], is_or))
+						continue;
+
+					/*
+					 * First check whether the constant is below the lower
+					 * boundary (in that case we can skip the bucket, because
+					 * there's no overlap).
+					 *
+					 * We don't store collations used to build the statistics,
+					 * but we can use the collation for the attribute itself,
+					 * as stored in varcollid. We do reset the statistics after
+					 * a type change (including collation change), so this is
+					 * OK. We may need to relax this after allowing extended
+					 * statistics on expressions.
+					 */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   var->varcollid,
+														   item->values[idx],
+														   item->values[idx2]));
+
+					/* update the match bitmap with the result */
+					matches[i] = RESULT_MERGE(matches[i], is_or, match);
+				}
+			}
 		}
 		else if (IsA(clause, NullTest))
 		{
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 5171895bba..804089bc57 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -96,7 +96,7 @@ extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-extern bool examine_opclause_expression(OpExpr *expr, Var **varp,
+extern bool examine_opclause_expression(OpExpr *expr, Var **var1p, Var **var2p,
 										Const **cstp, bool *varonleftp);
 
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 383465e8cb..3eb1804cf3 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -603,6 +603,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        343 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      1667 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+      1667 |      0
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -654,6 +666,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      3750 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -749,6 +773,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d
       3750 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+        25 |   2500
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON b, d FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -758,6 +788,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d
       2500 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- mcv with arrays
 CREATE TABLE mcv_lists_arrays (
     a TEXT[],
@@ -808,6 +844,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
       1094 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      9950 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+        50 |   2500
+(1 row)
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 ANALYZE mcv_lists_bool;
@@ -835,6 +883,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -893,6 +953,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR
       2649 |   1572
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+         1 |   5000
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -939,6 +1005,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR
       1571 |   1572
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+      5000 |   5000
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 6fd8b016bd..42b67f7cde 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -381,6 +381,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 
@@ -402,6 +406,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -473,6 +481,8 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON b, d FROM mcv_lists;
 
@@ -480,6 +490,8 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- mcv with arrays
 CREATE TABLE mcv_lists_arrays (
     a TEXT[],
@@ -521,6 +533,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 
@@ -534,6 +550,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -560,6 +580,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -574,6 +595,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#6)

Re: Additional improvements to extended statistics

On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Speaking of which, would you take a look at [1]? I think supporting SAOP
is fine, but I wonder if you agree with my conclusion we can't really
support inclusion @> as explained in [2].

Hmm, I'm not sure. However, thinking about your example in [2] reminds
me of a thought I had a while ago, but then forgot about --- there is
a flaw in the formula used for computing probabilities with functional
dependencies:

P(a,b) = P(a) * [f + (1-f)*P(b)]

because it might return a value that is larger that P(b), which
obviously should not be possible. We should amend that formula to
prevent a result larger than P(b). The obvious way to do that would be
to use:

P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))

but actually I think it would be better and more principled to use:

P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)

I.e., for those rows believed to be functionally dependent, we use the
minimum probability, and for the rows believed to be independent, we
use the product.

I think that would solve the problem with the example you gave at the
end of [2], but I'm not sure if it helps with the general case.

Regards,
Dean

Show quoted text

[1] /messages/by-id/13902317.Eha0YfKkKy@pierred-pdoc
[2] /messages/by-id/20200202184134.swoqkqlqorqolrqv@development

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#8)

Re: Additional improvements to extended statistics

On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:

On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Speaking of which, would you take a look at [1]? I think supporting SAOP
is fine, but I wonder if you agree with my conclusion we can't really
support inclusion @> as explained in [2].

Hmm, I'm not sure. However, thinking about your example in [2] reminds
me of a thought I had a while ago, but then forgot about --- there is
a flaw in the formula used for computing probabilities with functional
dependencies:

P(a,b) = P(a) * [f + (1-f)*P(b)]

because it might return a value that is larger that P(b), which
obviously should not be possible.

Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:

create table t (a int, b int);
insert into t select mod(i,50), mod(i,25)
from generate_series(1,5000) s(i);
update t set a = 0 where a < 3;
create statistics s (dependencies) on a,b from t;

which then does this:

test=# explain select * from t where a = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=300 width=8)
Filter: (a = 0)
(2 rows)

test=# explain select * from t where b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=200 width=8)
Filter: (b = 0)
(2 rows)

test=# explain select * from t where a = 0 and b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..99.00 rows=283 width=8)
Filter: ((a = 0) AND (b = 0))
(2 rows)

Which I think is the issue you've described ...

We should amend that formula to prevent a result larger than P(b). The
obvious way to do that would be to use:

P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))

but actually I think it would be better and more principled to use:

P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)

I.e., for those rows believed to be functionally dependent, we use the
minimum probability, and for the rows believed to be independent, we
use the product.

Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(

We get both clauses, but we only compute selectivity of the "implied"
clause, and we leave the other one as not estimated (possibly up to
clauselist_selectivity).

It's also not clear to me how would this work for more than two clauses,
that are all functionally dependent. Like (a => b => c), for example.
But I haven't thought about this very much yet.

I think that would solve the problem with the example you gave at the
end of [2], but I'm not sure if it helps with the general case.

I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how would the proposed changes deal with this?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#9)

1 attachment(s)

Re: Additional improvements to extended statistics

On Mon, 9 Mar 2020 at 18:19, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:>

On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:

P(a,b) = P(a) * [f + (1-f)*P(b)]

because it might return a value that is larger that P(b), which
obviously should not be possible.

Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:

create table t (a int, b int);
insert into t select mod(i,50), mod(i,25)
from generate_series(1,5000) s(i);
update t set a = 0 where a < 3;
create statistics s (dependencies) on a,b from t;

which then does this:

test=# explain select * from t where a = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=300 width=8)
Filter: (a = 0)
(2 rows)

test=# explain select * from t where b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=200 width=8)
Filter: (b = 0)
(2 rows)

test=# explain select * from t where a = 0 and b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..99.00 rows=283 width=8)
Filter: ((a = 0) AND (b = 0))
(2 rows)

Which I think is the issue you've described ...

I think this is also related to the problem that functional dependency
stats don't take into account the fact that the user clauses may not
be compatible with one another. For example:

CREATE TABLE t (a int, b int);
INSERT INTO t
SELECT x/10,x/10 FROM (SELECT generate_series(1,x)
FROM generate_series(1,100) g(x)) AS t(x);
CREATE STATISTICS s (dependencies) ON a,b FROM t;
ANALYSE t;

EXPLAIN SELECT * FROM t WHERE a = 10;

QUERY PLAN
--------------------------------------------------
Seq Scan on t (cost=0.00..86.12 rows=1 width=8)
Filter: (a = 10)
(2 rows)

EXPLAIN SELECT * FROM t WHERE b = 1;

QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.12 rows=865 width=8)
Filter: (b = 1)
(2 rows)

EXPLAIN SELECT * FROM t WHERE a = 10 AND b = 1;

QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..98.75 rows=865 width=8)
Filter: ((a = 10) AND (b = 1))
(2 rows)

whereas without stats it would estimate 1 row. That kind of
over-estimate could get very bad, so it would be good to find a way to
fix it.

We should amend that formula to prevent a result larger than P(b). The
obvious way to do that would be to use:

P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))

but actually I think it would be better and more principled to use:

P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)

I.e., for those rows believed to be functionally dependent, we use the
minimum probability, and for the rows believed to be independent, we
use the product.

Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(

I hacked on this a bit, and I think it's possible to apply dependency
stats in a more general way (not necessarily assuming equality
clauses), something like the attached very rough patch.

This approach guarantees that the result of combining a pair of
selectivities with a functional dependency between them gives a
combined selectivity that is never greater than either individual
selectivity.

One regression test fails, but looking at it, that's to be expected --
the test alters the type of a column, causing its univariate stats to
be dropped, so the single-column estimate is reduced, and the new code
refuses to give a higher estimate than the single clause's new
estimate.

It's also not clear to me how would this work for more than two clauses,
that are all functionally dependent. Like (a => b => c), for example.
But I haven't thought about this very much yet.

I attempted to solve that by computing a chain of conditional
probabilities. The maths needs checking over (as I said, this is a
very rough patch). In particular, I think it's wrong for cases like (
a->b, a->c ), but I think it's along the right lines.

I think that would solve the problem with the example you gave at the
end of [2], but I'm not sure if it helps with the general case.

I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how would the proposed changes deal with this?

With this patch, the original estimate of ~900 rows in that example is
restored with functional dependencies, because of the way it utilises
the minimum selectivity of the 2 clauses.

I've not fully thought this through, but I think it might allow
functional dependencies to be applied to a wider range of operators.

Regards,
Dean

Attachments:

dependencies_clauselist_selectivity.patchtext/x-patch; charset=US-ASCII; name=dependencies_clauselist_selectivity.patchDownload

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
new file mode 100644
index e2f6c5b..5bd7bb5
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -30,6 +30,7 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 #include "utils/typcache.h"
 
@@ -73,8 +74,6 @@ static double dependency_degree(int numr
 								AttrNumber *dependency, VacAttrStats **stats, Bitmapset *attrs);
 static bool dependency_is_fully_matched(MVDependency *dependency,
 										Bitmapset *attnums);
-static bool dependency_implies_attribute(MVDependency *dependency,
-										 AttrNumber attnum);
 static bool dependency_is_compatible_clause(Node *clause, Index relid,
 											AttrNumber *attnum);
 static MVDependency *find_strongest_dependency(MVDependencies **dependencies,
@@ -614,19 +613,6 @@ dependency_is_fully_matched(MVDependency
 }
 
 /*
- * dependency_implies_attribute
- *		check that the attnum matches is implied by the functional dependency
- */
-static bool
-dependency_implies_attribute(MVDependency *dependency, AttrNumber attnum)
-{
-	if (attnum == dependency->attributes[dependency->nattributes - 1])
-		return true;
-
-	return false;
-}
-
-/*
  * statext_dependencies_load
  *		Load the functional dependencies for the indicated pg_statistic_ext tuple
  */
@@ -949,20 +935,27 @@ dependencies_clauselist_selectivity(Plan
 									RelOptInfo *rel,
 									Bitmapset **estimatedclauses)
 {
-	Selectivity s1 = 1.0;
+	Selectivity s1;
 	ListCell   *l;
-	Bitmapset  *clauses_attnums = NULL;
-	Bitmapset **list_attnums;
+	Bitmapset  *clauses_attnums;
+	AttrNumber *list_attnums;
 	int			listidx;
-	MVDependencies    **dependencies = NULL;
-	int					ndependencies = 0;
+	MVDependencies **dependencies;
+	int			ndependencies;
+	int			max_deps;
+	MVDependency **dependencies_used;
+	int			ndependencies_used;
+	Bitmapset  *attrs_used;
+	int			nattrs_used;
+	Selectivity *attr_sel;
 	int			i;
+	int			attidx;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
 		return 1.0;
 
-	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
+	list_attnums = (AttrNumber *) palloc(sizeof(AttrNumber) *
 										 list_length(clauses));
 
 	/*
@@ -976,6 +969,7 @@ dependencies_clauselist_selectivity(Plan
 	 * We also skip clauses that we already estimated using different types of
 	 * statistics (we treat them as incompatible).
 	 */
+	clauses_attnums = NULL;
 	listidx = 0;
 	foreach(l, clauses)
 	{
@@ -985,11 +979,11 @@ dependencies_clauselist_selectivity(Plan
 		if (!bms_is_member(listidx, *estimatedclauses) &&
 			dependency_is_compatible_clause(clause, rel->relid, &attnum))
 		{
-			list_attnums[listidx] = bms_make_singleton(attnum);
+			list_attnums[listidx] = attnum;
 			clauses_attnums = bms_add_member(clauses_attnums, attnum);
 		}
 		else
-			list_attnums[listidx] = NULL;
+			list_attnums[listidx] = 0;
 
 		listidx++;
 	}
@@ -1001,6 +995,7 @@ dependencies_clauselist_selectivity(Plan
 	 */
 	if (bms_num_members(clauses_attnums) < 2)
 	{
+		bms_free(clauses_attnums);
 		pfree(list_attnums);
 		return 1.0;
 	}
@@ -1017,6 +1012,7 @@ dependencies_clauselist_selectivity(Plan
 	 * to moving the freed chunks to freelists etc.
 	 */
 	ndependencies = 0;
+	max_deps = 0;
 	dependencies = (MVDependencies **) palloc(sizeof(MVDependencies *) *
 											  list_length(rel->statlist));
 
@@ -1038,104 +1034,183 @@ dependencies_clauselist_selectivity(Plan
 		if (num_matched < 2)
 			continue;
 
-		dependencies[ndependencies++]
+		dependencies[ndependencies]
 			= statext_dependencies_load(stat->statOid);
+
+		max_deps += dependencies[ndependencies]->ndeps;
+		ndependencies++;
 	}
 
 	/* if no matching stats could be found then we've nothing to do */
 	if (!ndependencies)
 	{
+		pfree(dependencies);
+		bms_free(clauses_attnums);
 		pfree(list_attnums);
 		return 1.0;
 	}
 
 	/*
-	 * Apply the dependencies recursively, starting with the widest/strongest
-	 * ones, and proceeding to the smaller/weaker ones. At the end of each
-	 * round we factor in the selectivity of clauses on the implied attribute,
-	 * and remove the clauses from the list.
+	 * Work out which dependencies we can apply, starting with the
+	 * widest/stongest ones, and proceeding to smaller/weaker ones.
 	 */
+	ndependencies_used = 0;
+	dependencies_used = (MVDependency **) palloc(sizeof(MVDependency *) *
+												 max_deps);
+
+	attrs_used = NULL;
+
 	while (true)
 	{
-		Selectivity s2 = 1.0;
 		MVDependency *dependency;
+		AttrNumber	attnum;
 
-		/* the widest/strongest dependency, fully matched by clauses */
+		/* The widest/strongest dependency, fully matched by clauses */
 		dependency = find_strongest_dependency(dependencies, ndependencies,
 											   clauses_attnums);
-
-		/* if no suitable dependency was found, we're done */
 		if (!dependency)
 			break;
 
-		/*
-		 * We found an applicable dependency, so find all the clauses on the
-		 * implied attribute - with dependency (a,b => c) we look for clauses
-		 * on 'c'.
-		 */
-		listidx = -1;
-		foreach(l, clauses)
+		/* Record all attributes used by this dependency */
+		for (i = 0; i < dependency->nattributes; i++)
 		{
-			Node	   *clause;
-			AttrNumber	attnum;
+			attnum = dependency->attributes[i];
+			attrs_used = bms_add_member(attrs_used, attnum);
+		}
 
-			listidx++;
+		/* Ignore any other dependencies with the same implied attribute */
+		clauses_attnums = bms_del_member(clauses_attnums, attnum);
 
-			/*
-			 * Skip incompatible clauses, and ones we've already estimated on.
-			 */
-			if (!list_attnums[listidx])
-				continue;
+		dependencies_used[ndependencies_used++] = dependency;
+	}
 
-			/*
-			 * We expect the bitmaps ton contain a single attribute number.
-			 */
-			attnum = bms_singleton_member(list_attnums[listidx]);
+	if (!ndependencies_used)
+	{
+		pfree(dependencies_used);
+		pfree(dependencies);
+		bms_free(clauses_attnums);
+		pfree(list_attnums);
+		return 1.0;
+	}
 
-			/*
-			 * Technically we could find more than one clause for a given
-			 * attnum. Since these clauses must be equality clauses, we choose
-			 * to only take the selectivity estimate from the final clause in
-			 * the list for this attnum. If the attnum happens to be compared
-			 * to a different Const in another clause then no rows will match
-			 * anyway. If it happens to be compared to the same Const, then
-			 * ignoring the additional clause is just the thing to do.
-			 */
-			if (dependency_implies_attribute(dependency, attnum))
-			{
-				clause = (Node *) lfirst(l);
+	/*
+	 * For each attribute used in the dependencies to be applied, compute the
+	 * selectivity of all the clauses using that attribute, and mark all those
+	 * clauses as having been estimated.
+	 */
+	nattrs_used = bms_num_members(attrs_used);
+	attr_sel = (Selectivity *) palloc(sizeof(Selectivity) * nattrs_used);
 
-				s2 = clause_selectivity(root, clause, varRelid, jointype,
-										sjinfo);
+	attidx = 0;
+	i = -1;
+	while ((i = bms_next_member(attrs_used, i)) >= 0)
+	{
+		List	   *attr_clauses = NIL;
+		Selectivity	simple_sel;
 
-				/* mark this one as done, so we don't touch it again. */
-				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+		listidx = -1;
+		foreach(l, clauses)
+		{
+			Node	   *clause = (Node *) lfirst(l);
 
-				/*
-				 * Mark that we've got and used the dependency on this clause.
-				 * We'll want to ignore this when looking for the next
-				 * strongest dependency above.
-				 */
-				clauses_attnums = bms_del_member(clauses_attnums, attnum);
+			listidx++;
+			if (list_attnums[listidx] == i)
+			{
+				attr_clauses = lappend(attr_clauses, clause);
+				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
 			}
 		}
 
+		simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid,
+												   jointype, sjinfo, NULL);
+		attr_sel[attidx++] = simple_sel;
+	}
+
+	/*
+	 * Now combine these selectivities using the dependency information.  We
+	 * traverse the dependencies in the opposite order to which we found them,
+	 * since dependencies near the start of the list may depend on attributes
+	 * implied by dependencies later in the list (but the opposite cannot
+	 * happen).
+	 *
+	 * Pairs of selectivities for attributes with dependencies are combined
+	 * using the formula
+	 *
+	 * P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
+	 *
+	 * where 'f' is the degree of validity of the dependency.  This ensures
+	 * that the combined selectivity is never greater than either individual
+	 * selectivity.
+	 *
+	 * Where multiple dependencies apply (e.g., a -> b -> c), we use
+	 * conditional probabilities to compute the overall result as follows:
+	 *
+	 * P(a,b,c) = P(c|a,b) * P(a,b) = P(c|a,b) * P(a|b) * P(b)
+	 *
+	 * so we replace the probabilities of all implied attributes with
+	 * conditional probabilities, conditional on all their implying
+	 * attributes.  The probabilities of all other non-implied attributes are
+	 * left as they are.
+	 */
+	for (i = ndependencies_used-1; i >= 0; i--)
+	{
+		MVDependency *dependency = dependencies_used[i];
+		int			j;
+		AttrNumber	attnum;
+		Selectivity	s2;
+		double		f;
+
+		/* Selectivity of all the implying attributes */
+		s1 = 1.0;
+		for (j = 0; j < dependency->nattributes - 1; j++)
+		{
+			attnum = dependency->attributes[j];
+			attidx = bms_member_index(attrs_used, attnum);
+			s1 *= attr_sel[attidx];
+		}
+
+		/* Base selectivity of the implied attribute */
+		attnum = dependency->attributes[j];
+		attidx = bms_member_index(attrs_used, attnum);
+		s2 = attr_sel[attidx];
+
 		/*
-		 * Now factor in the selectivity for all the "implied" clauses into
-		 * the final one, using this formula:
+		 * Replace the probability s2 with the conditional probability s2
+		 * given s1, computed using the formula P(b|a) = P(a,b) / P(a), which
+		 * simplifies to
 		 *
-		 * P(a,b) = P(a) * (f + (1-f) * P(b))
+		 * P(b|a) = f * Min(P(a), P(b)) / P(a) + (1-f) * P(b)
 		 *
-		 * where 'f' is the degree of validity of the dependency.
+		 * where P(a) = s1, the selectivity of the implying attributes, and
+		 * P(b) = s2, the selectivity of the implied attribute.
 		 */
-		s1 *= (dependency->degree + (1 - dependency->degree) * s2);
+		f = dependency->degree;
+
+		if (s1 <= s2)
+			attr_sel[attidx] = f + (1 - f) * s2;
+		else
+			attr_sel[attidx] = f * s2 / s1 + (1 -f) * s2;
 	}
 
+	/*
+	 * The overall selectivity of all the clauses on all these attributes is
+	 * then the product of all these conditional/base probabilities.
+	 */
+	s1 = 1.0;
+	for (i = 0; i < nattrs_used; i++)
+		s1 *= attr_sel[i];
+
+	CLAMP_PROBABILITY(s1);
+
 	/* free deserialized functional dependencies (and then the array) */
 	for (i = 0; i < ndependencies; i++)
 		pfree(dependencies[i]);
 
+	pfree(attr_sel);
+	bms_free(attrs_used);
+	pfree(dependencies_used);
 	pfree(dependencies);
+	bms_free(clauses_attnums);
 	pfree(list_attnums);
 
 	return s1;

#11

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#7)

Re: Additional improvements to extended statistics

On Mon, 9 Mar 2020 at 00:06, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:

Attaches is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

One day I won't forget to actually attach the files ...

0001-0003 look reasonable to me.

One minor point -- there are now 2 code blocks that are basically the
same, looping over a list of clauses, calling clause_selectivity() and
then applying the "s1 = s1 + s2 - s1 * s2" formula. Perhaps they could
be combined into a new function (clauselist_selectivity_simple_or(),
say). I guess it would need to be passed the initial starting
selectivity s1, but it ought to help reduce code duplication.

Regards,
Dean

#12

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#11)

4 attachment(s)

Re: Additional improvements to extended statistics

On Fri, Mar 13, 2020 at 04:54:51PM +0000, Dean Rasheed wrote:

On Mon, 9 Mar 2020 at 00:06, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:

Attaches is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

One day I won't forget to actually attach the files ...

0001-0003 look reasonable to me.

One minor point -- there are now 2 code blocks that are basically the
same, looping over a list of clauses, calling clause_selectivity() and
then applying the "s1 = s1 + s2 - s1 * s2" formula. Perhaps they could
be combined into a new function (clauselist_selectivity_simple_or(),
say). I guess it would need to be passed the initial starting
selectivity s1, but it ought to help reduce code duplication.

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Improve-test-coverage-for-multi-column-MCV--20200314.patchtext/plain; charset=us-asciiDownload

From cbe6cf599e52fb2335345ddc6736b0139d113760 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Thu, 5 Mar 2020 01:47:16 +0100
Subject: [PATCH 1/4] Improve test coverage for multi-column MCV lists

The regression tests for extended statistics were not testing a couple
of important cases for the MCV lists:

  * IS NOT NULL clauses - We did have queries with IS NULL clauses, but
    not the negative case.

  * clauses with variable on the right - All the clauses had the Var on
    the left, i.e. (Var op Const), so this adds (Const op Var) too.

  * columns with fixed-length types passed by reference - All columns
    were using either by-value or varlena types, so add a test with
    UUID columns too. This matters for (de)serialization.

  * NULL-only dimension - When one of the columns contains only NULL
    values, we treat it a a special case during (de)serialization.

  * arrays containing NULL - When the constant parameter contains NULL
    value, we need to handle it correctly during estimation, for all
    IN, ANY and ALL clauses.

Discussion: https://www.postgresql.org/message-id/flat/13902317.Eha0YfKkKy%40pierred-pdoc
Author: Tomas Vondra
---
 src/test/regress/expected/stats_ext.out | 237 +++++++++++++++++++++++-
 src/test/regress/sql/stats_ext.sql      | 102 +++++++++-
 2 files changed, 335 insertions(+), 4 deletions(-)

diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 9fa659c71d..2ef19c590a 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -785,18 +785,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b =
          1 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 1 = a AND ''1'' = b');
+ estimated | actual 
+-----------+--------
+         1 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 1 AND b < ''1''');
  estimated | actual 
 -----------+--------
          1 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 1 > a AND ''1'' > b');
+ estimated | actual 
+-----------+--------
+         1 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 0 AND b <= ''0''');
  estimated | actual 
 -----------+--------
          1 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 0 >= a AND ''0'' >= b');
+ estimated | actual 
+-----------+--------
+         1 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
  estimated | actual 
 -----------+--------
@@ -809,12 +827,24 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND b <
          1 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND ''1'' > b AND 5 > c');
+ estimated | actual 
+-----------+--------
+         1 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <= ''0'' AND c <= 4');
  estimated | actual 
 -----------+--------
          1 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0'' >= b AND 4 >= c');
+ estimated | actual 
+-----------+--------
+         1 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
  estimated | actual 
 -----------+--------
@@ -833,30 +863,60 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51
          8 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
+ estimated | actual 
+-----------+--------
+         8 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
  estimated | actual 
 -----------+--------
          8 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[NULL, 1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2'', NULL])');
+ estimated | actual 
+-----------+--------
+         8 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, 2, 3]) AND b IN (''1'', ''2'', ''3'')');
  estimated | actual 
 -----------+--------
         26 |    150
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, NULL, 2, 3]) AND b IN (''1'', ''2'', NULL, ''3'')');
+ estimated | actual 
+-----------+--------
+        26 |    150
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3])');
  estimated | actual 
 -----------+--------
         10 |    100
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3, NULL])');
+ estimated | actual 
+-----------+--------
+        10 |    100
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', ''3'') AND c > ANY (ARRAY[1, 2, 3])');
  estimated | actual 
 -----------+--------
          1 |    100
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
+ estimated | actual 
+-----------+--------
+         1 |    100
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -866,18 +926,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b =
         50 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 1 = a AND ''1'' = b');
+ estimated | actual 
+-----------+--------
+        50 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 1 AND b < ''1''');
  estimated | actual 
 -----------+--------
         50 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 1 > a AND ''1'' > b');
+ estimated | actual 
+-----------+--------
+        50 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 0 AND b <= ''0''');
  estimated | actual 
 -----------+--------
         50 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 0 >= a AND ''0'' >= b');
+ estimated | actual 
+-----------+--------
+        50 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
  estimated | actual 
 -----------+--------
@@ -890,12 +968,24 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND b <
         50 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND ''1'' > b AND 5 > c');
+ estimated | actual 
+-----------+--------
+        50 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <= ''0'' AND c <= 4');
  estimated | actual 
 -----------+--------
         50 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0'' >= b AND 4 >= c');
+ estimated | actual 
+-----------+--------
+        50 |     50
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
  estimated | actual 
 -----------+--------
@@ -908,30 +998,60 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
  estimated | actual 
 -----------+--------
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[NULL, 1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2'', NULL])');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, 2, 3]) AND b IN (''1'', ''2'', ''3'')');
  estimated | actual 
 -----------+--------
        150 |    150
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, NULL, 2, 3]) AND b IN (''1'', ''2'', NULL, ''3'')');
+ estimated | actual 
+-----------+--------
+       150 |    150
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3])');
  estimated | actual 
 -----------+--------
        100 |    100
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3, NULL])');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', ''3'') AND c > ANY (ARRAY[1, 2, 3])');
  estimated | actual 
 -----------+--------
        100 |    100
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
 -- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
  estimated | actual 
@@ -988,6 +1108,24 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
          1 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NOT NULL');
+ estimated | actual 
+-----------+--------
+        49 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NOT NULL AND b IS NULL AND c IS NOT NULL');
+ estimated | actual 
+-----------+--------
+        95 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (0, 1) AND b IN (''0'', ''1'')');
+ estimated | actual 
+-----------+--------
+         1 |     50
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -1003,6 +1141,24 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
         50 |     50
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NOT NULL');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NOT NULL AND b IS NULL AND c IS NOT NULL');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (0, 1) AND b IN (''0'', ''1'')');
+ estimated | actual 
+-----------+--------
+        50 |     50
+(1 row)
+
 -- test pg_mcv_list_items with a very simple (single item) MCV list
 TRUNCATE mcv_lists;
 INSERT INTO mcv_lists (a, b, c) SELECT 1, 2, 3 FROM generate_series(1,1000) s(i);
@@ -1022,7 +1178,7 @@ TRUNCATE mcv_lists;
 DROP STATISTICS mcv_lists_stats;
 INSERT INTO mcv_lists (a, b, c, d)
      SELECT
-         (CASE WHEN mod(i,2) = 0 THEN NULL ELSE 0 END),
+         NULL, -- always NULL
          (CASE WHEN mod(i,2) = 0 THEN NULL ELSE 'x' END),
          (CASE WHEN mod(i,2) = 0 THEN NULL ELSE 0 END),
          (CASE WHEN mod(i,2) = 0 THEN NULL ELSE 'x' END)
@@ -1034,15 +1190,92 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d
       3750 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''x'' OR d = ''x''');
+ estimated | actual 
+-----------+--------
+      3750 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
+ estimated | actual 
+-----------+--------
+      3750 |   2500
+(1 row)
+
 -- create statistics
-CREATE STATISTICS mcv_lists_stats (mcv) ON b, d FROM mcv_lists;
+CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, d FROM mcv_lists;
 ANALYZE mcv_lists;
+-- test pg_mcv_list_items with MCV list containing variable-length data and NULLs
+SELECT m.*
+  FROM pg_statistic_ext s, pg_statistic_ext_data d,
+       pg_mcv_list_items(d.stxdmcv) m
+ WHERE s.stxname = 'mcv_lists_stats'
+   AND d.stxoid = s.oid;
+ index |      values      |  nulls  | frequency | base_frequency 
+-------+------------------+---------+-----------+----------------
+     0 | {NULL,x,x}       | {t,f,f} |       0.5 |           0.25
+     1 | {NULL,NULL,NULL} | {t,t,t} |       0.5 |           0.25
+(2 rows)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
  estimated | actual 
 -----------+--------
       2500 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''x'' OR d = ''x''');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
+-- mcv with pass-by-ref fixlen types, e.g. uuid
+CREATE TABLE mcv_lists_uuid (
+    a UUID,
+    b UUID,
+    c UUID
+);
+INSERT INTO mcv_lists_uuid (a, b, c)
+     SELECT
+         md5(mod(i,100)::text)::uuid,
+         md5(mod(i,50)::text)::uuid,
+         md5(mod(i,25)::text)::uuid
+     FROM generate_series(1,5000) s(i);
+ANALYZE mcv_lists_uuid;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_uuid WHERE a = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND b = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc''');
+ estimated | actual 
+-----------+--------
+         1 |     50
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_uuid WHERE a = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND b = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND c = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc''');
+ estimated | actual 
+-----------+--------
+         1 |     50
+(1 row)
+
+CREATE STATISTICS mcv_lists_uuid_stats (mcv) ON a, b, c
+  FROM mcv_lists_uuid;
+ANALYZE mcv_lists_uuid;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_uuid WHERE a = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND b = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc''');
+ estimated | actual 
+-----------+--------
+        50 |     50
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_uuid WHERE a = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND b = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND c = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc''');
+ estimated | actual 
+-----------+--------
+        50 |     50
+(1 row)
+
+DROP TABLE mcv_lists_uuid;
 -- mcv with arrays
 CREATE TABLE mcv_lists_arrays (
     a TEXT[],
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 0ece39a279..cc353d0466 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -447,30 +447,50 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 1 = a AND ''1'' = b');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 1 AND b < ''1''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 1 > a AND ''1'' > b');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 0 AND b <= ''0''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 0 >= a AND ''0'' >= b');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND b < ''1'' AND c < 5');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND ''1'' > b AND 5 > c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <= ''0'' AND c <= 4');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0'' >= b AND 4 >= c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[NULL, 1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2'', NULL])');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, 2, 3]) AND b IN (''1'', ''2'', ''3'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, NULL, 2, 3]) AND b IN (''1'', ''2'', NULL, ''3'')');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3])');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3, NULL])');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', ''3'') AND c > ANY (ARRAY[1, 2, 3])');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 
@@ -478,28 +498,48 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 1 = a AND ''1'' = b');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 1 AND b < ''1''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 1 > a AND ''1'' > b');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 0 AND b <= ''0''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 0 >= a AND ''0'' >= b');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND b < ''1'' AND c < 5');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND ''1'' > b AND 5 > c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <= ''0'' AND c <= 4');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0'' >= b AND 4 >= c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[NULL, 1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2'', NULL])');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, 2, 3]) AND b IN (''1'', ''2'', ''3'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, NULL, 2, 3]) AND b IN (''1'', ''2'', NULL, ''3'')');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3])');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3, NULL])');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', ''3'') AND c > ANY (ARRAY[1, 2, 3])');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
+
 -- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
@@ -538,6 +578,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL AND c IS NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NOT NULL');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NOT NULL AND b IS NULL AND c IS NOT NULL');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (0, 1) AND b IN (''0'', ''1'')');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 
@@ -547,6 +593,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL AND c IS NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NOT NULL');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NOT NULL AND b IS NULL AND c IS NOT NULL');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (0, 1) AND b IN (''0'', ''1'')');
+
 -- test pg_mcv_list_items with a very simple (single item) MCV list
 TRUNCATE mcv_lists;
 INSERT INTO mcv_lists (a, b, c) SELECT 1, 2, 3 FROM generate_series(1,1000) s(i);
@@ -564,7 +616,7 @@ DROP STATISTICS mcv_lists_stats;
 
 INSERT INTO mcv_lists (a, b, c, d)
      SELECT
-         (CASE WHEN mod(i,2) = 0 THEN NULL ELSE 0 END),
+         NULL, -- always NULL
          (CASE WHEN mod(i,2) = 0 THEN NULL ELSE 'x' END),
          (CASE WHEN mod(i,2) = 0 THEN NULL ELSE 0 END),
          (CASE WHEN mod(i,2) = 0 THEN NULL ELSE 'x' END)
@@ -574,13 +626,59 @@ ANALYZE mcv_lists;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''x'' OR d = ''x''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
+
 -- create statistics
-CREATE STATISTICS mcv_lists_stats (mcv) ON b, d FROM mcv_lists;
+CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, d FROM mcv_lists;
 
 ANALYZE mcv_lists;
 
+-- test pg_mcv_list_items with MCV list containing variable-length data and NULLs
+SELECT m.*
+  FROM pg_statistic_ext s, pg_statistic_ext_data d,
+       pg_mcv_list_items(d.stxdmcv) m
+ WHERE s.stxname = 'mcv_lists_stats'
+   AND d.stxoid = s.oid;
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = ''x'' OR d = ''x''');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''x'' OR d = ''x''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
+
+-- mcv with pass-by-ref fixlen types, e.g. uuid
+CREATE TABLE mcv_lists_uuid (
+    a UUID,
+    b UUID,
+    c UUID
+);
+
+INSERT INTO mcv_lists_uuid (a, b, c)
+     SELECT
+         md5(mod(i,100)::text)::uuid,
+         md5(mod(i,50)::text)::uuid,
+         md5(mod(i,25)::text)::uuid
+     FROM generate_series(1,5000) s(i);
+
+ANALYZE mcv_lists_uuid;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_uuid WHERE a = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND b = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_uuid WHERE a = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND b = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND c = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc''');
+
+CREATE STATISTICS mcv_lists_uuid_stats (mcv) ON a, b, c
+  FROM mcv_lists_uuid;
+
+ANALYZE mcv_lists_uuid;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_uuid WHERE a = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND b = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_uuid WHERE a = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND b = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc'' AND c = ''1679091c-5a88-0faf-6fb5-e6087eb1b2dc''');
+
+DROP TABLE mcv_lists_uuid;
+
 -- mcv with arrays
 CREATE TABLE mcv_lists_arrays (
     a TEXT[],
-- 
2.21.1

0002-Improve-estimation-of-OR-clauses-with-exten-20200314.patchtext/plain; charset=us-asciiDownload

From 90c56a1a98f23d0d2038a2b5219459732bf46438 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 8 Mar 2020 23:26:50 +0100
Subject: [PATCH 2/4] Improve estimation of OR clauses with extended statistics

Until now, OR clauses were estimated using extended statistics only when
the whole clause (all the arguments) are compatible. If even just one
argument was found to be incompatible, the whole clause was estimated
ignoring extended statistics.

This was considered acceptable, because the estimation errors are not
that bad for OR clauses. It may however be an issue when the OR clause
is more complex, e.g. when some of the arguments are AND clauses etc.

This relaxes this restriction, using similar logic as for AND clauses.
We first estimate as much as possible using extended statistics, and
then use the existing (s1 + s2 - s1 * s2) formula for the rest.

Discussion: https://www.postgresql.org/message-id/flat/13902317.Eha0YfKkKy%40pierred-pdoc
Author: Tomas Vondra
Reviewed-by: Dean Rasheed
---
 src/backend/optimizer/path/clausesel.c        | 141 +++++++++++++++---
 src/backend/statistics/extended_stats.c       |  36 +++--
 src/backend/statistics/mcv.c                  |   5 +-
 src/include/optimizer/optimizer.h             |   7 +
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       |  61 +++++++-
 src/test/regress/sql/stats_ext.sql            |  13 +-
 8 files changed, 227 insertions(+), 42 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..2ae2e0cc70 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,62 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	Selectivity	s = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to s1 directly.
+		 */
+		s = statext_clauselist_selectivity(root, clauses, varRelid,
+										   jointype, sjinfo, rel,
+										   &estimatedclauses, true);
+	}
+
+	/* Estimate the rest of the clauses as if they were independent. */
+	return clauselist_selectivity_simple_or(root, clauses, varRelid,
+											jointype, sjinfo,
+											estimatedclauses, s);
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -351,6 +407,65 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	return s1;
 }
 
+/*
+ * clauselist_selectivity_simple_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case the
+ *	  value passed in the last parameter must be returned.
+ *	  The estimatedclauses bitmap tracks clauses that have already
+ *	  been estimated by other means.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * Our basic approach is to combine selectivities of the subclauses
+ * using the formula (s1 + s2 - s1 * s2).  It's possible some of the
+ * clauses are already estimated by other means, in which case we treat
+ * them as a single logical clause, with the selecitivity passed in the
+ * last parameter.
+ */
+Selectivity
+clauselist_selectivity_simple_or(PlannerInfo *root,
+								 List *clauses,
+								 int varRelid,
+								 JoinType jointype,
+								 SpecialJoinInfo *sjinfo,
+								 Bitmapset *estimatedclauses,
+								 Selectivity s)
+{
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s = s + s2 - s * s2;
+	}
+
+	return s;
+}
+
 /*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
@@ -735,24 +850,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index c3f0da4e23..fa1273f1ae 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity	sel = 1.0;
+	Selectivity	sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1374,8 +1375,12 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed from
 		 * MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												jointype, sjinfo, NULL);
+		if (is_or)
+			simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
+														  jointype, sjinfo, NULL, 1.0);
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with the
@@ -1383,7 +1388,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1397,8 +1402,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		stat_sel = mcv_sel + other_sel;
 		CLAMP_PROBABILITY(stat_sel);
 
-		/* Factor the estimate from this MCV to the oveall estimate. */
-		sel *= stat_sel;
+		/* Factor the estimate from this MCV to the overall estimate. */
+		if (is_or)
+			sel = sel + stat_sel - sel * stat_sel;
+		else
+			sel *= stat_sel;
 	}
 
 	return sel;
@@ -1411,13 +1419,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3147d8fedc..343011178b 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1902,7 +1902,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1915,7 +1916,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 5283995df8..67adcd4485 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -64,6 +64,13 @@ extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
 												 JoinType jointype,
 												 SpecialJoinInfo *sjinfo,
 												 Bitmapset *estimatedclauses);
+extern Selectivity clauselist_selectivity_simple_or(PlannerInfo *root,
+													List *clauses,
+													int varRelid,
+													JoinType jointype,
+													SpecialJoinInfo *sjinfo,
+													Bitmapset *estimatedclauses,
+													Selectivity s);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 2b14ab238c..6c039a81c2 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index f5d9b6c73a..e18c9a6539 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 2ef19c590a..0456ca809a 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -992,6 +992,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1052,13 +1058,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1381,12 +1380,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1403,12 +1426,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index cc353d0466..0f3a0a7a3b 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -520,6 +520,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -540,9 +542,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -754,7 +753,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -764,7 +767,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

0003-Improve-test-coverage-for-functional-depend-20200314.patchtext/plain; charset=us-asciiDownload

From 65a9edddc273dfd8cff146caab47b9940a53331f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Fri, 13 Mar 2020 21:52:10 +0100
Subject: [PATCH 3/4] Improve test coverage for functional dependencies

The regression tests for extended statistics were only using clauses of
the form (Var op Const), i.e. with Var on the left side. This adds some
queries with clauses with Var on the right, to test other code paths.

It also prints one of the functional dependencies, to test the output
function for the pg_dependencies data type.

Discussion: https://www.postgresql.org/message-id/flat/13902317.Eha0YfKkKy%40pierred-pdoc
Author: Tomas Vondra
---
 src/test/regress/expected/stats_ext.out | 31 +++++++++++++++++++++++++
 src/test/regress/sql/stats_ext.sql      |  7 ++++++
 2 files changed, 38 insertions(+)

diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 0456ca809a..c136b21279 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -536,6 +536,13 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
 -- create statistics
 CREATE STATISTICS func_deps_stat (dependencies) ON a, b, c FROM functional_dependencies;
 ANALYZE functional_dependencies;
+-- print the detected dependencies
+SELECT dependencies FROM pg_stats_ext WHERE statistics_name = 'func_deps_stat';
+                                                dependencies                                                
+------------------------------------------------------------------------------------------------------------
+ {"3 => 4": 1.000000, "3 => 6": 1.000000, "4 => 6": 1.000000, "3, 4 => 6": 1.000000, "3, 6 => 4": 1.000000}
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1''');
  estimated | actual 
 -----------+--------
@@ -697,6 +704,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE 0 = a AND 0 = b');
+ estimated | actual 
+-----------+--------
+       102 |    714
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
@@ -709,6 +722,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi
          1 |     64
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE 0 = a AND b = 0 AND 0 = c AND d = 0');
+ estimated | actual 
+-----------+--------
+         1 |     64
+(1 row)
+
 -- create separate functional dependencies
 CREATE STATISTICS functional_dependencies_multi_1 (dependencies) ON a, b FROM functional_dependencies_multi;
 CREATE STATISTICS functional_dependencies_multi_2 (dependencies) ON c, d FROM functional_dependencies_multi;
@@ -719,6 +738,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE 0 = a AND 0 = b');
+ estimated | actual 
+-----------+--------
+       714 |    714
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
@@ -731,6 +756,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi
         65 |     64
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE 0 = a AND b = 0 AND 0 = c AND d = 0');
+ estimated | actual 
+-----------+--------
+        65 |     64
+(1 row)
+
 DROP TABLE functional_dependencies_multi;
 -- MCV lists
 CREATE TABLE mcv_lists (
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 0f3a0a7a3b..9d05b639a8 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -318,6 +318,9 @@ CREATE STATISTICS func_deps_stat (dependencies) ON a, b, c FROM functional_depen
 
 ANALYZE functional_dependencies;
 
+-- print the detected dependencies
+SELECT dependencies FROM pg_stats_ext WHERE statistics_name = 'func_deps_stat';
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1''');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
@@ -391,8 +394,10 @@ ANALYZE functional_dependencies_multi;
 
 -- estimates without any functional dependencies
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE a = 0 AND b = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE 0 = a AND 0 = b');
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE 0 = a AND b = 0 AND 0 = c AND d = 0');
 
 -- create separate functional dependencies
 CREATE STATISTICS functional_dependencies_multi_1 (dependencies) ON a, b FROM functional_dependencies_multi;
@@ -401,8 +406,10 @@ CREATE STATISTICS functional_dependencies_multi_2 (dependencies) ON c, d FROM fu
 ANALYZE functional_dependencies_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE a = 0 AND b = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE 0 = a AND 0 = b');
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE 0 = a AND b = 0 AND 0 = c AND d = 0');
 
 DROP TABLE functional_dependencies_multi;
 
-- 
2.21.1

0004-Support-clauses-of-the-form-Var-op-Var-20200314.patchtext/plain; charset=us-asciiDownload

From b3c1da689573398ac7d3d009afc1028f045d19a9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sat, 14 Mar 2020 17:24:53 +0100
Subject: [PATCH 4/4] Support clauses of the form Var op Var

---
 src/backend/statistics/extended_stats.c       | 67 ++++++++++++-----
 src/backend/statistics/mcv.c                  | 71 +++++++++++++++++-
 .../statistics/extended_stats_internal.h      |  4 +-
 src/test/regress/expected/stats_ext.out       | 72 +++++++++++++++++++
 src/test/regress/sql/stats_ext.sql            | 22 ++++++
 5 files changed, 216 insertions(+), 20 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index fa1273f1ae..56ec13ddb0 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -987,14 +987,18 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 	{
 		RangeTblEntry *rte = root->simple_rte_array[relid];
 		OpExpr	   *expr = (OpExpr *) clause;
-		Var		   *var;
+		Var		   *var,
+				   *var2;
 
 		/* Only expressions with two arguments are considered compatible. */
 		if (list_length(expr->args) != 2)
 			return false;
 
-		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_clause_args(expr->args, &var, NULL, NULL))
+		/*
+		 * Check if the expression the right shape (one Var and one Const,
+		 * or two Vars).
+		 */
+		if (!examine_clause_arguments(expr->args, &var, &var2, NULL, NULL))
 			return false;
 
 		/*
@@ -1034,7 +1038,20 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			!get_func_leakproof(get_opcode(expr->opno)))
 			return false;
 
-		return statext_is_compatible_clause_internal(root, (Node *) var,
+		/*
+		 * Check compatibility of the first Var - we get this one for both
+		 * types of supported expressions (Var op Const) and (Var op Var).
+		 */
+		if (!statext_is_compatible_clause_internal(root, (Node *) var,
+												   relid, attnums))
+			return false;
+
+		/* For (Var op Const) we don't get the second Var, and we're done. */
+		if (!var2)
+			return true;
+
+		/* For (Var op Var) check compatibility of the second Var. */
+		return statext_is_compatible_clause_internal(root, (Node *) var2,
 													 relid, attnums);
 	}
 
@@ -1050,7 +1067,7 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			return false;
 
 		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_clause_args(expr->args, &var, NULL, NULL))
+		if (!examine_clause_arguments(expr->args, &var, NULL, NULL, NULL))
 			return false;
 
 		/*
@@ -1456,22 +1473,24 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 }
 
 /*
- * examine_opclause_expression
+ * examine_clause_arguments
  *		Split expression into Var and Const parts.
  *
- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.
  *
  * Optionally returns pointers to the extracted Var/Const nodes, when passed
  * non-null pointers (varp, cstp and varonleftp). The varonleftp flag specifies
  * on which side of the operator we found the Var node.
  */
 bool
-examine_clause_args(List *args, Var **varp, Const **cstp, bool *varonleftp)
+examine_clause_arguments(List *args, Var **var1p, Var **var2p, Const **cstp,
+						 bool *varonleftp)
 {
-	Var	   *var;
-	Const  *cst;
+	Var	   *var1 = NULL;
+	Var	   *var2 = NULL;
+	Const  *cst = NULL;
 	bool	varonleft;
 	Node   *leftop,
 		   *rightop;
@@ -1491,22 +1510,38 @@ examine_clause_args(List *args, Var **varp, Const **cstp, bool *varonleftp)
 
 	if (IsA(leftop, Var) && IsA(rightop, Const))
 	{
-		var = (Var *) leftop;
+		var1 = (Var *) leftop;
 		cst = (Const *) rightop;
 		varonleft = true;
 	}
 	else if (IsA(leftop, Const) && IsA(rightop, Var))
 	{
-		var = (Var *) rightop;
+		var1 = (Var *) rightop;
 		cst = (Const *) leftop;
 		varonleft = false;
 	}
+	else if (IsA(leftop, Var) && IsA(rightop, Var))
+	{
+		var1 = (Var *) leftop;
+		var2 = (Var *) rightop;
+		varonleft = false;
+
+		/*
+		 * Both variables have to be for the same relation (otherwise it's
+		 * a join clause, and we don't deal with those yet.
+		 */
+		if (var1->varno != var2->varno)
+			return false;
+	}
 	else
 		return false;
 
 	/* return pointers to the extracted parts if requested */
-	if (varp)
-		*varp = var;
+	if (var1p)
+		*var1p = var1;
+
+	if (var2p)
+		*var2p = var2;
 
 	if (cstp)
 		*cstp = cst;
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 343011178b..27986b6bbd 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1581,13 +1581,17 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 			/* valid only after examine_clause_args returns true */
 			Var		   *var;
+			Var		   *var2;
 			Const	   *cst;
 			bool		varonleft;
 
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
 			/* extract the var and const from the expression */
-			if (examine_clause_args(expr->args, &var, &cst, &varonleft))
+			if (!examine_clause_arguments(expr->args, &var, &var2, &cst, &varonleft))
+				continue;
+
+			if (cst)	/* Var op Const */
 			{
 				int			idx;
 
@@ -1651,6 +1655,68 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					matches[i] = RESULT_MERGE(matches[i], is_or, match);
 				}
 			}
+			else	/* Var op Var */
+			{
+				int			idx;
+				int			idx2;
+
+				Assert(var2);
+
+				/* match the attribute to a dimension of the statistic */
+				idx = bms_member_index(keys, var->varattno);
+				idx2 = bms_member_index(keys, var2->varattno);
+
+				/*
+				 * Walk through the MCV items and evaluate the current clause.
+				 * We can skip items that were already ruled out, and
+				 * terminate if there are no remaining MCV items that might
+				 * possibly match.
+				 */
+				for (i = 0; i < mcvlist->nitems; i++)
+				{
+					bool		match = true;
+					MCVItem    *item = &mcvlist->items[i];
+
+					/*
+					 * When either of the MCV items is NULL we can treat this
+					 * as a mismatch. We must not call the operator because
+					 * of strictness.
+					 */
+					if (item->isnull[idx] || item->isnull[idx2])
+					{
+						matches[i] = RESULT_MERGE(matches[i], is_or, false);
+						continue;
+					}
+
+					/*
+					 * Skip MCV items that can't change result in the bitmap.
+					 * Once the value gets false for AND-lists, or true for
+					 * OR-lists, we don't need to look at more clauses.
+					 */
+					if (RESULT_IS_FINAL(matches[i], is_or))
+						continue;
+
+					/*
+					 * First check whether the constant is below the lower
+					 * boundary (in that case we can skip the bucket, because
+					 * there's no overlap).
+					 *
+					 * We don't store collations used to build the statistics,
+					 * but we can use the collation for the attribute itself,
+					 * as stored in varcollid. We do reset the statistics after
+					 * a type change (including collation change), so this is
+					 * OK. We may need to relax this after allowing extended
+					 * statistics on expressions.
+					 */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   var->varcollid,
+														   item->values[idx],
+														   item->values[idx2]));
+
+					/* update the match bitmap with the result */
+					matches[i] = RESULT_MERGE(matches[i], is_or, match);
+				}
+			}
 		}
 		else if (IsA(clause, ScalarArrayOpExpr))
 		{
@@ -1665,7 +1731,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
 			/* extract the var and const from the expression */
-			if (examine_clause_args(expr->args, &var, &cst, &varonleft))
+			if (examine_clause_arguments(expr->args, &var, NULL, &cst, &varonleft))
 			{
 				int			idx;
 
@@ -1679,6 +1745,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 				/* ScalarArrayOpExpr has the Var always on the left */
 				Assert(varonleft);
+				Assert(cst);
 
 				if (!cst->constisnull)
 				{
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 6c039a81c2..035928b1b9 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -96,8 +96,8 @@ extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-extern bool examine_clause_args(List *args, Var **varp,
-								Const **cstp, bool *varonleftp);
+extern bool examine_clause_arguments(List *args, Var **var1p, Var **var2p,
+									 Const **cstp, bool *varonleftp);
 
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  StatisticExtInfo *stat,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index c136b21279..df3352b188 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -888,6 +888,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        343 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      1667 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+      1667 |      0
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1029,6 +1041,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      3750 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1232,6 +1256,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
       3750 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+        25 |   2500
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, d FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -1265,6 +1295,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
       2500 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- mcv with pass-by-ref fixlen types, e.g. uuid
 CREATE TABLE mcv_lists_uuid (
     a UUID,
@@ -1356,6 +1392,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
       1094 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      9950 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+        50 |   2500
+(1 row)
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 ANALYZE mcv_lists_bool;
@@ -1383,6 +1431,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1441,6 +1501,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR
       2649 |   1572
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+         1 |   5000
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1487,6 +1553,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR
       1571 |   1572
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+      5000 |   5000
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 9d05b639a8..60d8b2b119 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -478,6 +478,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -529,6 +533,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -636,6 +644,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, d FROM mcv_lists;
 
@@ -654,6 +664,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- mcv with pass-by-ref fixlen types, e.g. uuid
 CREATE TABLE mcv_lists_uuid (
     a UUID,
@@ -726,6 +738,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 
@@ -739,6 +755,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -765,6 +785,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -779,6 +800,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

#13

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#12)

2 attachment(s)

Re: Additional improvements to extended statistics

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

...

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Improve-estimation-of-OR-clauses-with-extend-2020315.patchtext/plain; charset=us-asciiDownload

From ed72b6da747673a8c848a4bb8676e0d5debf2c26 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 8 Mar 2020 23:26:50 +0100
Subject: [PATCH 1/2] Improve estimation of OR clauses with extended statistics

Until now, OR clauses were estimated using extended statistics only when
the whole clause (all the arguments) are compatible. If even just one
argument was found to be incompatible, the whole clause was estimated
ignoring extended statistics.

This was considered acceptable, because the estimation errors are not
that bad for OR clauses. It may however be an issue when the OR clause
is more complex, e.g. when some of the arguments are AND clauses etc.

This relaxes this restriction, using similar logic as for AND clauses.
We first estimate as much as possible using extended statistics, and
then use the existing (s1 + s2 - s1 * s2) formula for the rest.

Discussion: https://www.postgresql.org/message-id/flat/13902317.Eha0YfKkKy%40pierred-pdoc
Author: Tomas Vondra
Reviewed-by: Dean Rasheed
---
 src/backend/optimizer/path/clausesel.c        | 141 +++++++++++++++---
 src/backend/statistics/extended_stats.c       |  36 +++--
 src/backend/statistics/mcv.c                  |   5 +-
 src/include/optimizer/optimizer.h             |   7 +
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       |  61 +++++++-
 src/test/regress/sql/stats_ext.sql            |  13 +-
 8 files changed, 227 insertions(+), 42 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..2ae2e0cc70 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,62 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	Selectivity	s = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to s1 directly.
+		 */
+		s = statext_clauselist_selectivity(root, clauses, varRelid,
+										   jointype, sjinfo, rel,
+										   &estimatedclauses, true);
+	}
+
+	/* Estimate the rest of the clauses as if they were independent. */
+	return clauselist_selectivity_simple_or(root, clauses, varRelid,
+											jointype, sjinfo,
+											estimatedclauses, s);
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -351,6 +407,65 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	return s1;
 }
 
+/*
+ * clauselist_selectivity_simple_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case the
+ *	  value passed in the last parameter must be returned.
+ *	  The estimatedclauses bitmap tracks clauses that have already
+ *	  been estimated by other means.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * Our basic approach is to combine selectivities of the subclauses
+ * using the formula (s1 + s2 - s1 * s2).  It's possible some of the
+ * clauses are already estimated by other means, in which case we treat
+ * them as a single logical clause, with the selecitivity passed in the
+ * last parameter.
+ */
+Selectivity
+clauselist_selectivity_simple_or(PlannerInfo *root,
+								 List *clauses,
+								 int varRelid,
+								 JoinType jointype,
+								 SpecialJoinInfo *sjinfo,
+								 Bitmapset *estimatedclauses,
+								 Selectivity s)
+{
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s = s + s2 - s * s2;
+	}
+
+	return s;
+}
+
 /*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
@@ -735,24 +850,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index c3f0da4e23..fa1273f1ae 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity	sel = 1.0;
+	Selectivity	sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1374,8 +1375,12 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed from
 		 * MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												jointype, sjinfo, NULL);
+		if (is_or)
+			simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
+														  jointype, sjinfo, NULL, 1.0);
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with the
@@ -1383,7 +1388,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1397,8 +1402,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		stat_sel = mcv_sel + other_sel;
 		CLAMP_PROBABILITY(stat_sel);
 
-		/* Factor the estimate from this MCV to the oveall estimate. */
-		sel *= stat_sel;
+		/* Factor the estimate from this MCV to the overall estimate. */
+		if (is_or)
+			sel = sel + stat_sel - sel * stat_sel;
+		else
+			sel *= stat_sel;
 	}
 
 	return sel;
@@ -1411,13 +1419,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3147d8fedc..343011178b 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1902,7 +1902,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1915,7 +1916,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 5283995df8..67adcd4485 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -64,6 +64,13 @@ extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
 												 JoinType jointype,
 												 SpecialJoinInfo *sjinfo,
 												 Bitmapset *estimatedclauses);
+extern Selectivity clauselist_selectivity_simple_or(PlannerInfo *root,
+													List *clauses,
+													int varRelid,
+													JoinType jointype,
+													SpecialJoinInfo *sjinfo,
+													Bitmapset *estimatedclauses,
+													Selectivity s);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 2b14ab238c..6c039a81c2 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index f5d9b6c73a..e18c9a6539 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index dbc2532354..c136b21279 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1023,6 +1023,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1083,13 +1089,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1412,12 +1411,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1434,12 +1457,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 8beb04278d..9d05b639a8 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -527,6 +527,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -547,9 +549,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -761,7 +760,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -771,7 +774,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

0002-Support-clauses-of-the-form-Var-op-Var-2020315.patchtext/plain; charset=us-asciiDownload

From 280e5643caf4d98e07937ecf4ca5114ce50038af Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sat, 14 Mar 2020 17:24:53 +0100
Subject: [PATCH 2/2] Support clauses of the form Var op Var

---
 src/backend/statistics/extended_stats.c       | 67 ++++++++++++-----
 src/backend/statistics/mcv.c                  | 71 +++++++++++++++++-
 .../statistics/extended_stats_internal.h      |  4 +-
 src/test/regress/expected/stats_ext.out       | 72 +++++++++++++++++++
 src/test/regress/sql/stats_ext.sql            | 22 ++++++
 5 files changed, 216 insertions(+), 20 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index fa1273f1ae..56ec13ddb0 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -987,14 +987,18 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 	{
 		RangeTblEntry *rte = root->simple_rte_array[relid];
 		OpExpr	   *expr = (OpExpr *) clause;
-		Var		   *var;
+		Var		   *var,
+				   *var2;
 
 		/* Only expressions with two arguments are considered compatible. */
 		if (list_length(expr->args) != 2)
 			return false;
 
-		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_clause_args(expr->args, &var, NULL, NULL))
+		/*
+		 * Check if the expression the right shape (one Var and one Const,
+		 * or two Vars).
+		 */
+		if (!examine_clause_arguments(expr->args, &var, &var2, NULL, NULL))
 			return false;
 
 		/*
@@ -1034,7 +1038,20 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			!get_func_leakproof(get_opcode(expr->opno)))
 			return false;
 
-		return statext_is_compatible_clause_internal(root, (Node *) var,
+		/*
+		 * Check compatibility of the first Var - we get this one for both
+		 * types of supported expressions (Var op Const) and (Var op Var).
+		 */
+		if (!statext_is_compatible_clause_internal(root, (Node *) var,
+												   relid, attnums))
+			return false;
+
+		/* For (Var op Const) we don't get the second Var, and we're done. */
+		if (!var2)
+			return true;
+
+		/* For (Var op Var) check compatibility of the second Var. */
+		return statext_is_compatible_clause_internal(root, (Node *) var2,
 													 relid, attnums);
 	}
 
@@ -1050,7 +1067,7 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			return false;
 
 		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_clause_args(expr->args, &var, NULL, NULL))
+		if (!examine_clause_arguments(expr->args, &var, NULL, NULL, NULL))
 			return false;
 
 		/*
@@ -1456,22 +1473,24 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 }
 
 /*
- * examine_opclause_expression
+ * examine_clause_arguments
  *		Split expression into Var and Const parts.
  *
- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.
  *
  * Optionally returns pointers to the extracted Var/Const nodes, when passed
  * non-null pointers (varp, cstp and varonleftp). The varonleftp flag specifies
  * on which side of the operator we found the Var node.
  */
 bool
-examine_clause_args(List *args, Var **varp, Const **cstp, bool *varonleftp)
+examine_clause_arguments(List *args, Var **var1p, Var **var2p, Const **cstp,
+						 bool *varonleftp)
 {
-	Var	   *var;
-	Const  *cst;
+	Var	   *var1 = NULL;
+	Var	   *var2 = NULL;
+	Const  *cst = NULL;
 	bool	varonleft;
 	Node   *leftop,
 		   *rightop;
@@ -1491,22 +1510,38 @@ examine_clause_args(List *args, Var **varp, Const **cstp, bool *varonleftp)
 
 	if (IsA(leftop, Var) && IsA(rightop, Const))
 	{
-		var = (Var *) leftop;
+		var1 = (Var *) leftop;
 		cst = (Const *) rightop;
 		varonleft = true;
 	}
 	else if (IsA(leftop, Const) && IsA(rightop, Var))
 	{
-		var = (Var *) rightop;
+		var1 = (Var *) rightop;
 		cst = (Const *) leftop;
 		varonleft = false;
 	}
+	else if (IsA(leftop, Var) && IsA(rightop, Var))
+	{
+		var1 = (Var *) leftop;
+		var2 = (Var *) rightop;
+		varonleft = false;
+
+		/*
+		 * Both variables have to be for the same relation (otherwise it's
+		 * a join clause, and we don't deal with those yet.
+		 */
+		if (var1->varno != var2->varno)
+			return false;
+	}
 	else
 		return false;
 
 	/* return pointers to the extracted parts if requested */
-	if (varp)
-		*varp = var;
+	if (var1p)
+		*var1p = var1;
+
+	if (var2p)
+		*var2p = var2;
 
 	if (cstp)
 		*cstp = cst;
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 343011178b..27986b6bbd 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1581,13 +1581,17 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 			/* valid only after examine_clause_args returns true */
 			Var		   *var;
+			Var		   *var2;
 			Const	   *cst;
 			bool		varonleft;
 
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
 			/* extract the var and const from the expression */
-			if (examine_clause_args(expr->args, &var, &cst, &varonleft))
+			if (!examine_clause_arguments(expr->args, &var, &var2, &cst, &varonleft))
+				continue;
+
+			if (cst)	/* Var op Const */
 			{
 				int			idx;
 
@@ -1651,6 +1655,68 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					matches[i] = RESULT_MERGE(matches[i], is_or, match);
 				}
 			}
+			else	/* Var op Var */
+			{
+				int			idx;
+				int			idx2;
+
+				Assert(var2);
+
+				/* match the attribute to a dimension of the statistic */
+				idx = bms_member_index(keys, var->varattno);
+				idx2 = bms_member_index(keys, var2->varattno);
+
+				/*
+				 * Walk through the MCV items and evaluate the current clause.
+				 * We can skip items that were already ruled out, and
+				 * terminate if there are no remaining MCV items that might
+				 * possibly match.
+				 */
+				for (i = 0; i < mcvlist->nitems; i++)
+				{
+					bool		match = true;
+					MCVItem    *item = &mcvlist->items[i];
+
+					/*
+					 * When either of the MCV items is NULL we can treat this
+					 * as a mismatch. We must not call the operator because
+					 * of strictness.
+					 */
+					if (item->isnull[idx] || item->isnull[idx2])
+					{
+						matches[i] = RESULT_MERGE(matches[i], is_or, false);
+						continue;
+					}
+
+					/*
+					 * Skip MCV items that can't change result in the bitmap.
+					 * Once the value gets false for AND-lists, or true for
+					 * OR-lists, we don't need to look at more clauses.
+					 */
+					if (RESULT_IS_FINAL(matches[i], is_or))
+						continue;
+
+					/*
+					 * First check whether the constant is below the lower
+					 * boundary (in that case we can skip the bucket, because
+					 * there's no overlap).
+					 *
+					 * We don't store collations used to build the statistics,
+					 * but we can use the collation for the attribute itself,
+					 * as stored in varcollid. We do reset the statistics after
+					 * a type change (including collation change), so this is
+					 * OK. We may need to relax this after allowing extended
+					 * statistics on expressions.
+					 */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   var->varcollid,
+														   item->values[idx],
+														   item->values[idx2]));
+
+					/* update the match bitmap with the result */
+					matches[i] = RESULT_MERGE(matches[i], is_or, match);
+				}
+			}
 		}
 		else if (IsA(clause, ScalarArrayOpExpr))
 		{
@@ -1665,7 +1731,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
 			/* extract the var and const from the expression */
-			if (examine_clause_args(expr->args, &var, &cst, &varonleft))
+			if (examine_clause_arguments(expr->args, &var, NULL, &cst, &varonleft))
 			{
 				int			idx;
 
@@ -1679,6 +1745,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 				/* ScalarArrayOpExpr has the Var always on the left */
 				Assert(varonleft);
+				Assert(cst);
 
 				if (!cst->constisnull)
 				{
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 6c039a81c2..035928b1b9 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -96,8 +96,8 @@ extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-extern bool examine_clause_args(List *args, Var **varp,
-								Const **cstp, bool *varonleftp);
+extern bool examine_clause_arguments(List *args, Var **var1p, Var **var2p,
+									 Const **cstp, bool *varonleftp);
 
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  StatisticExtInfo *stat,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index c136b21279..df3352b188 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -888,6 +888,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        343 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      1667 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+      1667 |      0
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1029,6 +1041,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      3750 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1232,6 +1256,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
       3750 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+        25 |   2500
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, d FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -1265,6 +1295,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
       2500 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- mcv with pass-by-ref fixlen types, e.g. uuid
 CREATE TABLE mcv_lists_uuid (
     a UUID,
@@ -1356,6 +1392,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
       1094 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      9950 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+        50 |   2500
+(1 row)
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 ANALYZE mcv_lists_bool;
@@ -1383,6 +1431,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1441,6 +1501,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR
       2649 |   1572
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+         1 |   5000
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1487,6 +1553,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR
       1571 |   1572
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+      5000 |   5000
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 9d05b639a8..60d8b2b119 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -478,6 +478,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -529,6 +533,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -636,6 +644,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, d FROM mcv_lists;
 
@@ -654,6 +664,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- mcv with pass-by-ref fixlen types, e.g. uuid
 CREATE TABLE mcv_lists_uuid (
     a UUID,
@@ -726,6 +738,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 
@@ -739,6 +755,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -765,6 +785,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -779,6 +800,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

#14

Thomas Munro

thomas.munro@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#13)

Re: Additional improvements to extended statistics

On Sun, Mar 15, 2020 at 1:08 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

Some comment fixes:

-               /* Check if the expression the right shape (one Var,
one Const) */
-               if (!examine_clause_args(expr->args, &var, NULL, NULL))
+               /*
+                * Check if the expression the right shape (one Var
and one Const,
+                * or two Vars).
+                */

Check if the expression "has" or "is of" the right shape.

- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.

... match the arguments to (Var op Const), (Const op Var) or (Var op Var), ...

+               /*
+                * Both variables have to be for the same relation
(otherwise it's
+                * a join clause, and we don't deal with those yet.
+                */

Missing close parenthesis.

Stimulated by some bad plans involving JSON, I found my way to your
WIP stats-on-expressions patch in this thread. Do I understand
correctly that it will eventually also support single expressions,
like CREATE STATISTICS t_distinct_abc (ndistinct) ON
(my_jsonb_column->>'abc') FROM t? It looks like that would solve
problems that otherwise require a generated column or an expression
index just to get ndistinct.

#15

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Thomas Munro (#14)

Re: Additional improvements to extended statistics

On Sun, Mar 15, 2020 at 02:48:02PM +1300, Thomas Munro wrote:

On Sun, Mar 15, 2020 at 1:08 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

Some comment fixes:
-               /* Check if the expression the right shape (one Var,
one Const) */
-               if (!examine_clause_args(expr->args, &var, NULL, NULL))
+               /*
+                * Check if the expression the right shape (one Var
and one Const,
+                * or two Vars).
+                */
Check if the expression "has" or "is of" the right shape.
- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.
... match the arguments to (Var op Const), (Const op Var) or (Var op Var), ...
+               /*
+                * Both variables have to be for the same relation
(otherwise it's
+                * a join clause, and we don't deal with those yet.
+                */
Missing close parenthesis.

Thanks, I'll get this fixed.

Stimulated by some bad plans involving JSON, I found my way to your
WIP stats-on-expressions patch in this thread. Do I understand
correctly that it will eventually also support single expressions,
like CREATE STATISTICS t_distinct_abc (ndistinct) ON
(my_jsonb_column->>'abc') FROM t? It looks like that would solve
problems that otherwise require a generated column or an expression
index just to get ndistinct.

Yes, I think that's generally the plan. I was also thinking about
inventing some sort of special JSON statistics (e.g. extracting paths
from the JSONB and computing frequencies, or something like that). But
stats on expressions are one of the things I'd like to do in PG14.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#13)

Re: Additional improvements to extended statistics

On Sun, 15 Mar 2020 at 00:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

Patch 0001 looks to be mostly ready. Just a couple of final comments:

+       if (is_or)
+           simple_sel = clauselist_selectivity_simple_or(root,
stat_clauses, varRelid,
+                                                         jointype,
sjinfo, NULL, 1.0);
+       else

Surely that should be passing 0.0 as the final argument, otherwise it
will always just return simple_sel = 1.0.

+        *
+        * XXX We can't multiply with current value, because for OR clauses
+        * we start with 0.0, so we simply assign to s1 directly.
+        */
+       s = statext_clauselist_selectivity(root, clauses, varRelid,
+                                          jointype, sjinfo, rel,
+                                          &estimatedclauses, true);

That final part of the comment is no longer relevant (variable s1 no
longer exists). Probably it could now just be deleted, since I think
there are sufficient comments elsewhere to explain what's going on.

Otherwise it looks good, and I think this will lead to some very
worthwhile improvements.

Regards,
Dean

#17

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#16)

2 attachment(s)

Re: Additional improvements to extended statistics

On Sun, Mar 15, 2020 at 12:37:37PM +0000, Dean Rasheed wrote:

On Sun, 15 Mar 2020 at 00:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seems mostly non-controversial. I'll
wait a bit more with the two patches actually changing behavior (rebased
version attached, to keep cputube happy).

Patch 0001 looks to be mostly ready. Just a couple of final comments:
+       if (is_or)
+           simple_sel = clauselist_selectivity_simple_or(root,
stat_clauses, varRelid,
+                                                         jointype,
sjinfo, NULL, 1.0);
+       else
Surely that should be passing 0.0 as the final argument, otherwise it
will always just return simple_sel = 1.0.
+        *
+        * XXX We can't multiply with current value, because for OR clauses
+        * we start with 0.0, so we simply assign to s1 directly.
+        */
+       s = statext_clauselist_selectivity(root, clauses, varRelid,
+                                          jointype, sjinfo, rel,
+                                          &estimatedclauses, true);
That final part of the comment is no longer relevant (variable s1 no
longer exists). Probably it could now just be deleted, since I think
there are sufficient comments elsewhere to explain what's going on.

Otherwise it looks good, and I think this will lead to some very
worthwhile improvements.

Attached is a rebased patch series, addressing both those issues.

I've been wondering why none of the regression tests failed because of
the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
make the tests stable, all the MCV lists we use are "perfect" i.e. it
represents 100% of the data. But this selectivity is used to compute
selectivity only for the part not represented by the MCV list, i.e. it's
not really used. I suppose we could add a test that would use larger
MCV item, but I'm afraid that'd be inherently unstable :-(

Another thing I was thinking about is the changes to the API. We need to
pass information whether the clauses are connected by AND or OR to a
number of places, and 0001 does that in two ways. For some functions it
adds a new parameter (called is_or), and for other functiosn it creates
a new copy of a function. So for example

- statext_mcv_clauselist_selectivity
- statext_clauselist_selectivity

got the new flag, while e.g. clauselist_selectivity gets a new "copy"
sibling called clauselist_selectivity_or.

There were two reasons for not using flag. First, clauselist_selectivity
and similar functions have to do very different stuff for these two
cases, so it'd be just one huge if/else block. Second, minimizing
breakage of third-party code - pretty much all the extensions I've seen
only work with AND clauses, and call clauselist_selectivity. Adding a
flag would break that code. (Also, there's a bit of laziness, because
this was the simplest thing to do during development.)

But I wonder if that's sufficient reason - maybe we should just add the
flag in all cases. It might break some code, but the fix is trivial (add
a false there).

Opinions?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Improve-estimation-of-OR-clauses-using-exte-20200318.patchtext/plain; charset=us-asciiDownload

From 4d1b7a071ff7ea27bb991f7999133eb8d34f4720 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 8 Mar 2020 23:26:50 +0100
Subject: [PATCH 1/2] Improve estimation of OR clauses using extended
 statistics

Until now, OR clauses were estimated using extended statistics only when
the whole clause (all the arguments) are compatible. If even just one
argument was found to be incompatible, the whole clause was estimated
ignoring extended statistics. Estimation errors for OR clauses tend to
be fairly mild, so this was considered acceptable, but it may become an
issue for OR clauses with more complex arguments, etc.

This commit relaxes the restriction, using mostly the same logic as AND
clauses. We first apply extended statistics to as many arguments as
possible, and then use the (s1 + s2 - s1 * s2) formula to factor in the
remaining clauses.

The OR clause is still considered incompatible, though. If any argument
is unsupported or references variable not covered by the statistics, the
whole OR clause is incompatible. The consequence is that e.g. clauses

    (a = 1) AND (b = 1 OR c = 1 OR d = 1)

can't be estimated by statistics on (a,b,c) because the OR clause also
references "d". So we'll estimate each of the AND arguments separately,
and the extended statistics will be used only to estimate the OR clause.
This may be solved by creating statistics including the "d" column, but
the issue applies to cases where the clause type is unsupported, e.g.

    (a = 1) AND (b = 1 OR c = 1 OR mod(d,10) = 0)

which can't be solved by adding "d" to the statistics, at least for now.

Author: Tomas Vondra
Reviewed-by: Dean Rasheed, Thomas Munro
Discussion: Discussion: https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development
---
 src/backend/optimizer/path/clausesel.c        | 141 +++++++++++++++---
 src/backend/statistics/extended_stats.c       |  36 +++--
 src/backend/statistics/mcv.c                  |   5 +-
 src/include/optimizer/optimizer.h             |   7 +
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       |  61 +++++++-
 src/test/regress/sql/stats_ext.sql            |  13 +-
 8 files changed, 227 insertions(+), 42 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..ce14d47409 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,62 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	Selectivity	s = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to 's' directly.
+		 */
+		s = statext_clauselist_selectivity(root, clauses, varRelid,
+										   jointype, sjinfo, rel,
+										   &estimatedclauses, true);
+	}
+
+	/* Estimate the rest of the clauses as if they were independent. */
+	return clauselist_selectivity_simple_or(root, clauses, varRelid,
+											jointype, sjinfo,
+											estimatedclauses, s);
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -351,6 +407,65 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	return s1;
 }
 
+/*
+ * clauselist_selectivity_simple_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case the
+ *	  value passed in the last parameter must be returned.
+ *	  The estimatedclauses bitmap tracks clauses that have already
+ *	  been estimated by other means.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * Our basic approach is to combine selectivities of the subclauses
+ * using the formula (s1 + s2 - s1 * s2).  It's possible some of the
+ * clauses are already estimated by other means, in which case we treat
+ * them as a single logical clause, with the selecitivity passed in the
+ * last parameter.
+ */
+Selectivity
+clauselist_selectivity_simple_or(PlannerInfo *root,
+								 List *clauses,
+								 int varRelid,
+								 JoinType jointype,
+								 SpecialJoinInfo *sjinfo,
+								 Bitmapset *estimatedclauses,
+								 Selectivity s)
+{
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s = s + s2 - s * s2;
+	}
+
+	return s;
+}
+
 /*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
@@ -735,24 +850,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 318fdb8f6d..93d67791c5 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity	sel = 1.0;
+	Selectivity	sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1374,8 +1375,12 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed from
 		 * MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												jointype, sjinfo, NULL);
+		if (is_or)
+			simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
+														  jointype, sjinfo, NULL, 0.0);
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with the
@@ -1383,7 +1388,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1397,8 +1402,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		stat_sel = mcv_sel + other_sel;
 		CLAMP_PROBABILITY(stat_sel);
 
-		/* Factor the estimate from this MCV to the oveall estimate. */
-		sel *= stat_sel;
+		/* Factor the estimate from this MCV to the overall estimate. */
+		if (is_or)
+			sel = sel + stat_sel - sel * stat_sel;
+		else
+			sel *= stat_sel;
 	}
 
 	return sel;
@@ -1411,13 +1419,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3147d8fedc..343011178b 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1902,7 +1902,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1915,7 +1916,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 5283995df8..67adcd4485 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -64,6 +64,13 @@ extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
 												 JoinType jointype,
 												 SpecialJoinInfo *sjinfo,
 												 Bitmapset *estimatedclauses);
+extern Selectivity clauselist_selectivity_simple_or(PlannerInfo *root,
+													List *clauses,
+													int varRelid,
+													JoinType jointype,
+													SpecialJoinInfo *sjinfo,
+													Bitmapset *estimatedclauses,
+													Selectivity s);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 2b14ab238c..6c039a81c2 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index f5d9b6c73a..e18c9a6539 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index bd6d8be434..3aba4a37c6 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1075,6 +1075,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1135,13 +1141,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1464,12 +1463,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1486,12 +1509,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 897ed3051c..54ceb1f4ee 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -547,6 +547,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -567,9 +569,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -781,7 +780,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -791,7 +794,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

0002-Support-clauses-of-the-form-Var-op-Var-20200318.patchtext/plain; charset=us-asciiDownload

From deb998bcbaac39ce90b6ad47b4828e2857ae1a57 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Tue, 17 Mar 2020 02:57:12 +0100
Subject: [PATCH 2/2] Support clauses of the form Var op Var

---
 src/backend/statistics/extended_stats.c       | 67 ++++++++++++-----
 src/backend/statistics/mcv.c                  | 71 +++++++++++++++++-
 .../statistics/extended_stats_internal.h      |  2 +-
 src/test/regress/expected/stats_ext.out       | 72 +++++++++++++++++++
 src/test/regress/sql/stats_ext.sql            | 22 ++++++
 5 files changed, 215 insertions(+), 19 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 93d67791c5..f8947d9097 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -987,14 +987,18 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 	{
 		RangeTblEntry *rte = root->simple_rte_array[relid];
 		OpExpr	   *expr = (OpExpr *) clause;
-		Var		   *var;
+		Var		   *var,
+				   *var2;
 
 		/* Only expressions with two arguments are considered compatible. */
 		if (list_length(expr->args) != 2)
 			return false;
 
-		/* Check if the expression has the right shape (one Var, one Const) */
-		if (!examine_clause_args(expr->args, &var, NULL, NULL))
+		/*
+		 * Check if the expression has the right shape (one Var and one Const,
+		 * or two Vars).
+		 */
+		if (!examine_clause_args(expr->args, &var, &var2, NULL, NULL))
 			return false;
 
 		/*
@@ -1034,7 +1038,20 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			!get_func_leakproof(get_opcode(expr->opno)))
 			return false;
 
-		return statext_is_compatible_clause_internal(root, (Node *) var,
+		/*
+		 * Check compatibility of the first Var - we get this one for both
+		 * types of supported expressions (Var op Const) and (Var op Var).
+		 */
+		if (!statext_is_compatible_clause_internal(root, (Node *) var,
+												   relid, attnums))
+			return false;
+
+		/* For (Var op Const) we don't get the second Var, and we're done. */
+		if (!var2)
+			return true;
+
+		/* For (Var op Var) check compatibility of the second Var. */
+		return statext_is_compatible_clause_internal(root, (Node *) var2,
 													 relid, attnums);
 	}
 
@@ -1050,7 +1067,7 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			return false;
 
 		/* Check if the expression has the right shape (one Var, one Const) */
-		if (!examine_clause_args(expr->args, &var, NULL, NULL))
+		if (!examine_clause_args(expr->args, &var, NULL, NULL, NULL))
 			return false;
 
 		/*
@@ -1456,22 +1473,24 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 }
 
 /*
- * examine_opclause_expression
+ * examine_clause_args
  *		Split expression into Var and Const parts.
  *
- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.
  *
  * Optionally returns pointers to the extracted Var/Const nodes, when passed
  * non-null pointers (varp, cstp and varonleftp). The varonleftp flag specifies
  * on which side of the operator we found the Var node.
  */
 bool
-examine_clause_args(List *args, Var **varp, Const **cstp, bool *varonleftp)
+examine_clause_args(List *args, Var **var1p, Var **var2p, Const **cstp,
+					bool *varonleftp)
 {
-	Var	   *var;
-	Const  *cst;
+	Var	   *var1 = NULL;
+	Var	   *var2 = NULL;
+	Const  *cst = NULL;
 	bool	varonleft;
 	Node   *leftop,
 		   *rightop;
@@ -1491,22 +1510,38 @@ examine_clause_args(List *args, Var **varp, Const **cstp, bool *varonleftp)
 
 	if (IsA(leftop, Var) && IsA(rightop, Const))
 	{
-		var = (Var *) leftop;
+		var1 = (Var *) leftop;
 		cst = (Const *) rightop;
 		varonleft = true;
 	}
 	else if (IsA(leftop, Const) && IsA(rightop, Var))
 	{
-		var = (Var *) rightop;
+		var1 = (Var *) rightop;
 		cst = (Const *) leftop;
 		varonleft = false;
 	}
+	else if (IsA(leftop, Var) && IsA(rightop, Var))
+	{
+		var1 = (Var *) leftop;
+		var2 = (Var *) rightop;
+		varonleft = false;
+
+		/*
+		 * Both variables have to be for the same relation (otherwise it's
+		 * a join clause, and we don't deal with those yet.
+		 */
+		if (var1->varno != var2->varno)
+			return false;
+	}
 	else
 		return false;
 
 	/* return pointers to the extracted parts if requested */
-	if (varp)
-		*varp = var;
+	if (var1p)
+		*var1p = var1;
+
+	if (var2p)
+		*var2p = var2;
 
 	if (cstp)
 		*cstp = cst;
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 343011178b..3e58d36d68 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1581,13 +1581,17 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 			/* valid only after examine_clause_args returns true */
 			Var		   *var;
+			Var		   *var2;
 			Const	   *cst;
 			bool		varonleft;
 
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
 			/* extract the var and const from the expression */
-			if (examine_clause_args(expr->args, &var, &cst, &varonleft))
+			if (!examine_clause_args(expr->args, &var, &var2, &cst, &varonleft))
+				continue;
+
+			if (cst)	/* Var op Const */
 			{
 				int			idx;
 
@@ -1651,6 +1655,68 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					matches[i] = RESULT_MERGE(matches[i], is_or, match);
 				}
 			}
+			else	/* Var op Var */
+			{
+				int			idx;
+				int			idx2;
+
+				Assert(var2);
+
+				/* match the attribute to a dimension of the statistic */
+				idx = bms_member_index(keys, var->varattno);
+				idx2 = bms_member_index(keys, var2->varattno);
+
+				/*
+				 * Walk through the MCV items and evaluate the current clause.
+				 * We can skip items that were already ruled out, and
+				 * terminate if there are no remaining MCV items that might
+				 * possibly match.
+				 */
+				for (i = 0; i < mcvlist->nitems; i++)
+				{
+					bool		match = true;
+					MCVItem    *item = &mcvlist->items[i];
+
+					/*
+					 * When either of the MCV items is NULL we can treat this
+					 * as a mismatch. We must not call the operator because
+					 * of strictness.
+					 */
+					if (item->isnull[idx] || item->isnull[idx2])
+					{
+						matches[i] = RESULT_MERGE(matches[i], is_or, false);
+						continue;
+					}
+
+					/*
+					 * Skip MCV items that can't change result in the bitmap.
+					 * Once the value gets false for AND-lists, or true for
+					 * OR-lists, we don't need to look at more clauses.
+					 */
+					if (RESULT_IS_FINAL(matches[i], is_or))
+						continue;
+
+					/*
+					 * First check whether the constant is below the lower
+					 * boundary (in that case we can skip the bucket, because
+					 * there's no overlap).
+					 *
+					 * We don't store collations used to build the statistics,
+					 * but we can use the collation for the attribute itself,
+					 * as stored in varcollid. We do reset the statistics after
+					 * a type change (including collation change), so this is
+					 * OK. We may need to relax this after allowing extended
+					 * statistics on expressions.
+					 */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   var->varcollid,
+														   item->values[idx],
+														   item->values[idx2]));
+
+					/* update the match bitmap with the result */
+					matches[i] = RESULT_MERGE(matches[i], is_or, match);
+				}
+			}
 		}
 		else if (IsA(clause, ScalarArrayOpExpr))
 		{
@@ -1665,7 +1731,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
 			/* extract the var and const from the expression */
-			if (examine_clause_args(expr->args, &var, &cst, &varonleft))
+			if (examine_clause_args(expr->args, &var, NULL, &cst, &varonleft))
 			{
 				int			idx;
 
@@ -1679,6 +1745,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 
 				/* ScalarArrayOpExpr has the Var always on the left */
 				Assert(varonleft);
+				Assert(cst);
 
 				if (!cst->constisnull)
 				{
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 6c039a81c2..ee34a897c8 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -96,7 +96,7 @@ extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-extern bool examine_clause_args(List *args, Var **varp,
+extern bool examine_clause_args(List *args, Var **var1p, Var **var2p,
 								Const **cstp, bool *varonleftp);
 
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 3aba4a37c6..481be40141 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -940,6 +940,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        343 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      1667 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+      1667 |      0
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1081,6 +1093,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+ estimated | actual 
+-----------+--------
+      3750 |   3750
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1284,6 +1308,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
       3750 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+        25 |   2500
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, d FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -1317,6 +1347,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND
       2500 |   2500
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- mcv with pass-by-ref fixlen types, e.g. uuid
 CREATE TABLE mcv_lists_uuid (
     a UUID,
@@ -1408,6 +1444,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
       1094 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      9950 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+        50 |   2500
+(1 row)
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 ANALYZE mcv_lists_bool;
@@ -1435,6 +1483,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+ estimated | actual 
+-----------+--------
+      2500 |   2500
+(1 row)
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1493,6 +1553,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR
       2649 |   1572
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+         1 |   5000
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1539,6 +1605,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR
       1571 |   1572
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
+ estimated | actual 
+-----------+--------
+      5000 |   5000
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 54ceb1f4ee..f5e4468129 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -498,6 +498,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -549,6 +553,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a > c');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < c');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -656,6 +664,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, d FROM mcv_lists;
 
@@ -674,6 +684,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND (b = ''x'' OR d = ''x'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE b = d');
+
 -- mcv with pass-by-ref fixlen types, e.g. uuid
 CREATE TABLE mcv_lists_uuid (
     a UUID,
@@ -746,6 +758,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
   FROM mcv_lists_bool;
 
@@ -759,6 +775,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a = b AND b = c');
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -785,6 +805,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -799,6 +820,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = b AND c = d');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

#18

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#17)

Re: Additional improvements to extended statistics

On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Attached is a rebased patch series, addressing both those issues.

I've been wondering why none of the regression tests failed because of
the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
make the tests stable, all the MCV lists we use are "perfect" i.e. it
represents 100% of the data. But this selectivity is used to compute
selectivity only for the part not represented by the MCV list, i.e. it's
not really used. I suppose we could add a test that would use larger
MCV item, but I'm afraid that'd be inherently unstable :-(

I think it ought to be possible to write stable tests for this code
branch -- I think all you need is for the number of rows to remain
small, so that the stats sample every row and are predictable, while
the MCVs don't cover all values, which can be achieved with a mix of
some common values, and some that only occur once.

I haven't tried it, but it seems like it would be possible in principle.

Another thing I was thinking about is the changes to the API. We need to
pass information whether the clauses are connected by AND or OR to a
number of places, and 0001 does that in two ways. For some functions it
adds a new parameter (called is_or), and for other functiosn it creates
a new copy of a function. So for example

- statext_mcv_clauselist_selectivity
- statext_clauselist_selectivity

got the new flag, while e.g. clauselist_selectivity gets a new "copy"
sibling called clauselist_selectivity_or.

There were two reasons for not using flag. First, clauselist_selectivity
and similar functions have to do very different stuff for these two
cases, so it'd be just one huge if/else block. Second, minimizing
breakage of third-party code - pretty much all the extensions I've seen
only work with AND clauses, and call clauselist_selectivity. Adding a
flag would break that code. (Also, there's a bit of laziness, because
this was the simplest thing to do during development.)

But I wonder if that's sufficient reason - maybe we should just add the
flag in all cases. It might break some code, but the fix is trivial (add
a false there).

Opinions?

-1

I think of clause_selectivity() and clauselist_selectivity() as the
public API that everyone is using, whilst the functions that support
lists of clauses to be combined using OR are internal (to the planner)
implementation details. I think callers of public API tend to either
have implicitly AND'ed list of clauses, or a single OR clause.

Regards,
Dean

#19

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#18)

1 attachment(s)

Re: Additional improvements to extended statistics

On Thu, Mar 19, 2020 at 07:08:07PM +0000, Dean Rasheed wrote:

On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Attached is a rebased patch series, addressing both those issues.

I've been wondering why none of the regression tests failed because of
the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
make the tests stable, all the MCV lists we use are "perfect" i.e. it
represents 100% of the data. But this selectivity is used to compute
selectivity only for the part not represented by the MCV list, i.e. it's
not really used. I suppose we could add a test that would use larger
MCV item, but I'm afraid that'd be inherently unstable :-(

I think it ought to be possible to write stable tests for this code
branch -- I think all you need is for the number of rows to remain
small, so that the stats sample every row and are predictable, while
the MCVs don't cover all values, which can be achieved with a mix of
some common values, and some that only occur once.

I haven't tried it, but it seems like it would be possible in principle.

Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.

Another thing I was thinking about is the changes to the API. We need to
pass information whether the clauses are connected by AND or OR to a
number of places, and 0001 does that in two ways. For some functions it
adds a new parameter (called is_or), and for other functiosn it creates
a new copy of a function. So for example

- statext_mcv_clauselist_selectivity
- statext_clauselist_selectivity

got the new flag, while e.g. clauselist_selectivity gets a new "copy"
sibling called clauselist_selectivity_or.

There were two reasons for not using flag. First, clauselist_selectivity
and similar functions have to do very different stuff for these two
cases, so it'd be just one huge if/else block. Second, minimizing
breakage of third-party code - pretty much all the extensions I've seen
only work with AND clauses, and call clauselist_selectivity. Adding a
flag would break that code. (Also, there's a bit of laziness, because
this was the simplest thing to do during development.)

But I wonder if that's sufficient reason - maybe we should just add the
flag in all cases. It might break some code, but the fix is trivial (add
a false there).

Opinions?

-1

I think of clause_selectivity() and clauselist_selectivity() as the
public API that everyone is using, whilst the functions that support
lists of clauses to be combined using OR are internal (to the planner)
implementation details. I think callers of public API tend to either
have implicitly AND'ed list of clauses, or a single OR clause.

OK, thanks. That was mostly my reasoning too - not wanting to cause
unnecessary breakage. And yes, I agree most people just call
clauselist_selectivity with a list of clauses combined using AND.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Improve-estimation-of-OR-clauses-using-exte-20200321.patchtext/plain; charset=us-asciiDownload

From f94929e936670d189fd628430fd73e49e410e71a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sat, 21 Mar 2020 22:29:24 +0100
Subject: [PATCH] Improve estimation of OR clauses using extended statistics

Until now, OR clauses were estimated using extended statistics only when
the whole clause (all the arguments) are compatible. If even just one
argument was found to be incompatible, the whole clause was estimated
ignoring extended statistics. Estimation errors for OR clauses tend to
be fairly mild, so this was considered acceptable, but it may become an
issue for OR clauses with more complex arguments, etc.

This commit relaxes the restriction, using mostly the same logic as AND
clauses. We first apply extended statistics to as many arguments as
possible, and then use the (s1 + s2 - s1 * s2) formula to factor in the
remaining clauses.

The OR clause is still considered incompatible, though. If any argument
is unsupported or references variable not covered by the statistics, the
whole OR clause is incompatible. The consequence is that e.g. clauses

    (a = 1) AND (b = 1 OR c = 1 OR d = 1)

can't be estimated by statistics on (a,b,c) because the OR clause also
references "d". So we'll estimate each of the AND arguments separately,
and the extended statistics will be used only to estimate the OR clause.
This may be solved by creating statistics including the "d" column, but
the issue applies to cases where the clause type is unsupported, e.g.

    (a = 1) AND (b = 1 OR c = 1 OR mod(d,10) = 0)

which can't be solved by adding "d" to the statistics, at least for now.

Author: Tomas Vondra
Reviewed-by: Dean Rasheed, Thomas Munro
Discussion: Discussion: https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development
---
 src/backend/optimizer/path/clausesel.c        | 141 +++++++++++++--
 src/backend/statistics/extended_stats.c       |  36 ++--
 src/backend/statistics/mcv.c                  |   5 +-
 src/include/optimizer/optimizer.h             |   7 +
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       | 165 +++++++++++++++++-
 src/test/regress/sql/stats_ext.sql            |  77 +++++++-
 8 files changed, 395 insertions(+), 42 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..ce14d47409 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,62 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	Selectivity	s = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to 's' directly.
+		 */
+		s = statext_clauselist_selectivity(root, clauses, varRelid,
+										   jointype, sjinfo, rel,
+										   &estimatedclauses, true);
+	}
+
+	/* Estimate the rest of the clauses as if they were independent. */
+	return clauselist_selectivity_simple_or(root, clauses, varRelid,
+											jointype, sjinfo,
+											estimatedclauses, s);
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -351,6 +407,65 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	return s1;
 }
 
+/*
+ * clauselist_selectivity_simple_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case the
+ *	  value passed in the last parameter must be returned.
+ *	  The estimatedclauses bitmap tracks clauses that have already
+ *	  been estimated by other means.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * Our basic approach is to combine selectivities of the subclauses
+ * using the formula (s1 + s2 - s1 * s2).  It's possible some of the
+ * clauses are already estimated by other means, in which case we treat
+ * them as a single logical clause, with the selecitivity passed in the
+ * last parameter.
+ */
+Selectivity
+clauselist_selectivity_simple_or(PlannerInfo *root,
+								 List *clauses,
+								 int varRelid,
+								 JoinType jointype,
+								 SpecialJoinInfo *sjinfo,
+								 Bitmapset *estimatedclauses,
+								 Selectivity s)
+{
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s = s + s2 - s * s2;
+	}
+
+	return s;
+}
+
 /*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
@@ -735,24 +850,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 318fdb8f6d..93d67791c5 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity	sel = 1.0;
+	Selectivity	sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1374,8 +1375,12 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed from
 		 * MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												jointype, sjinfo, NULL);
+		if (is_or)
+			simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
+														  jointype, sjinfo, NULL, 0.0);
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with the
@@ -1383,7 +1388,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1397,8 +1402,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		stat_sel = mcv_sel + other_sel;
 		CLAMP_PROBABILITY(stat_sel);
 
-		/* Factor the estimate from this MCV to the oveall estimate. */
-		sel *= stat_sel;
+		/* Factor the estimate from this MCV to the overall estimate. */
+		if (is_or)
+			sel = sel + stat_sel - sel * stat_sel;
+		else
+			sel *= stat_sel;
 	}
 
 	return sel;
@@ -1411,13 +1419,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3147d8fedc..343011178b 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1902,7 +1902,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1915,7 +1916,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 5283995df8..67adcd4485 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -64,6 +64,13 @@ extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
 												 JoinType jointype,
 												 SpecialJoinInfo *sjinfo,
 												 Bitmapset *estimatedclauses);
+extern Selectivity clauselist_selectivity_simple_or(PlannerInfo *root,
+													List *clauses,
+													int varRelid,
+													JoinType jointype,
+													SpecialJoinInfo *sjinfo,
+													Bitmapset *estimatedclauses,
+													Selectivity s);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 2b14ab238c..6c039a81c2 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index f5d9b6c73a..e18c9a6539 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index bd6d8be434..b8565b9891 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1075,6 +1075,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1135,13 +1141,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1436,6 +1435,110 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+         1 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       300 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         6 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       204 |    104
+(1 row)
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       102 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       504 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         2 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+        10 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       412 |    104
+(1 row)
+
+DROP TABLE mcv_lists_partial;
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1464,12 +1567,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1486,12 +1613,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 897ed3051c..bf14dabf65 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -547,6 +547,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -567,9 +569,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -760,6 +759,70 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+DROP TABLE mcv_lists_partial;
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -781,7 +844,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -791,7 +858,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.21.1

#20

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#19)

Re: Additional improvements to extended statistics

On Sat, 21 Mar 2020 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.

Looking through those new tests, another issue occurred to me -- under
some circumstances this patch can lead to extended stats being applied
twice to the same clause, which is not great, because it involves
quite a lot of extra work, and also because it can lead to
overestimates because of the way that MCV stats are applied as a delta
correction to simple_sel.

The way this comes about is as follows -- if we start with an OR
clause, that will be passed as a single-item implicitly AND'ed list to
clauselist_selectivity(), and from there to
statext_clauselist_selectivity() and then to
statext_mcv_clauselist_selectivity(). This will call
clauselist_selectivity_simple() to get simple_sel, before calling
mcv_clauselist_selectivity(), which will recursively compute all the
MCV corrections. However, the call to clauselist_selectivity_simple()
will call clause_selectivity() for each clause (just a single OR
clause in this example), which will now call
clauselist_selectivity_or(), which will go back into
statext_clauselist_selectivity() with "is_or = true", which will apply
the MCV corrections to the same set of clauses that the outer call was
about to process.

I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.

Regards,
Dean

#21

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#20)

Re: Additional improvements to extended statistics

On Mon, Mar 23, 2020 at 08:21:42AM +0000, Dean Rasheed wrote:

On Sat, 21 Mar 2020 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.

Looking through those new tests, another issue occurred to me -- under
some circumstances this patch can lead to extended stats being applied
twice to the same clause, which is not great, because it involves
quite a lot of extra work, and also because it can lead to
overestimates because of the way that MCV stats are applied as a delta
correction to simple_sel.

The way this comes about is as follows -- if we start with an OR
clause, that will be passed as a single-item implicitly AND'ed list to
clauselist_selectivity(), and from there to
statext_clauselist_selectivity() and then to
statext_mcv_clauselist_selectivity(). This will call
clauselist_selectivity_simple() to get simple_sel, before calling
mcv_clauselist_selectivity(), which will recursively compute all the
MCV corrections. However, the call to clauselist_selectivity_simple()
will call clause_selectivity() for each clause (just a single OR
clause in this example), which will now call
clauselist_selectivity_or(), which will go back into
statext_clauselist_selectivity() with "is_or = true", which will apply
the MCV corrections to the same set of clauses that the outer call was
about to process.

Hmmm. So let's consider a simple OR clause with two arguments, both
covered by single statistics object. Something like this:

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT mod(i, 10), mod(i, 10)
FROM generate_series(1,100000);
CREATE STATISTICS s (mcv) ON a,b FROM t;

SELECT * FROM t WHERE a = 0 OR b = 0;

Which is estimated correctly, but the calls graph looks like this:

clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <---
clauselist_selectivity_simple
clause_selectivity
clauselist_selectivity_or
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <---
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple_or
mcv_clauselist_selectivity
clauselist_selectivity_simple
(already estimated)

IIUC the problem you have in mind is that we end up calling
statext_mcv_clauselist_selectivity twice, once for the top-level AND
clause with a single argument, and then recursively for the expanded OR
clause. Indeed, that seems to be applying the correction twice.

I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.

I don't know. I feel uneasy about just ignoring some of the clauses,
because what happens for complex clauses, where the OR is not directly
in the AND clause, but is negated or something like that?

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need to
add a flag (or _simple variant) to clause_selectivity, so that it calls
causelist_selectivity_simple_or. So the calls would look like this:

clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity
clauselist_selectivity_simple <--- disable extended stats
clause_selectivity
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple
already estimated

I've only quickly hacked clause_selectivity, but it does not seems very
invasive (of course, it means disruption to clause_selectivity callers,
but I suppose most call clauselist_selectivity).

BTW do you have an example where this would actually cause an issue?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#22

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#21)

Re: Additional improvements to extended statistics

On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hmmm. So let's consider a simple OR clause with two arguments, both
covered by single statistics object. Something like this:

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT mod(i, 10), mod(i, 10)
FROM generate_series(1,100000);
CREATE STATISTICS s (mcv) ON a,b FROM t;

SELECT * FROM t WHERE a = 0 OR b = 0;

Which is estimated correctly...

Hmm, the reason that is estimated correctly is that the MCV values
cover all values in the table, so mcv_totalsel is 1 (or pretty close
to 1), and other_sel is capped to nearly 0, and so the result is
basically just mcv_sel -- i.e., in this case the MCV estimates are
returned more-or-less as-is, rather than being applied as a
correction, and so the result is independent of how many times
extended stats are applied.

The more interesting (and probably more realistic) case is where the
MCV values do not cover the all values in the table, as in the new
mcv_lists_partial examples in the regression tests, for example this
test case, which produces a significant overestimate:

SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0

although actually, I think there's another reason for that (in
addition to the extended stats correction being applied twice) -- the
way the MCV code makes use of base selectivity doesn't seem really
appropriate for OR clauses, because the way base_frequency is computed
is really based on the assumption that every column would be matched.
I'm not sure that there's any easy answer to that though. I feels like
what's needed when computing the match bitmaps in mcv.c, is to produce
a bitmap (would it fit in 1 byte?) per value, to indicate which
columns match, and base_frequency values per column. That would be
significantly more work though, so almost certainly isn't feasible for
PG13.

IIUC the problem you have in mind is that we end up calling
statext_mcv_clauselist_selectivity twice, once for the top-level AND
clause with a single argument, and then recursively for the expanded OR
clause. Indeed, that seems to be applying the correction twice.

I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.

I don't know. I feel uneasy about just ignoring some of the clauses,
because what happens for complex clauses, where the OR is not directly
in the AND clause, but is negated or something like that?

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need to
add a flag (or _simple variant) to clause_selectivity, so that it calls
causelist_selectivity_simple_or. So the calls would look like this:

clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity
clauselist_selectivity_simple <--- disable extended stats
clause_selectivity
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple
already estimated

I've only quickly hacked clause_selectivity, but it does not seems very
invasive (of course, it means disruption to clause_selectivity callers,
but I suppose most call clauselist_selectivity).

Sounds like a reasonable approach, but I think it would be better to
preserve the current public API by having clauselist_selectivity()
become a thin wrapper around a new function that optionally applies
extended stats.

Regards,
Dean

#23

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#22)

Re: Additional improvements to extended statistics

On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote:

On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hmmm. So let's consider a simple OR clause with two arguments, both
covered by single statistics object. Something like this:

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT mod(i, 10), mod(i, 10)
FROM generate_series(1,100000);
CREATE STATISTICS s (mcv) ON a,b FROM t;

SELECT * FROM t WHERE a = 0 OR b = 0;

Which is estimated correctly...

Hmm, the reason that is estimated correctly is that the MCV values
cover all values in the table, so mcv_totalsel is 1 (or pretty close
to 1), and other_sel is capped to nearly 0, and so the result is
basically just mcv_sel -- i.e., in this case the MCV estimates are
returned more-or-less as-is, rather than being applied as a
correction, and so the result is independent of how many times
extended stats are applied.

The more interesting (and probably more realistic) case is where the
MCV values do not cover the all values in the table, as in the new
mcv_lists_partial examples in the regression tests, for example this
test case, which produces a significant overestimate:

SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0

although actually, I think there's another reason for that (in
addition to the extended stats correction being applied twice) -- the
way the MCV code makes use of base selectivity doesn't seem really
appropriate for OR clauses, because the way base_frequency is computed
is really based on the assumption that every column would be matched.
I'm not sure that there's any easy answer to that though. I feels like
what's needed when computing the match bitmaps in mcv.c, is to produce
a bitmap (would it fit in 1 byte?) per value, to indicate which
columns match, and base_frequency values per column. That would be
significantly more work though, so almost certainly isn't feasible for
PG13.

Good point. I haven't thought about the base frequencies. I think 1 byte
should be enough, as we limit the number of columns to 8.

IIUC the problem you have in mind is that we end up calling
statext_mcv_clauselist_selectivity twice, once for the top-level AND
clause with a single argument, and then recursively for the expanded OR
clause. Indeed, that seems to be applying the correction twice.

I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.

I don't know. I feel uneasy about just ignoring some of the clauses,
because what happens for complex clauses, where the OR is not directly
in the AND clause, but is negated or something like that?

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need to
add a flag (or _simple variant) to clause_selectivity, so that it calls
causelist_selectivity_simple_or. So the calls would look like this:

clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity
clauselist_selectivity_simple <--- disable extended stats
clause_selectivity
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple
already estimated

I've only quickly hacked clause_selectivity, but it does not seems very
invasive (of course, it means disruption to clause_selectivity callers,
but I suppose most call clauselist_selectivity).

Sounds like a reasonable approach, but I think it would be better to
preserve the current public API by having clauselist_selectivity()
become a thin wrapper around a new function that optionally applies
extended stats.

OK, makes sense. I'll take a stab at it.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24

Thomas Munro

thomas.munro@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#15)

Re: Additional improvements to extended statistics

On Sun, Mar 15, 2020 at 3:23 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sun, Mar 15, 2020 at 02:48:02PM +1300, Thomas Munro wrote:

Stimulated by some bad plans involving JSON, I found my way to your
WIP stats-on-expressions patch in this thread. Do I understand
correctly that it will eventually also support single expressions,
like CREATE STATISTICS t_distinct_abc (ndistinct) ON
(my_jsonb_column->>'abc') FROM t? It looks like that would solve
problems that otherwise require a generated column or an expression
index just to get ndistinct.

Yes, I think that's generally the plan. I was also thinking about
inventing some sort of special JSON statistics (e.g. extracting paths
from the JSONB and computing frequencies, or something like that). But
stats on expressions are one of the things I'd like to do in PG14.

Interesting idea. If you had simple single-expression statistics, I
suppose a cave-person version of this would be to write a
script/stored procedure that extracts the distinct set of JSON paths
and does CREATE STATISTICS for expressions to access each path. That
said, I suspect that in many cases there's a small set of a paths and
a human designer would know what to do. I didn't manage to try your
WIP stats-on-expressions patch due to bitrot and unfinished parts, but
I am hoping it just needs to remove the "if (numcols < 2)
ereport(ERROR ...)" check to get a very very useful thing.

#25

Daniel Gustafsson

daniel@yesql.se

over 5 years ago

In reply to: Tomas Vondra (#23)

Re: Additional improvements to extended statistics

On 24 Mar 2020, at 15:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote:

Sounds like a reasonable approach, but I think it would be better to
preserve the current public API by having clauselist_selectivity()
become a thin wrapper around a new function that optionally applies
extended stats.

OK, makes sense. I'll take a stab at it.

Have you had time to hack on this? The proposed patch no longer applies, so
I've marked the entry Waiting on Author.

cheers ./daniel

#26

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 5 years ago

In reply to: Daniel Gustafsson (#25)

1 attachment(s)

Re: Additional improvements to extended statistics

On Wed, Jul 01, 2020 at 01:19:40PM +0200, Daniel Gustafsson wrote:

On 24 Mar 2020, at 15:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote:

Sounds like a reasonable approach, but I think it would be better to
preserve the current public API by having clauselist_selectivity()
become a thin wrapper around a new function that optionally applies
extended stats.

OK, makes sense. I'll take a stab at it.

Have you had time to hack on this? The proposed patch no longer applies, so
I've marked the entry Waiting on Author.

Yep, here's a rebased patch. This does not include the changes we've
discussed with Dean in March, but I plan to address that soon.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Improve-estimation-of-OR-clauses-using-exte-20200703.patchtext/plain; charset=us-asciiDownload

From 3120d4b483c203bb83e114a1099f03555591fe5b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 3 Jul 2020 03:06:33 +0200
Subject: [PATCH] Improve estimation of OR clauses using extended statistics

Until now, OR clauses were estimated using extended statistics only when
the whole clause (all the arguments) are compatible. If even just one
argument was found to be incompatible, the whole clause was estimated
ignoring extended statistics. Estimation errors for OR clauses tend to
be fairly mild, so this was considered acceptable, but it may become an
issue for OR clauses with more complex arguments, etc.

This commit relaxes the restriction, using mostly the same logic as AND
clauses. We first apply extended statistics to as many arguments as
possible, and then use the (s1 + s2 - s1 * s2) formula to factor in the
remaining clauses.

The OR clause is still considered incompatible, though. If any argument
is unsupported or references variable not covered by the statistics, the
whole OR clause is incompatible. The consequence is that e.g. clauses

    (a = 1) AND (b = 1 OR c = 1 OR d = 1)

can't be estimated by statistics on (a,b,c) because the OR clause also
references "d". So we'll estimate each of the AND arguments separately,
and the extended statistics will be used only to estimate the OR clause.
This may be solved by creating statistics including the "d" column, but
the issue applies to cases where the clause type is unsupported, e.g.

    (a = 1) AND (b = 1 OR c = 1 OR mod(d,10) = 0)

which can't be solved by adding "d" to the statistics, at least for now.

Author: Tomas Vondra
Reviewed-by: Dean Rasheed, Thomas Munro
Discussion: Discussion: https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development
---
 src/backend/optimizer/path/clausesel.c        | 141 +++++++++++++--
 src/backend/statistics/extended_stats.c       |  36 ++--
 src/backend/statistics/mcv.c                  |   5 +-
 src/include/optimizer/optimizer.h             |   7 +
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       | 165 +++++++++++++++++-
 src/test/regress/sql/stats_ext.sql            |  77 +++++++-
 8 files changed, 395 insertions(+), 42 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a3ebe10592..ce14d47409 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,62 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	Selectivity	s = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to 's' directly.
+		 */
+		s = statext_clauselist_selectivity(root, clauses, varRelid,
+										   jointype, sjinfo, rel,
+										   &estimatedclauses, true);
+	}
+
+	/* Estimate the rest of the clauses as if they were independent. */
+	return clauselist_selectivity_simple_or(root, clauses, varRelid,
+											jointype, sjinfo,
+											estimatedclauses, s);
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -351,6 +407,65 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	return s1;
 }
 
+/*
+ * clauselist_selectivity_simple_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case the
+ *	  value passed in the last parameter must be returned.
+ *	  The estimatedclauses bitmap tracks clauses that have already
+ *	  been estimated by other means.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * Our basic approach is to combine selectivities of the subclauses
+ * using the formula (s1 + s2 - s1 * s2).  It's possible some of the
+ * clauses are already estimated by other means, in which case we treat
+ * them as a single logical clause, with the selecitivity passed in the
+ * last parameter.
+ */
+Selectivity
+clauselist_selectivity_simple_or(PlannerInfo *root,
+								 List *clauses,
+								 int varRelid,
+								 JoinType jointype,
+								 SpecialJoinInfo *sjinfo,
+								 Bitmapset *estimatedclauses,
+								 Selectivity s)
+{
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s = s + s2 - s * s2;
+	}
+
+	return s;
+}
+
 /*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
@@ -735,24 +850,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index ab6f1e1c9d..9373f7feb5 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity sel = 1.0;
+	Selectivity sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1377,8 +1378,12 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed
 		 * from MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												   jointype, sjinfo, NULL);
+		if (is_or)
+			simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
+														  jointype, sjinfo, NULL, 0.0);
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with
@@ -1386,7 +1391,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1403,8 +1408,11 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		stat_sel = mcv_sel + other_sel;
 		CLAMP_PROBABILITY(stat_sel);
 
-		/* Factor the estimate from this MCV to the oveall estimate. */
-		sel *= stat_sel;
+		/* Factor the estimate from this MCV to the overall estimate. */
+		if (is_or)
+			sel = sel + stat_sel - sel * stat_sel;
+		else
+			sel *= stat_sel;
 	}
 
 	return sel;
@@ -1417,13 +1425,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 6a262f1543..7c5841ed37 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1904,7 +1904,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1917,7 +1918,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 3e4171056e..de24d7e1d4 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -64,6 +64,13 @@ extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
 												 JoinType jointype,
 												 SpecialJoinInfo *sjinfo,
 												 Bitmapset *estimatedclauses);
+extern Selectivity clauselist_selectivity_simple_or(PlannerInfo *root,
+													List *clauses,
+													int varRelid,
+													JoinType jointype,
+													SpecialJoinInfo *sjinfo,
+													Bitmapset *estimatedclauses,
+													Selectivity s);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 61e69696cf..08d639e66a 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 50fce4935f..c9ed21155c 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 0ae779a3b9..a95aa72c7b 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1095,6 +1095,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1155,13 +1161,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1459,6 +1458,110 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+         1 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       300 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         6 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       204 |    104
+(1 row)
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       102 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       504 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         2 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+        10 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       412 |    104
+(1 row)
+
+DROP TABLE mcv_lists_partial;
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1488,12 +1591,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1510,12 +1637,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 2834a902a7..a1aa66db36 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -559,6 +559,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -579,9 +581,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -775,6 +774,70 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+DROP TABLE mcv_lists_partial;
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -797,7 +860,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -807,7 +874,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.25.4

#27

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 5 years ago

In reply to: Tomas Vondra (#26)

3 attachment(s)

Re: Additional improvements to extended statistics

Hi,

Here is an improved WIP version of the patch series, modified to address
the issue with repeatedly applying the extended statistics, as discussed
with Dean in this thread. It's a bit rough and not committable, but I
need some feedback so I'm posting it in this state.

(Note: The WIP patch is expected to fail regression tests. A couple
stats_ext regression tests fail due to changed estimate - I've left it
like that to make the changes more obvious for now.)

Earlier in this thread I used this example:

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT mod(i, 10), mod(i, 10)
FROM generate_series(1,100000) s(i);
CREATE STATISTICS s (mcv) ON a,b FROM t;
ANALYZE t;

EXPLAIN SELECT * FROM t WHERE a = 0 OR b = 0;

which had this call graph with two statext_mcv_clauselist_selectivity
calls (which was kinda the issue):

clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <--- (1)
clauselist_selectivity_simple
clause_selectivity
clauselist_selectivity_or
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <--- (2)
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple_or
mcv_clauselist_selectivity
clauselist_selectivity_simple
(already estimated)

with the patches applied, the call looks like this:

clauselist_selectivity_internal (use_extended_stats=1)
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity (is_or=0)
clauselist_selectivity_simple
clause_selectivity_internal (use_extended_stats=0)
clauselist_selectivity_or (use_extended_stats=0)
clauselist_selectivity_simple_or
clause_selectivity_internal (use_extended_stats=0)
clause_selectivity_internal (use_extended_stats=0)
mcv_clauselist_selectivity (is_or=0)
clauselist_selectivity_simple

The nested call is removed, which I think addresses the issue. As for
the effect on estimates, there's a couple regression tests where the
estimates change - not much though, an example is:

SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial
WHERE a = 0 OR b = 0 OR c = 10');
estimated | actual
-----------+--------
- 412 | 104
+ 308 | 104
(1 row)

This is on top of 0001, though. Interestingly enough, this ends up with
the same estimate as current master, but I consider that a coincidence.

As for the patches:

0001 is the original patch improving estimates of OR clauses

0002 adds thin wrappers for clause[list]_selectivity, with "internal"
functions allowing to specify whether to keep considering extended stats

0003 does the same for the "simple" functions

I've kept it like this to demonstrate that 0002 is not sufficient. In my
response from March 24 I wrote this:

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need
to add a flag (or _simple variant) to clause_selectivity, so that it
calls causelist_selectivity_simple_or.

But that's actually wrong, as 0002 shows (as it breaks a couple of
regression tests), because of the way we handle OR clauses. At the top
level, an OR-clause is actually just a single clause and it may get
passed to clauselist_selectivity_simple. So entirely disabling extended
stats for the "simple" functions would also mean disabling extended
stats for a large number of OR clauses. Which is clearly wrong.

So 0003 addresses that, by adding a flag to the two "simple" functions.
Ultimately, this should probably do the same thing as 0002 and add thin
wrappers, because the existing functions are part of the public API.

Dean, does this address the issue you had in mind? Can you come up with
an example of that issue in the form of a regression test or something?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Improve-estimation-of-OR-clauses-using-exte-20201112.patchtext/x-patch; charset=UTF-8; name=0001-Improve-estimation-of-OR-clauses-using-exte-20201112.patchDownload

From a6b4ddc2777e29170f6caef41ec25a75899d14d3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Thu, 12 Nov 2020 00:39:17 +0100
Subject: [PATCH 1/4] Improve estimation of OR clauses using extended
 statistics

Until now, OR clauses were estimated using extended statistics only when
the whole clause (all the arguments) are compatible. If even just one
argument was found to be incompatible, the whole clause was estimated
ignoring extended statistics. Estimation errors for OR clauses tend to
be fairly mild, so this was considered acceptable, but it may become an
issue for OR clauses with more complex arguments, etc.

This commit relaxes the restriction, using mostly the same logic as AND
clauses. We first apply extended statistics to as many arguments as
possible, and then use the (s1 + s2 - s1 * s2) formula to factor in the
remaining clauses.

The OR clause is still considered incompatible, though. If any argument
is unsupported or references variable not covered by the statistics, the
whole OR clause is incompatible. The consequence is that e.g. clauses

    (a = 1) AND (b = 1 OR c = 1 OR d = 1)

can't be estimated by statistics on (a,b,c) because the OR clause also
references "d". So we'll estimate each of the AND arguments separately,
and the extended statistics will be used only to estimate the OR clause.
This may be solved by creating statistics including the "d" column, but
the issue applies to cases where the clause type is unsupported, e.g.

    (a = 1) AND (b = 1 OR c = 1 OR mod(d,10) = 0)

which can't be solved by adding "d" to the statistics, at least for now.

Author: Tomas Vondra
Reviewed-by: Dean Rasheed, Thomas Munro
Discussion: Discussion: https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development
---
 src/backend/optimizer/path/clausesel.c        | 141 +++++++++++++--
 src/backend/statistics/extended_stats.c       |  34 +++-
 src/backend/statistics/mcv.c                  |   5 +-
 src/include/optimizer/optimizer.h             |   7 +
 .../statistics/extended_stats_internal.h      |   3 +-
 src/include/statistics/statistics.h           |   3 +-
 src/test/regress/expected/stats_ext.out       | 165 +++++++++++++++++-
 src/test/regress/sql/stats_ext.sql            |  77 +++++++-
 8 files changed, 394 insertions(+), 41 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 37a735b06b..bbe9591aff 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -92,7 +92,7 @@ clauselist_selectivity(PlannerInfo *root,
 		 */
 		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &estimatedclauses);
+											 &estimatedclauses, false);
 	}
 
 	/*
@@ -104,6 +104,62 @@ clauselist_selectivity(PlannerInfo *root,
 											  estimatedclauses);
 }
 
+/*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated using regular statistics tracked
+ * for individual columns.  This is done by simply passing the clauses to
+ * clauselist_selectivity and then combining the selectivities using the
+ * regular formula (s1+s2 - s1*s2).
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo)
+{
+	Selectivity	s = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses
+		 * we start with 0.0, so we simply assign to 's' directly.
+		 */
+		s = statext_clauselist_selectivity(root, clauses, varRelid,
+										   jointype, sjinfo, rel,
+										   &estimatedclauses, true);
+	}
+
+	/* Estimate the rest of the clauses as if they were independent. */
+	return clauselist_selectivity_simple_or(root, clauses, varRelid,
+											jointype, sjinfo,
+											estimatedclauses, s);
+}
+
 /*
  * clauselist_selectivity_simple -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
@@ -350,6 +406,65 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	return s1;
 }
 
+/*
+ * clauselist_selectivity_simple_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case the
+ *	  value passed in the last parameter must be returned.
+ *	  The estimatedclauses bitmap tracks clauses that have already
+ *	  been estimated by other means.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * Our basic approach is to combine selectivities of the subclauses
+ * using the formula (s1 + s2 - s1 * s2).  It's possible some of the
+ * clauses are already estimated by other means, in which case we treat
+ * them as a single logical clause, with the selecitivity passed in the
+ * last parameter.
+ */
+Selectivity
+clauselist_selectivity_simple_or(PlannerInfo *root,
+								 List *clauses,
+								 int varRelid,
+								 JoinType jointype,
+								 SpecialJoinInfo *sjinfo,
+								 Bitmapset *estimatedclauses,
+								 Selectivity s)
+{
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Selectivities of the remaining clauses for an OR clause are computed
+	 * as s1+s2 - s1*s2 to account for the probable overlap of selected tuple
+	 * sets. The clauses estimated using extended statistics are effectively
+	 * treated as a single clause.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity(root,
+								(Node *) lfirst(lc),
+								varRelid,
+								jointype,
+								sjinfo);
+
+		s = s + s2 - s * s2;
+	}
+
+	return s;
+}
+
 /*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
@@ -734,24 +849,14 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 36326927c6..f5493bcad6 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity sel = 1.0;
+	Selectivity sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1377,8 +1378,12 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 * columns/clauses. We'll then use the various selectivities computed
 		 * from MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												   jointype, sjinfo, NULL);
+		if (is_or)
+			simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
+														  jointype, sjinfo, NULL, 0.0);
+		else
+			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+													   jointype, sjinfo, NULL);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with
@@ -1386,7 +1391,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1404,7 +1409,10 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		CLAMP_PROBABILITY(stat_sel);
 
 		/* Factor the estimate from this MCV to the overall estimate. */
-		sel *= stat_sel;
+		if (is_or)
+			sel = sel + stat_sel - sel * stat_sel;
+		else
+			sel *= stat_sel;
 	}
 
 	return sel;
@@ -1417,13 +1425,21 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 6a262f1543..7c5841ed37 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1904,7 +1904,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1917,7 +1918,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 3e4171056e..de24d7e1d4 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -64,6 +64,13 @@ extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
 												 JoinType jointype,
 												 SpecialJoinInfo *sjinfo,
 												 Bitmapset *estimatedclauses);
+extern Selectivity clauselist_selectivity_simple_or(PlannerInfo *root,
+													List *clauses,
+													int varRelid,
+													JoinType jointype,
+													SpecialJoinInfo *sjinfo,
+													Bitmapset *estimatedclauses,
+													Selectivity s);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 61e69696cf..08d639e66a 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 50fce4935f..c9ed21155c 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 4c3edd213f..0577631ac4 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1113,6 +1113,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1173,13 +1179,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1477,6 +1476,110 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+         1 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       300 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         6 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       204 |    104
+(1 row)
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       102 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       504 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         2 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+        10 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       412 |    104
+(1 row)
+
+DROP TABLE mcv_lists_partial;
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1506,12 +1609,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1528,12 +1655,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AN
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 9781e590a3..3ec6dda094 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -561,6 +561,8 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE 4 >= a AND ''0
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -581,9 +583,6 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -777,6 +776,70 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+DROP TABLE mcv_lists_partial;
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -799,7 +862,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -809,7 +876,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;
 
-- 
2.26.2

0002-WIP-don-t-use-extended-statistics-twice-20201112.patchtext/x-patch; charset=UTF-8; name=0002-WIP-don-t-use-extended-statistics-twice-20201112.patchDownload

From 26ea36f62cc30ffb44524e8bfa703ae28d2fef76 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Thu, 12 Nov 2020 02:53:45 +0100
Subject: [PATCH 2/4] WIP: don't use extended statistics twice

---
 src/backend/optimizer/path/clausesel.c | 112 ++++++++++++++++++-------
 1 file changed, 80 insertions(+), 32 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index bbe9591aff..215c5db9fd 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -45,6 +45,22 @@ static void addRangeClause(RangeQueryClause **rqlist, Node *clause,
 static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
 											   List *clauses);
 
+static Selectivity
+clause_selectivity_internal(PlannerInfo *root,
+							Node *clause,
+							int varRelid,
+							JoinType jointype,
+							SpecialJoinInfo *sjinfo,
+							bool use_extended_stats);
+
+static Selectivity
+clauselist_selectivity_internal(PlannerInfo *root,
+								List *clauses,
+								int varRelid,
+								JoinType jointype,
+								SpecialJoinInfo *sjinfo,
+								bool use_extended_stats);
+
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
  ****************************************************************************/
@@ -71,6 +87,18 @@ clauselist_selectivity(PlannerInfo *root,
 					   int varRelid,
 					   JoinType jointype,
 					   SpecialJoinInfo *sjinfo)
+{
+	return clauselist_selectivity_internal(root, clauses, varRelid,
+										   jointype, sjinfo, true);
+}
+
+static Selectivity
+clauselist_selectivity_internal(PlannerInfo *root,
+								List *clauses,
+								int varRelid,
+								JoinType jointype,
+								SpecialJoinInfo *sjinfo,
+								bool use_extended_stats)
 {
 	Selectivity s1 = 1.0;
 	RelOptInfo *rel;
@@ -81,7 +109,7 @@ clauselist_selectivity(PlannerInfo *root,
 	 * it has extended statistics, try to apply those.
 	 */
 	rel = find_single_rel_for_clauses(root, clauses);
-	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
 	{
 		/*
 		 * Estimate as many clauses as possible using extended statistics.
@@ -126,7 +154,8 @@ clauselist_selectivity_or(PlannerInfo *root,
 						  List *clauses,
 						  int varRelid,
 						  JoinType jointype,
-						  SpecialJoinInfo *sjinfo)
+						  SpecialJoinInfo *sjinfo,
+						  bool use_extended_stats)
 {
 	Selectivity	s = 0.0;
 	RelOptInfo *rel;
@@ -137,7 +166,7 @@ clauselist_selectivity_or(PlannerInfo *root,
 	 * it has extended statistics, try to apply those.
 	 */
 	rel = find_single_rel_for_clauses(root, clauses);
-	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
 	{
 		/*
 		 * Estimate as many clauses as possible using extended statistics.
@@ -221,8 +250,8 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	 * relevant.
 	 */
 	if (list_length(clauses) == 1 && bms_is_empty(estimatedclauses))
-		return clause_selectivity(root, (Node *) linitial(clauses),
-								  varRelid, jointype, sjinfo);
+		return clause_selectivity_internal(root, (Node *) linitial(clauses),
+										   varRelid, jointype, sjinfo, false);
 
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
@@ -246,7 +275,8 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			continue;
 
 		/* Always compute the selectivity using clause_selectivity */
-		s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
+		s2 = clause_selectivity_internal(root, clause, varRelid,
+										 jointype, sjinfo, false);
 
 		/*
 		 * Check for being passed a RestrictInfo.
@@ -453,11 +483,12 @@ clauselist_selectivity_simple_or(PlannerInfo *root,
 		if (bms_is_member(listidx, estimatedclauses))
 			continue;
 
-		s2 = clause_selectivity(root,
-								(Node *) lfirst(lc),
-								varRelid,
-								jointype,
-								sjinfo);
+		s2 = clause_selectivity_internal(root,
+										 (Node *) lfirst(lc),
+										 varRelid,
+										 jointype,
+										 sjinfo,
+										 false);
 
 		s = s + s2 - s * s2;
 	}
@@ -716,6 +747,18 @@ clause_selectivity(PlannerInfo *root,
 				   int varRelid,
 				   JoinType jointype,
 				   SpecialJoinInfo *sjinfo)
+{
+	return clause_selectivity_internal(root, clause, varRelid,
+									   jointype, sjinfo, true);
+}
+
+static Selectivity
+clause_selectivity_internal(PlannerInfo *root,
+							Node *clause,
+							int varRelid,
+							JoinType jointype,
+							SpecialJoinInfo *sjinfo,
+							bool use_extended_stats)
 {
 	Selectivity s1 = 0.5;		/* default for any unhandled clause type */
 	RestrictInfo *rinfo = NULL;
@@ -831,20 +874,22 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_notclause(clause))
 	{
 		/* inverse of the selectivity of the underlying clause */
-		s1 = 1.0 - clause_selectivity(root,
-									  (Node *) get_notclausearg((Expr *) clause),
-									  varRelid,
-									  jointype,
-									  sjinfo);
+		s1 = 1.0 - clause_selectivity_internal(root,
+											   (Node *) get_notclausearg((Expr *) clause),
+											   varRelid,
+											   jointype,
+											   sjinfo,
+											   use_extended_stats);
 	}
 	else if (is_andclause(clause))
 	{
 		/* share code with clauselist_selectivity() */
-		s1 = clauselist_selectivity(root,
-									((BoolExpr *) clause)->args,
-									varRelid,
-									jointype,
-									sjinfo);
+		s1 = clauselist_selectivity_internal(root,
+											 ((BoolExpr *) clause)->args,
+											 varRelid,
+											 jointype,
+											 sjinfo,
+											 use_extended_stats);
 	}
 	else if (is_orclause(clause))
 	{
@@ -856,7 +901,8 @@ clause_selectivity(PlannerInfo *root,
 									   ((BoolExpr *) clause)->args,
 									   varRelid,
 									   jointype,
-									   sjinfo);
+									   sjinfo,
+									   use_extended_stats);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
@@ -957,20 +1003,22 @@ clause_selectivity(PlannerInfo *root,
 	else if (IsA(clause, RelabelType))
 	{
 		/* Not sure this case is needed, but it can't hurt */
-		s1 = clause_selectivity(root,
-								(Node *) ((RelabelType *) clause)->arg,
-								varRelid,
-								jointype,
-								sjinfo);
+		s1 = clause_selectivity_internal(root,
+										 (Node *) ((RelabelType *) clause)->arg,
+										 varRelid,
+										 jointype,
+										 sjinfo,
+										 use_extended_stats);
 	}
 	else if (IsA(clause, CoerceToDomain))
 	{
 		/* Not sure this case is needed, but it can't hurt */
-		s1 = clause_selectivity(root,
-								(Node *) ((CoerceToDomain *) clause)->arg,
-								varRelid,
-								jointype,
-								sjinfo);
+		s1 = clause_selectivity_internal(root,
+										 (Node *) ((CoerceToDomain *) clause)->arg,
+										 varRelid,
+										 jointype,
+										 sjinfo,
+										 use_extended_stats);
 	}
 	else
 	{
-- 
2.26.2

0003-WIP-pass-flag-20201112.patchtext/x-patch; charset=UTF-8; name=0003-WIP-pass-flag-20201112.patchDownload

From f3bfc5fed5c7aec084ca3394a073ebbfdce69459 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Thu, 12 Nov 2020 03:09:21 +0100
Subject: [PATCH 3/4] WIP: pass flag

---
 src/backend/optimizer/path/clausesel.c  | 20 +++++++++++++-------
 src/backend/statistics/extended_stats.c |  6 ++++--
 src/include/optimizer/optimizer.h       |  6 ++++--
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 215c5db9fd..360f5c6512 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -129,7 +129,8 @@ clauselist_selectivity_internal(PlannerInfo *root,
 	 */
 	return s1 * clauselist_selectivity_simple(root, clauses, varRelid,
 											  jointype, sjinfo,
-											  estimatedclauses);
+											  estimatedclauses,
+											  use_extended_stats);
 }
 
 /*
@@ -186,7 +187,8 @@ clauselist_selectivity_or(PlannerInfo *root,
 	/* Estimate the rest of the clauses as if they were independent. */
 	return clauselist_selectivity_simple_or(root, clauses, varRelid,
 											jointype, sjinfo,
-											estimatedclauses, s);
+											estimatedclauses, s,
+											use_extended_stats);
 }
 
 /*
@@ -237,7 +239,8 @@ clauselist_selectivity_simple(PlannerInfo *root,
 							  int varRelid,
 							  JoinType jointype,
 							  SpecialJoinInfo *sjinfo,
-							  Bitmapset *estimatedclauses)
+							  Bitmapset *estimatedclauses,
+							  bool use_extended_stats)
 {
 	Selectivity s1 = 1.0;
 	RangeQueryClause *rqlist = NULL;
@@ -251,7 +254,8 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	 */
 	if (list_length(clauses) == 1 && bms_is_empty(estimatedclauses))
 		return clause_selectivity_internal(root, (Node *) linitial(clauses),
-										   varRelid, jointype, sjinfo, false);
+										   varRelid, jointype, sjinfo,
+										   use_extended_stats);
 
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
@@ -276,7 +280,8 @@ clauselist_selectivity_simple(PlannerInfo *root,
 
 		/* Always compute the selectivity using clause_selectivity */
 		s2 = clause_selectivity_internal(root, clause, varRelid,
-										 jointype, sjinfo, false);
+										 jointype, sjinfo,
+										 use_extended_stats);
 
 		/*
 		 * Check for being passed a RestrictInfo.
@@ -459,7 +464,8 @@ clauselist_selectivity_simple_or(PlannerInfo *root,
 								 JoinType jointype,
 								 SpecialJoinInfo *sjinfo,
 								 Bitmapset *estimatedclauses,
-								 Selectivity s)
+								 Selectivity s,
+								 bool use_extended_stats)
 {
 	ListCell   *lc;
 	int			listidx;
@@ -488,7 +494,7 @@ clauselist_selectivity_simple_or(PlannerInfo *root,
 										 varRelid,
 										 jointype,
 										 sjinfo,
-										 false);
+										 use_extended_stats);
 
 		s = s + s2 - s * s2;
 	}
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index f5493bcad6..ebc4887ee6 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1380,10 +1380,12 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		 */
 		if (is_or)
 			simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
-														  jointype, sjinfo, NULL, 0.0);
+														  jointype, sjinfo, NULL, 0.0,
+														  false);
 		else
 			simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-													   jointype, sjinfo, NULL);
+													   jointype, sjinfo, NULL,
+													   false);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index de24d7e1d4..bae13ad1c5 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -63,14 +63,16 @@ extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
 												 int varRelid,
 												 JoinType jointype,
 												 SpecialJoinInfo *sjinfo,
-												 Bitmapset *estimatedclauses);
+												 Bitmapset *estimatedclauses,
+												 bool use_extended_stats);
 extern Selectivity clauselist_selectivity_simple_or(PlannerInfo *root,
 													List *clauses,
 													int varRelid,
 													JoinType jointype,
 													SpecialJoinInfo *sjinfo,
 													Bitmapset *estimatedclauses,
-													Selectivity s);
+													Selectivity s,
+													bool use_extended_stats);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
-- 
2.26.2

#28

Dean Rasheed

dean.a.rasheed@gmail.com

about 5 years ago

In reply to: Tomas Vondra (#27)

Re: Additional improvements to extended statistics

On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is an improved WIP version of the patch series, modified to address
the issue with repeatedly applying the extended statistics, as discussed
with Dean in this thread. It's a bit rough and not committable, but I
need some feedback so I'm posting it in this state.

Cool. I haven't forgotten that I promised to look at this.

Dean, does this address the issue you had in mind? Can you come up with
an example of that issue in the form of a regression test or something?

I'm quite busy with my day job at the moment, but I expect to have
time to review this next week.

Regards,
Dean

#29

Dean Rasheed

dean.a.rasheed@gmail.com

about 5 years ago

In reply to: Tomas Vondra (#27)

1 attachment(s)

Re: Additional improvements to extended statistics

On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is an improved WIP version of the patch series, modified to address
the issue with repeatedly applying the extended statistics, as discussed
with Dean in this thread. It's a bit rough and not committable, but I
need some feedback so I'm posting it in this state.

As it stands, it doesn't compile if 0003 is applied, because it missed
one of the callers of clauselist_selectivity_simple(), but that's
easily fixed.

0001 is the original patch improving estimates of OR clauses

0002 adds thin wrappers for clause[list]_selectivity, with "internal"
functions allowing to specify whether to keep considering extended stats

0003 does the same for the "simple" functions

I've kept it like this to demonstrate that 0002 is not sufficient. In my
response from March 24 I wrote this:

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need
to add a flag (or _simple variant) to clause_selectivity, so that it
calls causelist_selectivity_simple_or.

But that's actually wrong, as 0002 shows (as it breaks a couple of
regression tests), because of the way we handle OR clauses. At the top
level, an OR-clause is actually just a single clause and it may get
passed to clauselist_selectivity_simple. So entirely disabling extended
stats for the "simple" functions would also mean disabling extended
stats for a large number of OR clauses. Which is clearly wrong.

So 0003 addresses that, by adding a flag to the two "simple" functions.
Ultimately, this should probably do the same thing as 0002 and add thin
wrappers, because the existing functions are part of the public API.

I agree that, taken together, these patches fix the
multiple-extended-stats-evaluation issue. However:

I think this has ended up with too many variants of these functions,
since we now have "_internal" and "_simple" variants, and you're
proposing adding more. The original purpose of the "_simple" variants
was to compute selectivities without looking at extended stats, and
now the "_internal" variants compute selectivities with an additional
"use_extended_stats" flag to control whether or not to look at
extended stats. Thus they're basically the same, and could be rolled
together.

Additionally, it's worth noting that the "_simple" variants expose the
"estimatedclauses" bitmap as an argument, which IMO is a bit messy as
an API. All callers of the "_simple" functions outside of clausesel.c
actually pass in estimatedclauses=NULL, so it's possible to refactor
and get rid of that, turning estimatedclauses into a purely internal
variable.

Also, it's quite messy that clauselist_selectivity_simple_or() needs
to be passed a Selectivity input (the final argument) that is the
selectivity of any already-estimated clauses, or the value to return
if no not-already-estimated clauses are found, and must be 0.0 when
called from the extended stats code.

Attached is the kind of thing I had in mind (as a single patch, since
I don't think it's worth splitting up). This replaces the "_simple"
and "_internal" variants of these functions with "_opt_ext_stats"
variants whose signatures match the originals except for having the
single extra "use_extended_stats" boolean parameter. Additionally, the
"_simple" functions are merged into the originals (making them more
like they were in PG11) so that the "estimatedclauses" bitmap and
partial-OR-list Selectivity become internal details, no longer exposed
in the API.

Regards,
Dean

Attachments:

improve-estimation-of-OR-clauses-20201117.patchtext/x-patch; charset=US-ASCII; name=improve-estimation-of-OR-clauses-20201117.patchDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
new file mode 100644
index 37a735b..ab16a4c
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -44,6 +44,12 @@ static void addRangeClause(RangeQueryCla
 						   bool varonleft, bool isLTsel, Selectivity s2);
 static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
 											   List *clauses);
+static Selectivity clause_selectivity_opt_ext_stats(PlannerInfo *root,
+													Node *clause,
+													int varRelid,
+													JoinType jointype,
+													SpecialJoinInfo *sjinfo,
+													bool use_extended_stats);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -61,64 +67,8 @@ static RelOptInfo *find_single_rel_for_c
  *
  * The basic approach is to apply extended statistics first, on as many
  * clauses as possible, in order to capture cross-column dependencies etc.
- * The remaining clauses are then estimated using regular statistics tracked
- * for individual columns.  This is done by simply passing the clauses to
- * clauselist_selectivity_simple.
- */
-Selectivity
-clauselist_selectivity(PlannerInfo *root,
-					   List *clauses,
-					   int varRelid,
-					   JoinType jointype,
-					   SpecialJoinInfo *sjinfo)
-{
-	Selectivity s1 = 1.0;
-	RelOptInfo *rel;
-	Bitmapset  *estimatedclauses = NULL;
-
-	/*
-	 * Determine if these clauses reference a single relation.  If so, and if
-	 * it has extended statistics, try to apply those.
-	 */
-	rel = find_single_rel_for_clauses(root, clauses);
-	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
-	{
-		/*
-		 * Estimate as many clauses as possible using extended statistics.
-		 *
-		 * 'estimatedclauses' tracks the 0-based list position index of
-		 * clauses that we've estimated using extended statistics, and that
-		 * should be ignored.
-		 */
-		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
-											 jointype, sjinfo, rel,
-											 &estimatedclauses);
-	}
-
-	/*
-	 * Apply normal selectivity estimates for the remaining clauses, passing
-	 * 'estimatedclauses' so that it skips already estimated ones.
-	 */
-	return s1 * clauselist_selectivity_simple(root, clauses, varRelid,
-											  jointype, sjinfo,
-											  estimatedclauses);
-}
-
-/*
- * clauselist_selectivity_simple -
- *	  Compute the selectivity of an implicitly-ANDed list of boolean
- *	  expression clauses.  The list can be empty, in which case 1.0
- *	  must be returned.  List elements may be either RestrictInfos
- *	  or bare expression clauses --- the former is preferred since
- *	  it allows caching of results.  The estimatedclauses bitmap tracks
- *	  clauses that have already been estimated by other means.
- *
- * See clause_selectivity() for the meaning of the additional parameters.
- *
- * Our basic approach is to take the product of the selectivities of the
- * subclauses.  However, that's only right if the subclauses have independent
- * probabilities, and in reality they are often NOT independent.  So,
- * we want to be smarter where we can.
+ * The remaining clauses are then estimated by taking the product of their
+ * selectivities.
  *
  * We also recognize "range queries", such as "x > 34 AND x < 42".  Clauses
  * are recognized as possible range query components if they are restriction
@@ -147,28 +97,64 @@ clauselist_selectivity(PlannerInfo *root
  * selectivity functions; perhaps some day we can generalize the approach.
  */
 Selectivity
-clauselist_selectivity_simple(PlannerInfo *root,
-							  List *clauses,
-							  int varRelid,
-							  JoinType jointype,
-							  SpecialJoinInfo *sjinfo,
-							  Bitmapset *estimatedclauses)
+clauselist_selectivity(PlannerInfo *root,
+					   List *clauses,
+					   int varRelid,
+					   JoinType jointype,
+					   SpecialJoinInfo *sjinfo)
+{
+	return clauselist_selectivity_opt_ext_stats(root, clauses, varRelid,
+												jointype, sjinfo, true);
+}
+
+Selectivity
+clauselist_selectivity_opt_ext_stats(PlannerInfo *root,
+									 List *clauses,
+									 int varRelid,
+									 JoinType jointype,
+									 SpecialJoinInfo *sjinfo,
+									 bool use_extended_stats)
 {
 	Selectivity s1 = 1.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
 
 	/*
-	 * If there's exactly one clause (and it was not estimated yet), just go
-	 * directly to clause_selectivity(). None of what we might do below is
-	 * relevant.
+	 * If there's exactly one clause, just go directly to
+	 * clause_selectivity(). None of what we might do below is relevant.
 	 */
-	if (list_length(clauses) == 1 && bms_is_empty(estimatedclauses))
-		return clause_selectivity(root, (Node *) linitial(clauses),
-								  varRelid, jointype, sjinfo);
+	if (list_length(clauses) == 1)
+		return clause_selectivity_opt_ext_stats(root,
+												(Node *) linitial(clauses),
+												varRelid, jointype, sjinfo,
+												use_extended_stats);
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 */
+		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
+											 jointype, sjinfo, rel,
+											 &estimatedclauses, false);
+	}
 
 	/*
+	 * Apply normal selectivity estimates for remaining clauses. We'll be
+	 * careful to skip any clauses which were already estimated above.
+	 *
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
 	 * an rqlist entry.
@@ -190,7 +176,9 @@ clauselist_selectivity_simple(PlannerInf
 			continue;
 
 		/* Always compute the selectivity using clause_selectivity */
-		s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
+		s2 = clause_selectivity_opt_ext_stats(root, clause, varRelid,
+											  jointype, sjinfo,
+											  use_extended_stats);
 
 		/*
 		 * Check for being passed a RestrictInfo.
@@ -351,6 +339,91 @@ clauselist_selectivity_simple(PlannerInf
 }
 
 /*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated as if they were independent.
+ */
+Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo,
+						  bool use_extended_stats)
+{
+	Selectivity s1 = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses we
+		 * start with 0.0, so we simply assign to 's' directly.
+		 */
+		s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+											jointype, sjinfo, rel,
+											&estimatedclauses, true);
+	}
+
+	/*
+	 * Estimate the remaining clauses.
+	 *
+	 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to account
+	 * for the probable overlap of selected tuple sets.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/*
+		 * Skip this clause if it's already been estimated by some other
+		 * statistics above.
+		 */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity_opt_ext_stats(root,
+											  (Node *) lfirst(lc),
+											  varRelid,
+											  jointype,
+											  sjinfo,
+											  use_extended_stats);
+
+		s1 = s1 + s2 - s1 * s2;
+	}
+
+	return s1;
+}
+
+/*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
  * Here is where we try to match up pairs of range-query clauses
@@ -602,6 +675,18 @@ clause_selectivity(PlannerInfo *root,
 				   JoinType jointype,
 				   SpecialJoinInfo *sjinfo)
 {
+	return clause_selectivity_opt_ext_stats(root, clause, varRelid,
+											jointype, sjinfo, true);
+}
+
+static Selectivity
+clause_selectivity_opt_ext_stats(PlannerInfo *root,
+								 Node *clause,
+								 int varRelid,
+								 JoinType jointype,
+								 SpecialJoinInfo *sjinfo,
+								 bool use_extended_stats)
+{
 	Selectivity s1 = 0.5;		/* default for any unhandled clause type */
 	RestrictInfo *rinfo = NULL;
 	bool		cacheable = false;
@@ -716,42 +801,35 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_notclause(clause))
 	{
 		/* inverse of the selectivity of the underlying clause */
-		s1 = 1.0 - clause_selectivity(root,
-									  (Node *) get_notclausearg((Expr *) clause),
-									  varRelid,
-									  jointype,
-									  sjinfo);
+		s1 = 1.0 - clause_selectivity_opt_ext_stats(root,
+													(Node *) get_notclausearg((Expr *) clause),
+													varRelid,
+													jointype,
+													sjinfo,
+													use_extended_stats);
 	}
 	else if (is_andclause(clause))
 	{
 		/* share code with clauselist_selectivity() */
-		s1 = clauselist_selectivity(root,
-									((BoolExpr *) clause)->args,
-									varRelid,
-									jointype,
-									sjinfo);
+		s1 = clauselist_selectivity_opt_ext_stats(root,
+												  ((BoolExpr *) clause)->args,
+												  varRelid,
+												  jointype,
+												  sjinfo,
+												  use_extended_stats);
 	}
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo,
+									   use_extended_stats);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
@@ -852,20 +930,22 @@ clause_selectivity(PlannerInfo *root,
 	else if (IsA(clause, RelabelType))
 	{
 		/* Not sure this case is needed, but it can't hurt */
-		s1 = clause_selectivity(root,
-								(Node *) ((RelabelType *) clause)->arg,
-								varRelid,
-								jointype,
-								sjinfo);
+		s1 = clause_selectivity_opt_ext_stats(root,
+											  (Node *) ((RelabelType *) clause)->arg,
+											  varRelid,
+											  jointype,
+											  sjinfo,
+											  use_extended_stats);
 	}
 	else if (IsA(clause, CoerceToDomain))
 	{
 		/* Not sure this case is needed, but it can't hurt */
-		s1 = clause_selectivity(root,
-								(Node *) ((CoerceToDomain *) clause)->arg,
-								varRelid,
-								jointype,
-								sjinfo);
+		s1 = clause_selectivity_opt_ext_stats(root,
+											  (Node *) ((CoerceToDomain *) clause)->arg,
+											  varRelid,
+											  jointype,
+											  sjinfo,
+											  use_extended_stats);
 	}
 	else
 	{
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
new file mode 100644
index d950b4e..d05afa0
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -1073,8 +1073,9 @@ clauselist_apply_dependencies(PlannerInf
 			}
 		}
 
-		simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid,
-												   jointype, sjinfo, NULL);
+		simple_sel = clauselist_selectivity_opt_ext_stats(root, attr_clauses,
+														  varRelid, jointype,
+														  sjinfo, false);
 		attr_sel[attidx++] = simple_sel;
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
new file mode 100644
index 3632692..06213b6
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity sel = 1.0;
+	Selectivity sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1377,8 +1378,17 @@ statext_mcv_clauselist_selectivity(Plann
 		 * columns/clauses. We'll then use the various selectivities computed
 		 * from MCV list to improve it.
 		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												   jointype, sjinfo, NULL);
+		if (is_or)
+			simple_sel = clauselist_selectivity_or(root, stat_clauses,
+												   varRelid, jointype, sjinfo,
+												   false);
+		else
+			simple_sel = clauselist_selectivity_opt_ext_stats(root,
+															  stat_clauses,
+															  varRelid,
+															  jointype,
+															  sjinfo,
+															  false);
 
 		/*
 		 * Now compute the multi-column estimate from the MCV list, along with
@@ -1386,7 +1396,7 @@ statext_mcv_clauselist_selectivity(Plann
 		 */
 		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
 											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+											 &mcv_basesel, &mcv_totalsel, is_or);
 
 		/* Estimated selectivity of values not covered by MCV matches */
 		other_sel = simple_sel - mcv_basesel;
@@ -1404,7 +1414,10 @@ statext_mcv_clauselist_selectivity(Plann
 		CLAMP_PROBABILITY(stat_sel);
 
 		/* Factor the estimate from this MCV to the overall estimate. */
-		sel *= stat_sel;
+		if (is_or)
+			sel = sel + stat_sel - sel * stat_sel;
+		else
+			sel *= stat_sel;
 	}
 
 	return sel;
@@ -1417,13 +1430,21 @@ statext_mcv_clauselist_selectivity(Plann
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
new file mode 100644
index 6a262f1..7c5841e
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1904,7 +1904,8 @@ mcv_clauselist_selectivity(PlannerInfo *
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   bool is_or)
 {
 	int			i;
 	MCVList    *mcv;
@@ -1917,7 +1918,7 @@ mcv_clauselist_selectivity(PlannerInfo *
 	mcv = statext_mcv_load(stat->statOid);
 
 	/* build a match bitmap for the clauses */
-	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+	matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
 
 	/* sum frequencies for all the matching MCV items */
 	*basesel = 0.0;
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
new file mode 100644
index 3e41710..27f7985
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -58,17 +58,23 @@ extern Selectivity clause_selectivity(Pl
 									  int varRelid,
 									  JoinType jointype,
 									  SpecialJoinInfo *sjinfo);
-extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
-												 List *clauses,
-												 int varRelid,
-												 JoinType jointype,
-												 SpecialJoinInfo *sjinfo,
-												 Bitmapset *estimatedclauses);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
 										  JoinType jointype,
 										  SpecialJoinInfo *sjinfo);
+extern Selectivity clauselist_selectivity_opt_ext_stats(PlannerInfo *root,
+														List *clauses,
+														int varRelid,
+														JoinType jointype,
+														SpecialJoinInfo *sjinfo,
+														bool use_extended_stats);
+Selectivity clauselist_selectivity_or(PlannerInfo *root,
+									  List *clauses,
+									  int varRelid,
+									  JoinType jointype,
+									  SpecialJoinInfo *sjinfo,
+									  bool use_extended_stats);
 
 /* in path/costsize.c: */
 
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
new file mode 100644
index 61e6969..08d639e
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_select
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  bool is_or);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
new file mode 100644
index 50fce49..c9ed211
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_se
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
new file mode 100644
index 4c3edd2..abab5d6
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1113,6 +1113,12 @@ SELECT * FROM check_estimated_rows('SELE
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1173,13 +1179,6 @@ SELECT * FROM check_estimated_rows('SELE
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1477,6 +1476,110 @@ SELECT * FROM check_estimated_rows('SELE
          1 |      0
 (1 row)
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+         1 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       300 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         6 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       204 |    104
+(1 row)
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       102 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       402 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         2 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         8 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       308 |    104
+(1 row)
+
+DROP TABLE mcv_lists_partial;
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1506,12 +1609,36 @@ SELECT * FROM check_estimated_rows('SELE
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1528,12 +1655,36 @@ SELECT * FROM check_estimated_rows('SELE
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
new file mode 100644
index 9781e59..3ec6dda
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -561,6 +561,8 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -581,9 +583,6 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -777,6 +776,70 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+DROP TABLE mcv_lists_partial;
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -799,7 +862,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -809,7 +876,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;

#30

Tomas Vondra

tomas.vondra@enterprisedb.com

about 5 years ago

In reply to: Dean Rasheed (#29)

Re: Additional improvements to extended statistics

On 11/17/20 4:35 PM, Dean Rasheed wrote:

On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Here is an improved WIP version of the patch series, modified to address
the issue with repeatedly applying the extended statistics, as discussed
with Dean in this thread. It's a bit rough and not committable, but I
need some feedback so I'm posting it in this state.

As it stands, it doesn't compile if 0003 is applied, because it missed
one of the callers of clauselist_selectivity_simple(), but that's
easily fixed.

0001 is the original patch improving estimates of OR clauses

0002 adds thin wrappers for clause[list]_selectivity, with "internal"
functions allowing to specify whether to keep considering extended stats

0003 does the same for the "simple" functions

I've kept it like this to demonstrate that 0002 is not sufficient. In my
response from March 24 I wrote this:

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need
to add a flag (or _simple variant) to clause_selectivity, so that it
calls causelist_selectivity_simple_or.

But that's actually wrong, as 0002 shows (as it breaks a couple of
regression tests), because of the way we handle OR clauses. At the top
level, an OR-clause is actually just a single clause and it may get
passed to clauselist_selectivity_simple. So entirely disabling extended
stats for the "simple" functions would also mean disabling extended
stats for a large number of OR clauses. Which is clearly wrong.

So 0003 addresses that, by adding a flag to the two "simple" functions.
Ultimately, this should probably do the same thing as 0002 and add thin
wrappers, because the existing functions are part of the public API.

I agree that, taken together, these patches fix the
multiple-extended-stats-evaluation issue. However:

I think this has ended up with too many variants of these functions,
since we now have "_internal" and "_simple" variants, and you're
proposing adding more. The original purpose of the "_simple" variants
was to compute selectivities without looking at extended stats, and
now the "_internal" variants compute selectivities with an additional
"use_extended_stats" flag to control whether or not to look at
extended stats. Thus they're basically the same, and could be rolled
together.

Yeah, I agree there were far too many functions. Your patch looks much
cleaner / saner than the one I shared last week.

Additionally, it's worth noting that the "_simple" variants expose the
"estimatedclauses" bitmap as an argument, which IMO is a bit messy as
an API. All callers of the "_simple" functions outside of clausesel.c
actually pass in estimatedclauses=NULL, so it's possible to refactor
and get rid of that, turning estimatedclauses into a purely internal
variable.

Hmmm. I think there were two reasons for exposing the estimatedclauses
bitmap like that: (a) we used the function internally and (b) we wanted
to allow cases where the user code might do something with the bitmap.
The first item is not an issue - we can hide that. As for the second
item, my guess is it was unnecessary future-proofing - we don't know
about any use case that might need this, so +1 to get rid of it.

Also, it's quite messy that clauselist_selectivity_simple_or() needs
to be passed a Selectivity input (the final argument) that is the
selectivity of any already-estimated clauses, or the value to return
if no not-already-estimated clauses are found, and must be 0.0 when
called from the extended stats code.

True.

Attached is the kind of thing I had in mind (as a single patch, since
I don't think it's worth splitting up). This replaces the "_simple"
and "_internal" variants of these functions with "_opt_ext_stats"
variants whose signatures match the originals except for having the
single extra "use_extended_stats" boolean parameter. Additionally, the
"_simple" functions are merged into the originals (making them more
like they were in PG11) so that the "estimatedclauses" bitmap and
partial-OR-list Selectivity become internal details, no longer exposed
in the API.

Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
AFAICS we use "_internal" for similar functions.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31

Dean Rasheed

dean.a.rasheed@gmail.com

about 5 years ago

In reply to: Tomas Vondra (#30)

Re: Additional improvements to extended statistics

On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
AFAICS we use "_internal" for similar functions.

There's precedent for using "_opt_xxx" for function variants that add
an option to existing functions, but I agree that in this case it's a
bit of a mouthful. I don't think "_internal" is appropriate though,
since the clauselist function isn't internal. Perhaps using just
"_ext" would be OK.

Regards,
Dean

#32

Dean Rasheed

dean.a.rasheed@gmail.com

about 5 years ago

In reply to: Dean Rasheed (#31)

1 attachment(s)

Re: Additional improvements to extended statistics

On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
AFAICS we use "_internal" for similar functions.

I have been thinking about this some more. The one part of this that I
still wasn't happy with was the way that base frequencies were used to
compute the selectivity correction to apply. As noted in [1]/messages/by-id/CAEZATCX8u9bZzcWEzqA_t7f_OQHu2oxeTUGnFHNEOXnJo35AQg@mail.gmail.com, using
base frequencies in this way isn't really appropriate for clauses
combined using "OR". The reason is that an item's base frequency is
computed as the product of the per-column selectivities, so that (freq
- base_freq) is the right correction to apply for a set of clauses
combined with "AND", but it doesn't really work properly for clauses
combined with "OR". This is why a number of the estimates in the
regression tests end up being significant over-estimates.

I speculated in [1]/messages/by-id/CAEZATCX8u9bZzcWEzqA_t7f_OQHu2oxeTUGnFHNEOXnJo35AQg@mail.gmail.com that we might fix that by tracking which columns
of the match bitmap actually matched the clauses being estimated, and
then only use those base frequencies. Unfortunately that would also
mean changing the format of the stats that we store, and so would be a
rather invasive change.

It occurred to me though, that there is another, much more
straightforward way to do it. We can rewrite the "OR" clauses, and
turn them into "AND" clauses using the fact that

P(A OR B) = P(A) + P(B) - P(A AND B)

and then use the multivariate stats to estimate the P(A AND B) part in
the usual way.

Attached is the resulting patch doing it that way. The main change is
in the way that statext_mcv_clauselist_selectivity() works, combined
with a new function mcv_clause_selectivity_or() that does the
necessary MCV bitmap manipulations.

Doing it this way also means that clausesel.c doesn't need to export
clauselist_selectivity_or(), and the new set of exported functions
seem a bit neater now.

A handful of regression test results change, and in all cases except
one the new estimates are much better. One estimate is made worse, but
in that case we only have 2 sets of partial stats:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0

with stats on (a,b) and (c,d) so it's not surprising that combining (a
= 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the
old MV stats estimate was more down to chance in this case.

Regards,
Dean

[1]: /messages/by-id/CAEZATCX8u9bZzcWEzqA_t7f_OQHu2oxeTUGnFHNEOXnJo35AQg@mail.gmail.com

Attachments:

improve-estimation-of-OR-clauses-20201129.patchtext/x-patch; charset=US-ASCII; name=improve-estimation-of-OR-clauses-20201129.patchDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
new file mode 100644
index 37a735b..7d6e678
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -44,6 +44,12 @@ static void addRangeClause(RangeQueryCla
 						   bool varonleft, bool isLTsel, Selectivity s2);
 static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
 											   List *clauses);
+static Selectivity clauselist_selectivity_or(PlannerInfo *root,
+											 List *clauses,
+											 int varRelid,
+											 JoinType jointype,
+											 SpecialJoinInfo *sjinfo,
+											 bool use_extended_stats);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -61,64 +67,8 @@ static RelOptInfo *find_single_rel_for_c
  *
  * The basic approach is to apply extended statistics first, on as many
  * clauses as possible, in order to capture cross-column dependencies etc.
- * The remaining clauses are then estimated using regular statistics tracked
- * for individual columns.  This is done by simply passing the clauses to
- * clauselist_selectivity_simple.
- */
-Selectivity
-clauselist_selectivity(PlannerInfo *root,
-					   List *clauses,
-					   int varRelid,
-					   JoinType jointype,
-					   SpecialJoinInfo *sjinfo)
-{
-	Selectivity s1 = 1.0;
-	RelOptInfo *rel;
-	Bitmapset  *estimatedclauses = NULL;
-
-	/*
-	 * Determine if these clauses reference a single relation.  If so, and if
-	 * it has extended statistics, try to apply those.
-	 */
-	rel = find_single_rel_for_clauses(root, clauses);
-	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
-	{
-		/*
-		 * Estimate as many clauses as possible using extended statistics.
-		 *
-		 * 'estimatedclauses' tracks the 0-based list position index of
-		 * clauses that we've estimated using extended statistics, and that
-		 * should be ignored.
-		 */
-		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
-											 jointype, sjinfo, rel,
-											 &estimatedclauses);
-	}
-
-	/*
-	 * Apply normal selectivity estimates for the remaining clauses, passing
-	 * 'estimatedclauses' so that it skips already estimated ones.
-	 */
-	return s1 * clauselist_selectivity_simple(root, clauses, varRelid,
-											  jointype, sjinfo,
-											  estimatedclauses);
-}
-
-/*
- * clauselist_selectivity_simple -
- *	  Compute the selectivity of an implicitly-ANDed list of boolean
- *	  expression clauses.  The list can be empty, in which case 1.0
- *	  must be returned.  List elements may be either RestrictInfos
- *	  or bare expression clauses --- the former is preferred since
- *	  it allows caching of results.  The estimatedclauses bitmap tracks
- *	  clauses that have already been estimated by other means.
- *
- * See clause_selectivity() for the meaning of the additional parameters.
- *
- * Our basic approach is to take the product of the selectivities of the
- * subclauses.  However, that's only right if the subclauses have independent
- * probabilities, and in reality they are often NOT independent.  So,
- * we want to be smarter where we can.
+ * The remaining clauses are then estimated by taking the product of their
+ * selectivities.
  *
  * We also recognize "range queries", such as "x > 34 AND x < 42".  Clauses
  * are recognized as possible range query components if they are restriction
@@ -147,28 +97,63 @@ clauselist_selectivity(PlannerInfo *root
  * selectivity functions; perhaps some day we can generalize the approach.
  */
 Selectivity
-clauselist_selectivity_simple(PlannerInfo *root,
-							  List *clauses,
-							  int varRelid,
-							  JoinType jointype,
-							  SpecialJoinInfo *sjinfo,
-							  Bitmapset *estimatedclauses)
+clauselist_selectivity(PlannerInfo *root,
+					   List *clauses,
+					   int varRelid,
+					   JoinType jointype,
+					   SpecialJoinInfo *sjinfo)
+{
+	return clauselist_selectivity_ext(root, clauses, varRelid,
+									  jointype, sjinfo, true);
+}
+
+Selectivity
+clauselist_selectivity_ext(PlannerInfo *root,
+						   List *clauses,
+						   int varRelid,
+						   JoinType jointype,
+						   SpecialJoinInfo *sjinfo,
+						   bool use_extended_stats)
 {
 	Selectivity s1 = 1.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
 
 	/*
-	 * If there's exactly one clause (and it was not estimated yet), just go
-	 * directly to clause_selectivity(). None of what we might do below is
-	 * relevant.
+	 * If there's exactly one clause, just go directly to
+	 * clause_selectivity(). None of what we might do below is relevant.
 	 */
-	if (list_length(clauses) == 1 && bms_is_empty(estimatedclauses))
-		return clause_selectivity(root, (Node *) linitial(clauses),
-								  varRelid, jointype, sjinfo);
+	if (list_length(clauses) == 1)
+		return clause_selectivity_ext(root, (Node *) linitial(clauses),
+									  varRelid, jointype, sjinfo,
+									  use_extended_stats);
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 */
+		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
+											 jointype, sjinfo, rel,
+											 &estimatedclauses, false);
+	}
 
 	/*
+	 * Apply normal selectivity estimates for remaining clauses. We'll be
+	 * careful to skip any clauses which were already estimated above.
+	 *
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
 	 * an rqlist entry.
@@ -190,7 +175,8 @@ clauselist_selectivity_simple(PlannerInf
 			continue;
 
 		/* Always compute the selectivity using clause_selectivity */
-		s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
+		s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,
+									use_extended_stats);
 
 		/*
 		 * Check for being passed a RestrictInfo.
@@ -351,6 +337,87 @@ clauselist_selectivity_simple(PlannerInf
 }
 
 /*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated as if they were independent.
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo,
+						  bool use_extended_stats)
+{
+	Selectivity s1 = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' tracks the 0-based list position index of
+		 * clauses that we've estimated using extended statistics, and that
+		 * should be ignored.
+		 *
+		 * XXX We can't multiply with current value, because for OR clauses we
+		 * start with 0.0, so we simply assign to 's' directly.
+		 */
+		s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+											jointype, sjinfo, rel,
+											&estimatedclauses, true);
+	}
+
+	/*
+	 * Estimate the remaining clauses.
+	 *
+	 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to account
+	 * for the probable overlap of selected tuple sets.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/*
+		 * Skip this clause if it's already been estimated by some other
+		 * statistics above.
+		 */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity_ext(root, (Node *) lfirst(lc), varRelid,
+									jointype, sjinfo, use_extended_stats);
+
+		s1 = s1 + s2 - s1 * s2;
+	}
+
+	return s1;
+}
+
+/*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
  * Here is where we try to match up pairs of range-query clauses
@@ -602,6 +669,18 @@ clause_selectivity(PlannerInfo *root,
 				   JoinType jointype,
 				   SpecialJoinInfo *sjinfo)
 {
+	return clause_selectivity_ext(root, clause, varRelid,
+								  jointype, sjinfo, true);
+}
+
+Selectivity
+clause_selectivity_ext(PlannerInfo *root,
+					   Node *clause,
+					   int varRelid,
+					   JoinType jointype,
+					   SpecialJoinInfo *sjinfo,
+					   bool use_extended_stats)
+{
 	Selectivity s1 = 0.5;		/* default for any unhandled clause type */
 	RestrictInfo *rinfo = NULL;
 	bool		cacheable = false;
@@ -716,42 +795,35 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_notclause(clause))
 	{
 		/* inverse of the selectivity of the underlying clause */
-		s1 = 1.0 - clause_selectivity(root,
-									  (Node *) get_notclausearg((Expr *) clause),
-									  varRelid,
-									  jointype,
-									  sjinfo);
+		s1 = 1.0 - clause_selectivity_ext(root,
+										  (Node *) get_notclausearg((Expr *) clause),
+										  varRelid,
+										  jointype,
+										  sjinfo,
+										  use_extended_stats);
 	}
 	else if (is_andclause(clause))
 	{
 		/* share code with clauselist_selectivity() */
-		s1 = clauselist_selectivity(root,
-									((BoolExpr *) clause)->args,
-									varRelid,
-									jointype,
-									sjinfo);
+		s1 = clauselist_selectivity_ext(root,
+										((BoolExpr *) clause)->args,
+										varRelid,
+										jointype,
+										sjinfo,
+										use_extended_stats);
 	}
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo,
+									   use_extended_stats);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
@@ -852,20 +924,22 @@ clause_selectivity(PlannerInfo *root,
 	else if (IsA(clause, RelabelType))
 	{
 		/* Not sure this case is needed, but it can't hurt */
-		s1 = clause_selectivity(root,
-								(Node *) ((RelabelType *) clause)->arg,
-								varRelid,
-								jointype,
-								sjinfo);
+		s1 = clause_selectivity_ext(root,
+									(Node *) ((RelabelType *) clause)->arg,
+									varRelid,
+									jointype,
+									sjinfo,
+									use_extended_stats);
 	}
 	else if (IsA(clause, CoerceToDomain))
 	{
 		/* Not sure this case is needed, but it can't hurt */
-		s1 = clause_selectivity(root,
-								(Node *) ((CoerceToDomain *) clause)->arg,
-								varRelid,
-								jointype,
-								sjinfo);
+		s1 = clause_selectivity_ext(root,
+									(Node *) ((CoerceToDomain *) clause)->arg,
+									varRelid,
+									jointype,
+									sjinfo,
+									use_extended_stats);
 	}
 	else
 	{
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
new file mode 100644
index d950b4e..b1abcde
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -1073,8 +1073,8 @@ clauselist_apply_dependencies(PlannerInf
 			}
 		}
 
-		simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid,
-												   jointype, sjinfo, NULL);
+		simple_sel = clauselist_selectivity_ext(root, attr_clauses, varRelid,
+												jointype, sjinfo, false);
 		attr_sel[attidx++] = simple_sel;
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
new file mode 100644
index 3632692..3941657
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity sel = 1.0;
+	Selectivity sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1371,40 +1372,111 @@ statext_mcv_clauselist_selectivity(Plann
 			listidx++;
 		}
 
-		/*
-		 * First compute "simple" selectivity, i.e. without the extended
-		 * statistics, and essentially assuming independence of the
-		 * columns/clauses. We'll then use the various selectivities computed
-		 * from MCV list to improve it.
-		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												   jointype, sjinfo, NULL);
+		if (is_or)
+		{
+			bool	   *or_matches = NULL;
+			MCVList    *mcv_list;
 
-		/*
-		 * Now compute the multi-column estimate from the MCV list, along with
-		 * the other selectivities (base & total selectivity).
-		 */
-		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
-											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+			/* Load the MCV list stored in the statistics object */
+			mcv_list = statext_mcv_load(stat->statOid);
 
-		/* Estimated selectivity of values not covered by MCV matches */
-		other_sel = simple_sel - mcv_basesel;
-		CLAMP_PROBABILITY(other_sel);
+			/*
+			 * Compute the selectivity of the OR'ed list of clauses by
+			 * estimating each in turn and combining them using the formula
+			 * P(A OR B) = P(A) + P(B) - P(A AND B).  This allows us to use
+			 * the multivariate MCV stats to better estimate P(A AND B).
+			 *
+			 * Each time we iterate this formula, the clause "A" above is
+			 * equal to all the clauses processed so far, combined with "OR".
+			 * This is then combined using "AND" with the next clause, as
+			 * described in mcv_clause_selectivity_or().
+			 */
+			foreach(l, stat_clauses)
+			{
+				Node	   *clause = (Node *) lfirst(l);
+				Selectivity clause_sel;
 
-		/* The non-MCV selectivity can't exceed the 1 - mcv_totalsel. */
-		if (other_sel > 1.0 - mcv_totalsel)
-			other_sel = 1.0 - mcv_totalsel;
+				/* Compute P(B) --- the selectivity of the next clause */
+				clause_sel = clause_selectivity_ext(root, clause, varRelid,
+													jointype, sjinfo, false);
 
-		/*
-		 * Overall selectivity is the combination of MCV and non-MCV
-		 * estimates.
-		 */
-		stat_sel = mcv_sel + other_sel;
-		CLAMP_PROBABILITY(stat_sel);
+				/* Initial estimate for P(A AND B), assuming independence */
+				simple_sel = sel * clause_sel;
 
-		/* Factor the estimate from this MCV to the overall estimate. */
-		sel *= stat_sel;
+				/*
+				 * Now compute the multi-column estimate of this from the MCV
+				 * list, along with the base and total selectivities.
+				 */
+				mcv_sel = mcv_clause_selectivity_or(root, stat, mcv_list,
+													clause, &or_matches,
+													&mcv_basesel,
+													&mcv_totalsel);
+
+				/*
+				 * Estimated selectivity of (A AND B) values not covered by
+				 * MCV matches.
+				 */
+				other_sel = simple_sel - mcv_basesel;
+				CLAMP_PROBABILITY(other_sel);
+
+				/*
+				 * The non-MCV selectivity can't exceed the 1 - mcv_totalsel.
+				 */
+				if (other_sel > 1.0 - mcv_totalsel)
+					other_sel = 1.0 - mcv_totalsel;
+
+				/*
+				 * Overall selectivity P(A AND B) is the combination of MCV
+				 * and non-MCV estimates.
+				 */
+				stat_sel = mcv_sel + other_sel;
+				CLAMP_PROBABILITY(stat_sel);
+
+				/* Factor this into the overall result */
+				sel = sel + clause_sel - stat_sel;
+				CLAMP_PROBABILITY(sel);
+			}
+		}
+		else
+		{
+			/*
+			 * We have an implicitly AND'ed list of clauses.
+			 *
+			 * First compute the "simple" selectivity, i.e. without any
+			 * extended statistics, and essentially assuming independence of
+			 * the columns/clauses.
+			 */
+			simple_sel = clauselist_selectivity_ext(root, stat_clauses,
+													varRelid, jointype,
+													sjinfo, false);
+
+			/*
+			 * Now compute the multi-column estimate from the MCV list, along
+			 * with the other selectivities (base & total selectivity).
+			 */
+			mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses,
+												 varRelid, jointype, sjinfo,
+												 rel, &mcv_basesel,
+												 &mcv_totalsel);
+
+			/* Estimated selectivity of values not covered by MCV matches */
+			other_sel = simple_sel - mcv_basesel;
+			CLAMP_PROBABILITY(other_sel);
+
+			/* The non-MCV selectivity can't exceed the 1 - mcv_totalsel. */
+			if (other_sel > 1.0 - mcv_totalsel)
+				other_sel = 1.0 - mcv_totalsel;
+
+			/*
+			 * Overall selectivity is the combination of MCV and non-MCV
+			 * estimates.
+			 */
+			stat_sel = mcv_sel + other_sel;
+			CLAMP_PROBABILITY(stat_sel);
+
+			/* Factor this into the overall result */
+			sel *= stat_sel;
+		}
 	}
 
 	return sel;
@@ -1417,13 +1489,21 @@ statext_mcv_clauselist_selectivity(Plann
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
new file mode 100644
index 6a262f1..3913538
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1936,3 +1936,76 @@ mcv_clauselist_selectivity(PlannerInfo *
 
 	return s;
 }
+
+
+/*
+ * mcv_clause_selectivity_or
+ *		Return a correction to the selectivity estimate for a clause that is
+ *		included in an OR'ed list of clauses.
+ *
+ * This is called for each clause in the OR'ed list of clauses, using the
+ * following algorithm:
+ *
+ * Suppose P[n] = P(C[1] OR C[2] OR ... OR C[n]) is the combined selectivity
+ * of the first n clauses in the list.  Then the combined selectivity taking
+ * into account the next clause C[n+1] can be written as
+ *
+ *		P[n+1] = P[n] + P(C[n+1]) - P((C[1] OR ... OR C[n]) AND C[n+1])
+ *
+ * To estimate the final "correction" term above, we track the match bitmap
+ * for the OR'ed list of clauses examined so far and examine its intersection
+ * with the match bitmap for the (n+1)'th clause.
+ *
+ * The return value is the sum of MCV item frequencies for the match bitmap
+ * intersection corresponding to the correction term above.  We also return
+ * the total selectivity of all the MCV items (not just the matching ones),
+ * and the sum of base frequencies computed on the assumption of independence,
+ * in the same way as mcv_clauselist_selectivity().  This allows the
+ * correction term above to be estimated using both per-column statistics and
+ * multivariate MCV statistics, in the same way as we do for an implicitly
+ * AND'ed list of clauses.
+ *
+ * The parameter "or_matches" is an in/out parameter tracking the match bitmap
+ * for the clauses examined so far.  The caller is expected to set it to NULL
+ * the first time it calls this function.
+ */
+Selectivity
+mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
+						  MCVList *mcv, Node *clause, bool **or_matches,
+						  Selectivity *basesel, Selectivity *totalsel)
+{
+	bool	   *new_matches = NULL;
+	int			i;
+	Selectivity s = 0.0;
+
+	/* build the OR-matches bitmap, if not built already */
+	if (*or_matches == NULL)
+		*or_matches = palloc0(sizeof(bool) * mcv->nitems);
+
+	/* build the match bitmap for the new clause */
+	new_matches = mcv_get_match_bitmap(root, list_make1(clause), stat->keys,
+									   mcv, false);
+
+	/*
+	 * Sum the frequencies for all the matching MCV items in the intersection
+	 * of the two match bitmaps and update or_matches as described above.
+	 */
+	*basesel = 0.0;
+	*totalsel = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		*totalsel += mcv->items[i].frequency;
+
+		if ((*or_matches)[i] && new_matches[i])
+		{
+			*basesel += mcv->items[i].base_frequency;
+			s += mcv->items[i].frequency;
+		}
+
+		(*or_matches)[i] = (*or_matches)[i] || new_matches[i];
+	}
+
+	pfree(new_matches);
+
+	return s;
+}
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
new file mode 100644
index 3e41710..dea0e73
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -58,17 +58,23 @@ extern Selectivity clause_selectivity(Pl
 									  int varRelid,
 									  JoinType jointype,
 									  SpecialJoinInfo *sjinfo);
-extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
-												 List *clauses,
-												 int varRelid,
-												 JoinType jointype,
-												 SpecialJoinInfo *sjinfo,
-												 Bitmapset *estimatedclauses);
+extern Selectivity clause_selectivity_ext(PlannerInfo *root,
+										  Node *clause,
+										  int varRelid,
+										  JoinType jointype,
+										  SpecialJoinInfo *sjinfo,
+										  bool use_extended_stats);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
 										  JoinType jointype,
 										  SpecialJoinInfo *sjinfo);
+extern Selectivity clauselist_selectivity_ext(PlannerInfo *root,
+											  List *clauses,
+											  int varRelid,
+											  JoinType jointype,
+											  SpecialJoinInfo *sjinfo,
+											  bool use_extended_stats);
 
 /* in path/costsize.c: */
 
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
new file mode 100644
index 61e6969..242e470
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -109,4 +109,12 @@ extern Selectivity mcv_clauselist_select
 											  Selectivity *basesel,
 											  Selectivity *totalsel);
 
+extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
+											 StatisticExtInfo *stat,
+											 MCVList *mcv,
+											 Node *clause,
+											 bool **or_matches,
+											 Selectivity *basesel,
+											 Selectivity *totalsel);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
new file mode 100644
index 50fce49..c9ed211
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_se
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
new file mode 100644
index 4c3edd2..07b6ac5
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1113,6 +1113,12 @@ SELECT * FROM check_estimated_rows('SELE
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1173,13 +1179,6 @@ SELECT * FROM check_estimated_rows('SELE
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1477,6 +1476,110 @@ SELECT * FROM check_estimated_rows('SELE
          1 |      0
 (1 row)
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+         1 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       300 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         6 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       204 |    104
+(1 row)
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       102 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+        98 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         2 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         2 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       102 |    104
+(1 row)
+
+DROP TABLE mcv_lists_partial;
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1506,12 +1609,36 @@ SELECT * FROM check_estimated_rows('SELE
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1528,12 +1655,36 @@ SELECT * FROM check_estimated_rows('SELE
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1714 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
new file mode 100644
index 9781e59..3ec6dda
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -561,6 +561,8 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -581,9 +583,6 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -777,6 +776,70 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+DROP TABLE mcv_lists_partial;
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -799,7 +862,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -809,7 +876,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;

#33

Tomas Vondra

tomas.vondra@enterprisedb.com

about 5 years ago

In reply to: Dean Rasheed (#32)

Re: Additional improvements to extended statistics

On 11/29/20 3:57 PM, Dean Rasheed wrote:

On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
AFAICS we use "_internal" for similar functions.

I have been thinking about this some more. The one part of this that I
still wasn't happy with was the way that base frequencies were used to
compute the selectivity correction to apply. As noted in [1], using
base frequencies in this way isn't really appropriate for clauses
combined using "OR". The reason is that an item's base frequency is
computed as the product of the per-column selectivities, so that (freq
- base_freq) is the right correction to apply for a set of clauses
combined with "AND", but it doesn't really work properly for clauses
combined with "OR". This is why a number of the estimates in the
regression tests end up being significant over-estimates.

I speculated in [1] that we might fix that by tracking which columns
of the match bitmap actually matched the clauses being estimated, and
then only use those base frequencies. Unfortunately that would also
mean changing the format of the stats that we store, and so would be a
rather invasive change.

It occurred to me though, that there is another, much more
straightforward way to do it. We can rewrite the "OR" clauses, and
turn them into "AND" clauses using the fact that

P(A OR B) = P(A) + P(B) - P(A AND B)

and then use the multivariate stats to estimate the P(A AND B) part in
the usual way.

OK, that seems quite reasonable.

Attached is the resulting patch doing it that way. The main change is
in the way that statext_mcv_clauselist_selectivity() works, combined
with a new function mcv_clause_selectivity_or() that does the
necessary MCV bitmap manipulations.

Doing it this way also means that clausesel.c doesn't need to export
clauselist_selectivity_or(), and the new set of exported functions
seem a bit neater now.

Nice. I agree this looks way better than the version I hacked together.

I wonder how much of the comment before clauselist_selectivity should
move to clauselist_selectivity_ext - it does talk about range clauses
and so on, but clauselist_selectivity does not really deal with that.
But maybe that's just an implementation detail and it's better to keep
the comment the way it is.

I noticed this outdated comment:

/* Always compute the selectivity using clause_selectivity */
s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,

Also, the comment at clauselist_selectivity_or seems to not follow the
usual pattern, which I think is

/*
* function name
* short one-sentence description
*
* ... longer description ...
*/

Those are fairly minor issues. I don't have any deeper objections, and
it seems committable. Do you plan to do that sometime soon?

A handful of regression test results change, and in all cases except
one the new estimates are much better. One estimate is made worse, but
in that case we only have 2 sets of partial stats:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0

with stats on (a,b) and (c,d) so it's not surprising that combining (a
= 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the
old MV stats estimate was more down to chance in this case.

Yeah, that's quite possible - we're multiplying two estimates, but
there's a clear correlation. So it was mostly luck we had over-estimated
the clauses before, which gave us higher product and thus accidentally
better overall estimate.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34

Dean Rasheed

dean.a.rasheed@gmail.com

about 5 years ago

In reply to: Tomas Vondra (#33)

Re: Additional improvements to extended statistics

On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Those are fairly minor issues. I don't have any deeper objections, and
it seems committable. Do you plan to do that sometime soon?

OK, I've updated the patch status in the CF app, and I should be able
to push it in the next day or so.

Regards,
Dean

#35

Tomas Vondra

tomas.vondra@enterprisedb.com

about 5 years ago

In reply to: Dean Rasheed (#34)

Re: Additional improvements to extended statistics

On 12/1/20 9:15 AM, Dean Rasheed wrote:

On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Those are fairly minor issues. I don't have any deeper objections, and
it seems committable. Do you plan to do that sometime soon?

OK, I've updated the patch status in the CF app, and I should be able
to push it in the next day or so.

Cool, thanks.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#36

Dean Rasheed

dean.a.rasheed@gmail.com

about 5 years ago

In reply to: Tomas Vondra (#33)

1 attachment(s)

Re: Additional improvements to extended statistics

On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I wonder how much of the comment before clauselist_selectivity should
move to clauselist_selectivity_ext - it does talk about range clauses
and so on, but clauselist_selectivity does not really deal with that.
But maybe that's just an implementation detail and it's better to keep
the comment the way it is.

I think it's better to keep it the way it is, so that the entirety of
what clauselist_selectivity() does (via clauselist_selectivity_ext())
can be read in one place, but I have added separate comments for the
new "_ext" functions to explain how they differ. That matches similar
examples elsewhere.

I noticed this outdated comment:

/* Always compute the selectivity using clause_selectivity */
s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,

Updated.

Also, the comment at clauselist_selectivity_or seems to not follow the
usual pattern, which I think is

/*
* function name
* short one-sentence description
*
* ... longer description ...
*/

Hmm, it seems OK to me. The first part is basically copied from
clauselist_selectivity(). The "longer description" doesn't really have
much more to say because it's much simpler than
clauselist_selectivity(), but it seems neater to keep the two roughly
in sync.

I've been hacking on this a bit more and attached is an updated
(hopefully final) version with some other comment improvements and
also a couple of other tweaks:

The previous version had duplicated code blocks that implemented the
same MCV-correction algorithm using simple_sel, mcv_sel, base_sel,
other_sel and total_sel, which was quite messy. So I refactored that
into a new function mcv_combine_selectivities(). About half the
comments from statext_mcv_clauselist_selectivity() then move over to
mcv_combine_selectivities(). I also updated the comments for
mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to
explain how their outputs are expected to be used by
mcv_combine_selectivities(). That hopefully makes for a clean
separation of concerns, and makes it easier to tweak the way MCV stats
are applied on top of simple stats, if someone thinks of a better
approach in the future.

In the previous version, for an ORed list of clauses, the MCV
correction was only applied to the overlaps between clauses. That's OK
as long as each clause only refers to a single column, since the
per-column statistics ought to be the best way to estimate each
individual clause in that case. However, if the individual clauses
refer to more than one column, I think the MCV correction should be
applied to each individual clause as well as to the overlaps. That
turns out to be pretty straightforward, since we're already doing all
the hard work of computing the match bitmap for each clause. The sort
of queries I had in mind were things like this:

WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)

I added a new test case along those lines and the new estimates are
much better than they are without this patch, but not for the reason I
thought --- the old code consistently over-estimated queries like that
because it actually applied the MCV correction twice (once while
processing each AND list, via clause_selectivity(), called from
clauselist_selectivity_simple(), and once for the top-level OR clause,
contained in a single-element implicitly-ANDed list). The way the new
code is structured avoids any kind of double application of extended
stats, producing a much better estimate, which is good.

However, the new code doesn't apply the extended stats directly using
clauselist_selectivity_or() for this kind of query because there are
no RestrictInfos for the nested AND clauses, so
find_single_rel_for_clauses() (and similarly
statext_is_compatible_clause()) regards those clauses as not
compatible with extended stats. So what ends up happening is that
extended stats are used only when we descend down to the two AND
clauses, and their results are combined using the original "s1 + s2 -
s1 * s2" formula. That actually works OK in this case, because there
is no overlap between the two AND clauses, but it wouldn't work so
well if there was.

I'm pretty sure that can be fixed by teaching
find_single_rel_for_clauses() and statext_is_compatible_clause() to
handle BoolExpr clauses, looking for RestrictInfos underneath them,
but I think that should be left for a follow-in patch. I have left a
regression test in place, whose estimates ought to be improved by such
a fix.

The upshot of all that is that the new code that applies the MCV
correction to the individual clauses in an ORed list doesn't help with
queries like the one above at the moment, and it's not obvious whether
it is currently reachable, but I think it's worth leaving in because
it seems more principled, and makes that code more future-proof. I
also think it's neater because now the signature of
mcv_clause_selectivity_or() is more natural --- it's primary return
value is now the clause's MCV selectivity, as suggested by the
function's name, rather than the overlap selectivity that the previous
version was returning. Also, after your "Var Op Var" patch is applied,
I think it would be possible to construct queries that would benefit
from this, so it would be good to get that committed too.

Barring any further comments, I'll push this sometime soon.

Regards,
Dean

Attachments:

improve-estimation-of-OR-clauses-20201202.patchtext/x-patch; charset=US-ASCII; name=improve-estimation-of-OR-clauses-20201202.patchDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
new file mode 100644
index 37a735b..b88b29e
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -44,6 +44,12 @@ static void addRangeClause(RangeQueryCla
 						   bool varonleft, bool isLTsel, Selectivity s2);
 static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
 											   List *clauses);
+static Selectivity clauselist_selectivity_or(PlannerInfo *root,
+											 List *clauses,
+											 int varRelid,
+											 JoinType jointype,
+											 SpecialJoinInfo *sjinfo,
+											 bool use_extended_stats);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -61,64 +67,10 @@ static RelOptInfo *find_single_rel_for_c
  *
  * The basic approach is to apply extended statistics first, on as many
  * clauses as possible, in order to capture cross-column dependencies etc.
- * The remaining clauses are then estimated using regular statistics tracked
- * for individual columns.  This is done by simply passing the clauses to
- * clauselist_selectivity_simple.
- */
-Selectivity
-clauselist_selectivity(PlannerInfo *root,
-					   List *clauses,
-					   int varRelid,
-					   JoinType jointype,
-					   SpecialJoinInfo *sjinfo)
-{
-	Selectivity s1 = 1.0;
-	RelOptInfo *rel;
-	Bitmapset  *estimatedclauses = NULL;
-
-	/*
-	 * Determine if these clauses reference a single relation.  If so, and if
-	 * it has extended statistics, try to apply those.
-	 */
-	rel = find_single_rel_for_clauses(root, clauses);
-	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
-	{
-		/*
-		 * Estimate as many clauses as possible using extended statistics.
-		 *
-		 * 'estimatedclauses' tracks the 0-based list position index of
-		 * clauses that we've estimated using extended statistics, and that
-		 * should be ignored.
-		 */
-		s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
-											 jointype, sjinfo, rel,
-											 &estimatedclauses);
-	}
-
-	/*
-	 * Apply normal selectivity estimates for the remaining clauses, passing
-	 * 'estimatedclauses' so that it skips already estimated ones.
-	 */
-	return s1 * clauselist_selectivity_simple(root, clauses, varRelid,
-											  jointype, sjinfo,
-											  estimatedclauses);
-}
-
-/*
- * clauselist_selectivity_simple -
- *	  Compute the selectivity of an implicitly-ANDed list of boolean
- *	  expression clauses.  The list can be empty, in which case 1.0
- *	  must be returned.  List elements may be either RestrictInfos
- *	  or bare expression clauses --- the former is preferred since
- *	  it allows caching of results.  The estimatedclauses bitmap tracks
- *	  clauses that have already been estimated by other means.
- *
- * See clause_selectivity() for the meaning of the additional parameters.
- *
- * Our basic approach is to take the product of the selectivities of the
- * subclauses.  However, that's only right if the subclauses have independent
- * probabilities, and in reality they are often NOT independent.  So,
- * we want to be smarter where we can.
+ * The remaining clauses are then estimated by taking the product of their
+ * selectivities, but that's only right if they have independent
+ * probabilities, and in reality they are often NOT independent even if they
+ * only refer to a single column.  So, we want to be smarter where we can.
  *
  * We also recognize "range queries", such as "x > 34 AND x < 42".  Clauses
  * are recognized as possible range query components if they are restriction
@@ -147,28 +99,68 @@ clauselist_selectivity(PlannerInfo *root
  * selectivity functions; perhaps some day we can generalize the approach.
  */
 Selectivity
-clauselist_selectivity_simple(PlannerInfo *root,
-							  List *clauses,
-							  int varRelid,
-							  JoinType jointype,
-							  SpecialJoinInfo *sjinfo,
-							  Bitmapset *estimatedclauses)
+clauselist_selectivity(PlannerInfo *root,
+					   List *clauses,
+					   int varRelid,
+					   JoinType jointype,
+					   SpecialJoinInfo *sjinfo)
+{
+	return clauselist_selectivity_ext(root, clauses, varRelid,
+									  jointype, sjinfo, true);
+}
+
+/*
+ * clauselist_selectivity_ext -
+ *	  Extended version of clauselist_selectivity().  If "use_extended_stats"
+ *	  is false, all extended statistics will be ignored, and only per-column
+ *	  statistics will be used.
+ */
+Selectivity
+clauselist_selectivity_ext(PlannerInfo *root,
+						   List *clauses,
+						   int varRelid,
+						   JoinType jointype,
+						   SpecialJoinInfo *sjinfo,
+						   bool use_extended_stats)
 {
 	Selectivity s1 = 1.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
 
 	/*
-	 * If there's exactly one clause (and it was not estimated yet), just go
-	 * directly to clause_selectivity(). None of what we might do below is
-	 * relevant.
+	 * If there's exactly one clause, just go directly to
+	 * clause_selectivity_ext(). None of what we might do below is relevant.
 	 */
-	if (list_length(clauses) == 1 && bms_is_empty(estimatedclauses))
-		return clause_selectivity(root, (Node *) linitial(clauses),
-								  varRelid, jointype, sjinfo);
+	if (list_length(clauses) == 1)
+		return clause_selectivity_ext(root, (Node *) linitial(clauses),
+									  varRelid, jointype, sjinfo,
+									  use_extended_stats);
 
 	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' is populated with the 0-based list position
+		 * index of clauses estimated here, and that should be ignored below.
+		 */
+		s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+											jointype, sjinfo, rel,
+											&estimatedclauses, false);
+	}
+
+	/*
+	 * Apply normal selectivity estimates for remaining clauses. We'll be
+	 * careful to skip any clauses which were already estimated above.
+	 *
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
 	 * an rqlist entry.
@@ -189,8 +181,9 @@ clauselist_selectivity_simple(PlannerInf
 		if (bms_is_member(listidx, estimatedclauses))
 			continue;
 
-		/* Always compute the selectivity using clause_selectivity */
-		s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
+		/* Compute the selectivity of this clause in isolation */
+		s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,
+									use_extended_stats);
 
 		/*
 		 * Check for being passed a RestrictInfo.
@@ -351,6 +344,83 @@ clauselist_selectivity_simple(PlannerInf
 }
 
 /*
+ * clauselist_selectivity_or -
+ *	  Compute the selectivity of an implicitly-ORed list of boolean
+ *	  expression clauses.  The list can be empty, in which case 0.0
+ *	  must be returned.  List elements may be either RestrictInfos
+ *	  or bare expression clauses --- the former is preferred since
+ *	  it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated as if they were independent.
+ */
+static Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+						  List *clauses,
+						  int varRelid,
+						  JoinType jointype,
+						  SpecialJoinInfo *sjinfo,
+						  bool use_extended_stats)
+{
+	Selectivity s1 = 0.0;
+	RelOptInfo *rel;
+	Bitmapset  *estimatedclauses = NULL;
+	ListCell   *lc;
+	int			listidx;
+
+	/*
+	 * Determine if these clauses reference a single relation.  If so, and if
+	 * it has extended statistics, try to apply those.
+	 */
+	rel = find_single_rel_for_clauses(root, clauses);
+	if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		/*
+		 * Estimate as many clauses as possible using extended statistics.
+		 *
+		 * 'estimatedclauses' is populated with the 0-based list position
+		 * index of clauses estimated here, and that should be ignored below.
+		 */
+		s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+											jointype, sjinfo, rel,
+											&estimatedclauses, true);
+	}
+
+	/*
+	 * Estimate the remaining clauses as if they were independent.
+	 *
+	 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to account
+	 * for the probable overlap of selected tuple sets.
+	 *
+	 * XXX is this too conservative?
+	 */
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		Selectivity s2;
+
+		listidx++;
+
+		/*
+		 * Skip this clause if it's already been estimated by some other
+		 * statistics above.
+		 */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		s2 = clause_selectivity_ext(root, (Node *) lfirst(lc), varRelid,
+									jointype, sjinfo, use_extended_stats);
+
+		s1 = s1 + s2 - s1 * s2;
+	}
+
+	return s1;
+}
+
+/*
  * addRangeClause --- add a new range clause for clauselist_selectivity
  *
  * Here is where we try to match up pairs of range-query clauses
@@ -602,6 +672,24 @@ clause_selectivity(PlannerInfo *root,
 				   JoinType jointype,
 				   SpecialJoinInfo *sjinfo)
 {
+	return clause_selectivity_ext(root, clause, varRelid,
+								  jointype, sjinfo, true);
+}
+
+/*
+ * clause_selectivity_ext -
+ *	  Extended version of clause_selectivity().  If "use_extended_stats" is
+ *	  false, all extended statistics will be ignored, and only per-column
+ *	  statistics will be used.
+ */
+Selectivity
+clause_selectivity_ext(PlannerInfo *root,
+					   Node *clause,
+					   int varRelid,
+					   JoinType jointype,
+					   SpecialJoinInfo *sjinfo,
+					   bool use_extended_stats)
+{
 	Selectivity s1 = 0.5;		/* default for any unhandled clause type */
 	RestrictInfo *rinfo = NULL;
 	bool		cacheable = false;
@@ -716,42 +804,35 @@ clause_selectivity(PlannerInfo *root,
 	else if (is_notclause(clause))
 	{
 		/* inverse of the selectivity of the underlying clause */
-		s1 = 1.0 - clause_selectivity(root,
-									  (Node *) get_notclausearg((Expr *) clause),
-									  varRelid,
-									  jointype,
-									  sjinfo);
+		s1 = 1.0 - clause_selectivity_ext(root,
+										  (Node *) get_notclausearg((Expr *) clause),
+										  varRelid,
+										  jointype,
+										  sjinfo,
+										  use_extended_stats);
 	}
 	else if (is_andclause(clause))
 	{
 		/* share code with clauselist_selectivity() */
-		s1 = clauselist_selectivity(root,
-									((BoolExpr *) clause)->args,
-									varRelid,
-									jointype,
-									sjinfo);
+		s1 = clauselist_selectivity_ext(root,
+										((BoolExpr *) clause)->args,
+										varRelid,
+										jointype,
+										sjinfo,
+										use_extended_stats);
 	}
 	else if (is_orclause(clause))
 	{
 		/*
-		 * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
-		 * account for the probable overlap of selected tuple sets.
-		 *
-		 * XXX is this too conservative?
+		 * Almost the same thing as clauselist_selectivity, but with the
+		 * clauses connected by OR.
 		 */
-		ListCell   *arg;
-
-		s1 = 0.0;
-		foreach(arg, ((BoolExpr *) clause)->args)
-		{
-			Selectivity s2 = clause_selectivity(root,
-												(Node *) lfirst(arg),
-												varRelid,
-												jointype,
-												sjinfo);
-
-			s1 = s1 + s2 - s1 * s2;
-		}
+		s1 = clauselist_selectivity_or(root,
+									   ((BoolExpr *) clause)->args,
+									   varRelid,
+									   jointype,
+									   sjinfo,
+									   use_extended_stats);
 	}
 	else if (is_opclause(clause) || IsA(clause, DistinctExpr))
 	{
@@ -852,20 +933,22 @@ clause_selectivity(PlannerInfo *root,
 	else if (IsA(clause, RelabelType))
 	{
 		/* Not sure this case is needed, but it can't hurt */
-		s1 = clause_selectivity(root,
-								(Node *) ((RelabelType *) clause)->arg,
-								varRelid,
-								jointype,
-								sjinfo);
+		s1 = clause_selectivity_ext(root,
+									(Node *) ((RelabelType *) clause)->arg,
+									varRelid,
+									jointype,
+									sjinfo,
+									use_extended_stats);
 	}
 	else if (IsA(clause, CoerceToDomain))
 	{
 		/* Not sure this case is needed, but it can't hurt */
-		s1 = clause_selectivity(root,
-								(Node *) ((CoerceToDomain *) clause)->arg,
-								varRelid,
-								jointype,
-								sjinfo);
+		s1 = clause_selectivity_ext(root,
+									(Node *) ((CoerceToDomain *) clause)->arg,
+									varRelid,
+									jointype,
+									sjinfo,
+									use_extended_stats);
 	}
 	else
 	{
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
new file mode 100644
index d950b4e..b1abcde
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -1073,8 +1073,8 @@ clauselist_apply_dependencies(PlannerInf
 			}
 		}
 
-		simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid,
-												   jointype, sjinfo, NULL);
+		simple_sel = clauselist_selectivity_ext(root, attr_clauses, varRelid,
+												jointype, sjinfo, false);
 		attr_sel[attidx++] = simple_sel;
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
new file mode 100644
index 3632692..8d3cd09
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1239,10 +1239,10 @@ statext_is_compatible_clause(PlannerInfo
  * One of the main challenges with using MCV lists is how to extrapolate the
  * estimate to the data not covered by the MCV list. To do that, we compute
  * not only the "MCV selectivity" (selectivities for MCV items matching the
- * supplied clauses), but also a couple of derived selectivities:
+ * supplied clauses), but also the following related selectivities:
  *
- * - simple selectivity:  Computed without extended statistic, i.e. as if the
- * columns/clauses were independent
+ * - simple selectivity:  Computed without extended statistics, i.e. as if the
+ * columns/clauses were independent.
  *
  * - base selectivity:  Similar to simple selectivity, but is computed using
  * the extended statistic by adding up the base frequencies (that we compute
@@ -1250,30 +1250,9 @@ statext_is_compatible_clause(PlannerInfo
  *
  * - total selectivity: Selectivity covered by the whole MCV list.
  *
- * - other selectivity: A selectivity estimate for data not covered by the MCV
- * list (i.e. satisfying the clauses, but not common enough to make it into
- * the MCV list)
- *
- * Note: While simple and base selectivities are defined in a quite similar
- * way, the values are computed differently and are not therefore equal. The
- * simple selectivity is computed as a product of per-clause estimates, while
- * the base selectivity is computed by adding up base frequencies of matching
- * items of the multi-column MCV list. So the values may differ for two main
- * reasons - (a) the MCV list may not cover 100% of the data and (b) some of
- * the MCV items did not match the estimated clauses.
- *
- * As both (a) and (b) reduce the base selectivity value, it generally holds
- * that (simple_selectivity >= base_selectivity). If the MCV list covers all
- * the data, the values may be equal.
- *
- * So, (simple_selectivity - base_selectivity) is an estimate for the part
- * not covered by the MCV list, and (mcv_selectivity - base_selectivity) may
- * be seen as a correction for the part covered by the MCV list. Those two
- * statements are actually equivalent.
- *
- * Note: Due to rounding errors and minor differences in how the estimates
- * are computed, the inequality may not always hold. Which is why we clamp
- * the selectivities to prevent strange estimate (negative etc.).
+ * These are passed to mcv_combine_selectivities() which combines them to
+ * produce a selectivity estimate that makes use of both per-column statistics
+ * and the multi-column MCV statistics.
  *
  * 'estimatedclauses' is an input/output parameter.  We set bits for the
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
@@ -1282,16 +1261,17 @@ statext_is_compatible_clause(PlannerInfo
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
-								   RelOptInfo *rel, Bitmapset **estimatedclauses)
+								   RelOptInfo *rel, Bitmapset **estimatedclauses,
+								   bool is_or)
 {
 	ListCell   *l;
 	Bitmapset **list_attnums;
 	int			listidx;
-	Selectivity sel = 1.0;
+	Selectivity sel = (is_or) ? 0.0 : 1.0;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
-		return 1.0;
+		return sel;
 
 	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 										 list_length(clauses));
@@ -1327,12 +1307,7 @@ statext_mcv_clauselist_selectivity(Plann
 	{
 		StatisticExtInfo *stat;
 		List	   *stat_clauses;
-		Selectivity simple_sel,
-					mcv_sel,
-					mcv_basesel,
-					mcv_totalsel,
-					other_sel,
-					stat_sel;
+		Bitmapset  *simple_clauses;
 
 		/* find the best suited statistics object for these attnums */
 		stat = choose_best_statistics(rel->statlist, STATS_EXT_MCV,
@@ -1351,6 +1326,9 @@ statext_mcv_clauselist_selectivity(Plann
 		/* now filter the clauses to be estimated using the selected MCV */
 		stat_clauses = NIL;
 
+		/* record which clauses are simple (single column) */
+		simple_clauses = NULL;
+
 		listidx = 0;
 		foreach(l, clauses)
 		{
@@ -1361,6 +1339,10 @@ statext_mcv_clauselist_selectivity(Plann
 			if (list_attnums[listidx] != NULL &&
 				bms_is_subset(list_attnums[listidx], stat->keys))
 			{
+				if (bms_membership(list_attnums[listidx]) == BMS_SINGLETON)
+					simple_clauses = bms_add_member(simple_clauses,
+													list_length(stat_clauses));
+
 				stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
 				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
 
@@ -1371,40 +1353,131 @@ statext_mcv_clauselist_selectivity(Plann
 			listidx++;
 		}
 
-		/*
-		 * First compute "simple" selectivity, i.e. without the extended
-		 * statistics, and essentially assuming independence of the
-		 * columns/clauses. We'll then use the various selectivities computed
-		 * from MCV list to improve it.
-		 */
-		simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-												   jointype, sjinfo, NULL);
+		if (is_or)
+		{
+			bool	   *or_matches = NULL;
+			Selectivity simple_or_sel = 0.0;
+			MCVList    *mcv_list;
 
-		/*
-		 * Now compute the multi-column estimate from the MCV list, along with
-		 * the other selectivities (base & total selectivity).
-		 */
-		mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
-											 jointype, sjinfo, rel,
-											 &mcv_basesel, &mcv_totalsel);
+			/* Load the MCV list stored in the statistics object */
+			mcv_list = statext_mcv_load(stat->statOid);
 
-		/* Estimated selectivity of values not covered by MCV matches */
-		other_sel = simple_sel - mcv_basesel;
-		CLAMP_PROBABILITY(other_sel);
+			/*
+			 * Compute the selectivity of the ORed list of clauses by
+			 * estimating each in turn and combining them using the formula
+			 * P(A OR B) = P(A) + P(B) - P(A AND B).  This allows us to use
+			 * the multivariate MCV stats to better estimate each term.
+			 *
+			 * Each time we iterate this formula, the clause "A" above is
+			 * equal to all the clauses processed so far, combined with "OR".
+			 */
+			listidx = 0;
+			foreach(l, stat_clauses)
+			{
+				Node	   *clause = (Node *) lfirst(l);
+				Selectivity simple_sel,
+							overlap_simple_sel,
+							mcv_sel,
+							mcv_basesel,
+							overlap_mcvsel,
+							overlap_basesel,
+							mcv_totalsel,
+							clause_sel,
+							overlap_sel;
 
-		/* The non-MCV selectivity can't exceed the 1 - mcv_totalsel. */
-		if (other_sel > 1.0 - mcv_totalsel)
-			other_sel = 1.0 - mcv_totalsel;
+				/*
+				 * "Simple" selectivity of the next clause and its overlap
+				 * with any of the previous clauses.  These are our initial
+				 * estimates of P(B) and P(A AND B), assuming independence of
+				 * columns/clauses.
+				 */
+				simple_sel = clause_selectivity_ext(root, clause, varRelid,
+													jointype, sjinfo, false);
 
-		/*
-		 * Overall selectivity is the combination of MCV and non-MCV
-		 * estimates.
-		 */
-		stat_sel = mcv_sel + other_sel;
-		CLAMP_PROBABILITY(stat_sel);
+				overlap_simple_sel = simple_or_sel * simple_sel;
 
-		/* Factor the estimate from this MCV to the overall estimate. */
-		sel *= stat_sel;
+				/*
+				 * New "simple" selectivity of all clauses seen so far,
+				 * assuming independence.
+				 */
+				simple_or_sel += simple_sel - overlap_simple_sel;
+				CLAMP_PROBABILITY(simple_or_sel);
+
+				/*
+				 * Multi-column estimate of this clause using MCV statistics,
+				 * along with base and total selectivities, and corresponding
+				 * selectivities for the overlap term P(A AND B).
+				 */
+				mcv_sel = mcv_clause_selectivity_or(root, stat, mcv_list,
+													clause, &or_matches,
+													&mcv_basesel,
+													&overlap_mcvsel,
+													&overlap_basesel,
+													&mcv_totalsel);
+
+				/*
+				 * Combine the simple and multi-column estimates.
+				 *
+				 * If this clause is a simple single-column clause, then we
+				 * just use the simple selectivity estimate for it, since the
+				 * multi-column statistics are unlikely to improve on that
+				 * (and in fact could make it worse).  For the overlap, we
+				 * always make use of the multi-column statistics.
+				 */
+				if (bms_is_member(listidx, simple_clauses))
+					clause_sel = simple_sel;
+				else
+					clause_sel = mcv_combine_selectivities(simple_sel,
+														   mcv_sel,
+														   mcv_basesel,
+														   mcv_totalsel);
+
+				overlap_sel = mcv_combine_selectivities(overlap_simple_sel,
+														overlap_mcvsel,
+														overlap_basesel,
+														mcv_totalsel);
+
+				/* Factor these into the overall result */
+				sel += clause_sel - overlap_sel;
+				CLAMP_PROBABILITY(sel);
+
+				listidx++;
+			}
+		}
+		else					/* Implicitly-ANDed list of clauses */
+		{
+			Selectivity simple_sel,
+						mcv_sel,
+						mcv_basesel,
+						mcv_totalsel,
+						stat_sel;
+
+			/*
+			 * "Simple" selectivity, i.e. without any extended statistics,
+			 * essentially assuming independence of the columns/clauses.
+			 */
+			simple_sel = clauselist_selectivity_ext(root, stat_clauses,
+													varRelid, jointype,
+													sjinfo, false);
+
+			/*
+			 * Multi-column estimate using MCV statistics, along with base and
+			 * total selectivities.
+			 */
+			mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses,
+												 varRelid, jointype, sjinfo,
+												 rel, &mcv_basesel,
+												 &mcv_totalsel);
+
+			/* Combine the simple and multi-column estimates. */
+			stat_sel = mcv_combine_selectivities(simple_sel,
+												 mcv_sel,
+												 mcv_basesel,
+												 mcv_totalsel);
+
+			/* Factor this into the overall result */
+			sel *= stat_sel;
+		}
 	}
 
 	return sel;
@@ -1417,13 +1490,21 @@ statext_mcv_clauselist_selectivity(Plann
 Selectivity
 statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 							   JoinType jointype, SpecialJoinInfo *sjinfo,
-							   RelOptInfo *rel, Bitmapset **estimatedclauses)
+							   RelOptInfo *rel, Bitmapset **estimatedclauses,
+							   bool is_or)
 {
 	Selectivity sel;
 
 	/* First, try estimating clauses using a multivariate MCV list. */
 	sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
-											 sjinfo, rel, estimatedclauses);
+											 sjinfo, rel, estimatedclauses, is_or);
+
+	/*
+	 * Functional dependencies only work for clauses connected by AND, so for
+	 * OR clauses we're done.
+	 */
+	if (is_or)
+		return sel;
 
 	/*
 	 * Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
new file mode 100644
index 6a262f1..fae792a
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -32,6 +32,7 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 #include "utils/typcache.h"
 
@@ -1889,15 +1890,79 @@ mcv_get_match_bitmap(PlannerInfo *root,
 
 
 /*
+ * mcv_combine_selectivities
+ * 		Combine per-column and multi-column MCV selectivity estimates.
+ *
+ * simple_sel is a "simple" selectivity estimate (produced without using any
+ * extended statistics, essentially assuming independence of columns/clauses).
+ *
+ * mcv_sel and mcv_basesel are sums of the frequencies and base frequencies of
+ * all matching MCV items.  The difference (mcv_sel - mcv_basesel) is then
+ * essentially interpreted as a correction to be added to simple_sel, as
+ * described below.
+ *
+ * mcv_totalsel is the sum of the frequencies of all MCV items (not just the
+ * matching ones).  This is used as an upper bound on the portion of the
+ * selectivity estimates not covered by the MCV statistics.
+ *
+ * Note: While simple and base selectivities are defined in a quite similar
+ * way, the values are computed differently and are not therefore equal. The
+ * simple selectivity is computed as a product of per-clause estimates, while
+ * the base selectivity is computed by adding up base frequencies of matching
+ * items of the multi-column MCV list. So the values may differ for two main
+ * reasons - (a) the MCV list may not cover 100% of the data and (b) some of
+ * the MCV items did not match the estimated clauses.
+ *
+ * As both (a) and (b) reduce the base selectivity value, it generally holds
+ * that (simple_sel >= mcv_basesel). If the MCV list covers all the data, the
+ * values may be equal.
+ *
+ * So, other_sel = (simple_sel - mcv_basesel) is an estimate for the part not
+ * covered by the MCV list, and (mcv_sel - mcv_basesel) may be seen as a
+ * correction for the part covered by the MCV list. Those two statements are
+ * actually equivalent.
+ */
+Selectivity
+mcv_combine_selectivities(Selectivity simple_sel,
+						  Selectivity mcv_sel,
+						  Selectivity mcv_basesel,
+						  Selectivity mcv_totalsel)
+{
+	Selectivity other_sel;
+	Selectivity sel;
+
+	/* estimated selectivity of values not covered by MCV matches */
+	other_sel = simple_sel - mcv_basesel;
+	CLAMP_PROBABILITY(other_sel);
+
+	/* this non-MCV selectivity cannot exceed 1 - mcv_totalsel */
+	if (other_sel > 1.0 - mcv_totalsel)
+		other_sel = 1.0 - mcv_totalsel;
+
+	/* overall selectivity is the sum of the MCV and non-MCV parts */
+	sel = mcv_sel + other_sel;
+	CLAMP_PROBABILITY(sel);
+
+	return sel;
+}
+
+
+/*
  * mcv_clauselist_selectivity
- *		Return the selectivity estimate computed using an MCV list.
+ *		Use MCV statistics to estimate the selectivity of an implicitly-ANDed
+ *		list of clauses.
  *
- * First builds a bitmap of MCV items matching the clauses, and then sums
- * the frequencies of matching items.
+ * This determines which MCV items match every clause in the list and returns
+ * the sum of the frequencies of those items.
  *
- * It also produces two additional interesting selectivities - total
- * selectivity of all the MCV items (not just the matching ones), and the
- * base frequency computed on the assumption of independence.
+ * In addition, it returns the sum of the base frequencies of each of those
+ * items (that is the sum of the selectivities that each item would have if
+ * the columns were independent of one another), and the total selectivity of
+ * all the MCV items (not just the matching ones).  These are expected to be
+ * used together with a "simple" selectivity estimate (one based only on
+ * per-column statistics) to produce an overall selectivity estimate that
+ * makes use of both per-column and multi-column statistics --- see
+ * mcv_combine_selectivities().
  */
 Selectivity
 mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
@@ -1928,7 +1993,6 @@ mcv_clauselist_selectivity(PlannerInfo *
 
 		if (matches[i] != false)
 		{
-			/* XXX Shouldn't the basesel be outside the if condition? */
 			*basesel += mcv->items[i].base_frequency;
 			s += mcv->items[i].frequency;
 		}
@@ -1936,3 +2000,94 @@ mcv_clauselist_selectivity(PlannerInfo *
 
 	return s;
 }
+
+
+/*
+ * mcv_clause_selectivity_or
+ *		Use MCV statistics to estimate the selectivity of a clause that
+ *		appears in an ORed list of clauses.
+ *
+ * As with mcv_clauselist_selectivity() this determines which MCV items match
+ * the clause and returns both the sum of the frequencies and the sum of the
+ * base frequencies of those items, as well as the sum of the frequencies of
+ * all MCV items (not just the matching ones) so that this information can be
+ * used by mcv_combine_selectivities() to produce a selectivity estimate that
+ * makes use of both per-column and multi-column statistics.
+ *
+ * Additionally, we return information to help compute the overall selectivity
+ * of the ORed list of clauses assumed to contain this clause.  This function
+ * is intended to be called for each clause in the ORed list of clauses,
+ * allowing the overall selectivity to be computed using the following
+ * algorithm:
+ *
+ * Suppose P[n] = P(C[1] OR C[2] OR ... OR C[n]) is the combined selectivity
+ * of the first n clauses in the list.  Then the combined selectivity taking
+ * into account the next clause C[n+1] can be written as
+ *
+ *		P[n+1] = P[n] + P(C[n+1]) - P((C[1] OR ... OR C[n]) AND C[n+1])
+ *
+ * The final term above represents the overlap between the clauses examined so
+ * far and the (n+1)'th clause.  To estimate its selectivity, we track the
+ * match bitmap for the ORed list of clauses examined so far and examine its
+ * intersection with the match bitmap for the (n+1)'th clause.
+ *
+ * We then also return the sums of the MCV item frequencies and base
+ * frequencies for the match bitmap intersection corresponding to the overlap
+ * term above, so that they can be combined with a simple selectivity estimate
+ * for that term.
+ *
+ * The parameter "or_matches" is an in/out parameter tracking the match bitmap
+ * for the clauses examined so far.  The caller is expected to set it to NULL
+ * the first time it calls this function.
+ */
+Selectivity
+mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
+						  MCVList *mcv, Node *clause, bool **or_matches,
+						  Selectivity *basesel, Selectivity *overlap_mcvsel,
+						  Selectivity *overlap_basesel, Selectivity *totalsel)
+{
+	Selectivity s = 0.0;
+	bool	   *new_matches;
+	int			i;
+
+	/* build the OR-matches bitmap, if not built already */
+	if (*or_matches == NULL)
+		*or_matches = palloc0(sizeof(bool) * mcv->nitems);
+
+	/* build the match bitmap for the new clause */
+	new_matches = mcv_get_match_bitmap(root, list_make1(clause), stat->keys,
+									   mcv, false);
+
+	/*
+	 * Sum the frequencies for all the MCV items matching this clause and also
+	 * those matching the overlap between this clause and any of the preceding
+	 * clauses as described above.
+	 */
+	*basesel = 0.0;
+	*overlap_mcvsel = 0.0;
+	*overlap_basesel = 0.0;
+	*totalsel = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		*totalsel += mcv->items[i].frequency;
+
+		if (new_matches[i])
+		{
+			s += mcv->items[i].frequency;
+			*basesel += mcv->items[i].base_frequency;
+
+			if ((*or_matches)[i])
+			{
+				*overlap_mcvsel += mcv->items[i].frequency;
+				*overlap_basesel += mcv->items[i].base_frequency;
+			}
+		}
+
+		/* update the OR-matches bitmap for the next clause */
+		(*or_matches)[i] = (*or_matches)[i] || new_matches[i];
+	}
+
+	pfree(new_matches);
+
+	return s;
+}
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
new file mode 100644
index 3e41710..dea0e73
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -58,17 +58,23 @@ extern Selectivity clause_selectivity(Pl
 									  int varRelid,
 									  JoinType jointype,
 									  SpecialJoinInfo *sjinfo);
-extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
-												 List *clauses,
-												 int varRelid,
-												 JoinType jointype,
-												 SpecialJoinInfo *sjinfo,
-												 Bitmapset *estimatedclauses);
+extern Selectivity clause_selectivity_ext(PlannerInfo *root,
+										  Node *clause,
+										  int varRelid,
+										  JoinType jointype,
+										  SpecialJoinInfo *sjinfo,
+										  bool use_extended_stats);
 extern Selectivity clauselist_selectivity(PlannerInfo *root,
 										  List *clauses,
 										  int varRelid,
 										  JoinType jointype,
 										  SpecialJoinInfo *sjinfo);
+extern Selectivity clauselist_selectivity_ext(PlannerInfo *root,
+											  List *clauses,
+											  int varRelid,
+											  JoinType jointype,
+											  SpecialJoinInfo *sjinfo,
+											  bool use_extended_stats);
 
 /* in path/costsize.c: */
 
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
new file mode 100644
index 61e6969..02bf6a0
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -99,6 +99,11 @@ extern SortItem *build_sorted_items(int
 extern bool examine_clause_args(List *args, Var **varp,
 								Const **cstp, bool *varonleftp);
 
+extern Selectivity mcv_combine_selectivities(Selectivity simple_sel,
+											 Selectivity mcv_sel,
+											 Selectivity mcv_basesel,
+											 Selectivity mcv_totalsel);
+
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  StatisticExtInfo *stat,
 											  List *clauses,
@@ -109,4 +114,14 @@ extern Selectivity mcv_clauselist_select
 											  Selectivity *basesel,
 											  Selectivity *totalsel);
 
+extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
+											 StatisticExtInfo *stat,
+											 MCVList *mcv,
+											 Node *clause,
+											 bool **or_matches,
+											 Selectivity *basesel,
+											 Selectivity *overlap_mcvsel,
+											 Selectivity *overlap_basesel,
+											 Selectivity *totalsel);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
new file mode 100644
index 50fce49..c9ed211
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_se
 												  JoinType jointype,
 												  SpecialJoinInfo *sjinfo,
 												  RelOptInfo *rel,
-												  Bitmapset **estimatedclauses);
+												  Bitmapset **estimatedclauses,
+												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
new file mode 100644
index 4c3edd2..dbbe984
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1113,6 +1113,12 @@ SELECT * FROM check_estimated_rows('SELE
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
  estimated | actual 
 -----------+--------
@@ -1173,13 +1179,6 @@ SELECT * FROM check_estimated_rows('SELE
        100 |    100
 (1 row)
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual 
------------+--------
-       343 |    200
-(1 row)
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 SELECT d.stxdmcv IS NOT NULL
@@ -1477,6 +1476,134 @@ SELECT * FROM check_estimated_rows('SELE
          1 |      0
 (1 row)
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+         1 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+       300 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         6 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       204 |    104
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0 AND c = 0) OR (a = 1 AND b = 1 AND c = 1) OR (a = 2 AND b = 2 AND c = 2)');
+ estimated | actual 
+-----------+--------
+         1 |    306
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0) OR (a = 0 AND c = 0) OR (b = 0 AND c = 0)');
+ estimated | actual 
+-----------+--------
+         6 |    102
+(1 row)
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       102 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+        96 |    102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual 
+-----------+--------
+         2 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual 
+-----------+--------
+         2 |      2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual 
+-----------+--------
+       102 |    104
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0 AND c = 0) OR (a = 1 AND b = 1 AND c = 1) OR (a = 2 AND b = 2 AND c = 2)');
+ estimated | actual 
+-----------+--------
+       300 |    306
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0) OR (a = 0 AND c = 0) OR (b = 0 AND c = 0)');
+ estimated | actual 
+-----------+--------
+       306 |    102
+(1 row)
+
+DROP TABLE mcv_lists_partial;
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -1506,12 +1633,36 @@ SELECT * FROM check_estimated_rows('SELE
        102 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
          4 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+       298 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      2649 |   1572
+(1 row)
+
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
 CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1528,12 +1679,36 @@ SELECT * FROM check_estimated_rows('SELE
        714 |    714
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual 
+-----------+--------
+       143 |    142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
  estimated | actual 
 -----------+--------
        143 |    142
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual 
+-----------+--------
+      1571 |   1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual 
+-----------+--------
+      1714 |   1572
+(1 row)
+
 DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
new file mode 100644
index 9781e59..7912e73
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -561,6 +561,8 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -581,9 +583,6 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
 
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
 -- check change of unrelated column type does not reset the MCV statistics
 ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
 
@@ -777,6 +776,78 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+    a INT,
+    b INT,
+    c INT
+);
+
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         mod(i,10),
+         mod(i,10),
+         mod(i,10)
+     FROM generate_series(0,999) s(i);
+
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,99) s(i);
+
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+     SELECT
+         i,
+         i,
+         i
+     FROM generate_series(0,3999) s(i);
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0 AND c = 0) OR (a = 1 AND b = 1 AND c = 1) OR (a = 2 AND b = 2 AND c = 2)');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0) OR (a = 0 AND c = 0) OR (b = 0 AND c = 0)');
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+  FROM mcv_lists_partial;
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0 AND c = 0) OR (a = 1 AND b = 1 AND c = 1) OR (a = 2 AND b = 2 AND c = 2)');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0) OR (a = 0 AND c = 0) OR (b = 0 AND c = 0)');
+
+DROP TABLE mcv_lists_partial;
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
@@ -799,7 +870,11 @@ ANALYZE mcv_lists_multi;
 -- estimates without any mcv statistics
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 -- create separate MCV statistics
 CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -809,7 +884,11 @@ ANALYZE mcv_lists_multi;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
 
 DROP TABLE mcv_lists_multi;

#37

Tomas Vondra

tomas.vondra@enterprisedb.com

about 5 years ago

In reply to: Dean Rasheed (#36)

Re: Additional improvements to extended statistics

On 12/2/20 4:51 PM, Dean Rasheed wrote:

On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I wonder how much of the comment before clauselist_selectivity should
move to clauselist_selectivity_ext - it does talk about range clauses
and so on, but clauselist_selectivity does not really deal with that.
But maybe that's just an implementation detail and it's better to keep
the comment the way it is.

I think it's better to keep it the way it is, so that the entirety of
what clauselist_selectivity() does (via clauselist_selectivity_ext())
can be read in one place, but I have added separate comments for the
new "_ext" functions to explain how they differ. That matches similar
examples elsewhere.

I noticed this outdated comment:

/* Always compute the selectivity using clause_selectivity */
s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,

Updated.

Also, the comment at clauselist_selectivity_or seems to not follow the
usual pattern, which I think is

/*
* function name
* short one-sentence description
*
* ... longer description ...
*/

Hmm, it seems OK to me. The first part is basically copied from
clauselist_selectivity(). The "longer description" doesn't really have
much more to say because it's much simpler than
clauselist_selectivity(), but it seems neater to keep the two roughly
in sync.

I see. In that case it's OK, I guess.

I've been hacking on this a bit more and attached is an updated
(hopefully final) version with some other comment improvements and
also a couple of other tweaks:

The previous version had duplicated code blocks that implemented the
same MCV-correction algorithm using simple_sel, mcv_sel, base_sel,
other_sel and total_sel, which was quite messy. So I refactored that
into a new function mcv_combine_selectivities(). About half the
comments from statext_mcv_clauselist_selectivity() then move over to
mcv_combine_selectivities(). I also updated the comments for
mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to
explain how their outputs are expected to be used by
mcv_combine_selectivities(). That hopefully makes for a clean
separation of concerns, and makes it easier to tweak the way MCV stats
are applied on top of simple stats, if someone thinks of a better
approach in the future.

In the previous version, for an ORed list of clauses, the MCV
correction was only applied to the overlaps between clauses. That's OK
as long as each clause only refers to a single column, since the
per-column statistics ought to be the best way to estimate each
individual clause in that case. However, if the individual clauses
refer to more than one column, I think the MCV correction should be
applied to each individual clause as well as to the overlaps. That
turns out to be pretty straightforward, since we're already doing all
the hard work of computing the match bitmap for each clause. The sort
of queries I had in mind were things like this:

WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)

I added a new test case along those lines and the new estimates are
much better than they are without this patch, but not for the reason I
thought --- the old code consistently over-estimated queries like that
because it actually applied the MCV correction twice (once while
processing each AND list, via clause_selectivity(), called from
clauselist_selectivity_simple(), and once for the top-level OR clause,
contained in a single-element implicitly-ANDed list). The way the new
code is structured avoids any kind of double application of extended
stats, producing a much better estimate, which is good.

Nice.

However, the new code doesn't apply the extended stats directly using
clauselist_selectivity_or() for this kind of query because there are
no RestrictInfos for the nested AND clauses, so
find_single_rel_for_clauses() (and similarly
statext_is_compatible_clause()) regards those clauses as not
compatible with extended stats. So what ends up happening is that
extended stats are used only when we descend down to the two AND
clauses, and their results are combined using the original "s1 + s2 -
s1 * s2" formula. That actually works OK in this case, because there
is no overlap between the two AND clauses, but it wouldn't work so
well if there was.

I'm pretty sure that can be fixed by teaching
find_single_rel_for_clauses() and statext_is_compatible_clause() to
handle BoolExpr clauses, looking for RestrictInfos underneath them,
but I think that should be left for a follow-in patch. I have left a
regression test in place, whose estimates ought to be improved by such
a fix.

Yeah, I agree with leaving this for a separate patch. We can't do
everything at the same time.

The upshot of all that is that the new code that applies the MCV
correction to the individual clauses in an ORed list doesn't help with
queries like the one above at the moment, and it's not obvious whether
it is currently reachable, but I think it's worth leaving in because
it seems more principled, and makes that code more future-proof. I
also think it's neater because now the signature of
mcv_clause_selectivity_or() is more natural --- it's primary return
value is now the clause's MCV selectivity, as suggested by the
function's name, rather than the overlap selectivity that the previous
version was returning. Also, after your "Var Op Var" patch is applied,
I think it would be possible to construct queries that would benefit
from this, so it would be good to get that committed too.

Barring any further comments, I'll push this sometime soon.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#38

Dean Rasheed

dean.a.rasheed@gmail.com

about 5 years ago

In reply to: Tomas Vondra (#37)

Re: Additional improvements to extended statistics

On Wed, 2 Dec 2020 at 16:34, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 12/2/20 4:51 PM, Dean Rasheed wrote:

Barring any further comments, I'll push this sometime soon.

+1

Pushed.

Regards,
Dean

#39

Dean Rasheed

dean.a.rasheed@gmail.com

about 5 years ago

In reply to: Dean Rasheed (#36)

2 attachment(s)

Re: Additional improvements to extended statistics

On Wed, 2 Dec 2020 at 15:51, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

The sort of queries I had in mind were things like this:

WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)

However, the new code doesn't apply the extended stats directly using
clauselist_selectivity_or() for this kind of query because there are
no RestrictInfos for the nested AND clauses, so
find_single_rel_for_clauses() (and similarly
statext_is_compatible_clause()) regards those clauses as not
compatible with extended stats. So what ends up happening is that
extended stats are used only when we descend down to the two AND
clauses, and their results are combined using the original "s1 + s2 -
s1 * s2" formula. That actually works OK in this case, because there
is no overlap between the two AND clauses, but it wouldn't work so
well if there was.

I'm pretty sure that can be fixed by teaching
find_single_rel_for_clauses() and statext_is_compatible_clause() to
handle BoolExpr clauses, looking for RestrictInfos underneath them,
but I think that should be left for a follow-in patch.

Attached is a patch doing that, which improves a couple of the
estimates for queries with AND clauses underneath OR clauses, as
expected.

This also revealed a minor bug in the way that the estimates for
multiple statistics objects were combined while processing an OR
clause -- the estimates for the overlaps between clauses only apply
for the current statistics object, so we really have to combine the
estimates for each set of clauses for each statistics object as if
they were independent of one another.

0001 fixes the multiple-extended-stats issue for OR clauses, and 0002
improves the estimates for sub-AND clauses underneath OR clauses.

These are both quite small patches, that hopefully won't interfere
with any of the other extended stats patches.

Regards,
Dean

Attachments:

0001-Improve-estimation-of-OR-clauses-using-multiple-exte.patchapplication/octet-stream; name=0001-Improve-estimation-of-OR-clauses-using-multiple-exte.patchDownload

From aa87ba523683cca702129fb68eb2519c89ca152a Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rasheed@gmail.com>
Date: Mon, 7 Dec 2020 14:42:33 +0000
Subject: [PATCH 1/2] Improve estimation of OR clauses using multiple extended
 statistics.

When estimating an OR clause using multiple extended statistics
objects, treat the estimates for each set of clauses for each
statistics object as independent of one another. The overlap estimates
produced for each statistics object do not apply to clauses covered by
other statistics objects.
---
 src/backend/statistics/extended_stats.c | 25 +++++++++++++++++--------
 src/test/regress/expected/stats_ext.out |  2 +-
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 8d3cd091ad..b6bd12c229 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1356,17 +1356,19 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 		if (is_or)
 		{
 			bool	   *or_matches = NULL;
-			Selectivity simple_or_sel = 0.0;
+			Selectivity simple_or_sel = 0.0,
+						stat_sel = 0.0;
 			MCVList    *mcv_list;
 
 			/* Load the MCV list stored in the statistics object */
 			mcv_list = statext_mcv_load(stat->statOid);
 
 			/*
-			 * Compute the selectivity of the ORed list of clauses by
-			 * estimating each in turn and combining them using the formula
-			 * P(A OR B) = P(A) + P(B) - P(A AND B).  This allows us to use
-			 * the multivariate MCV stats to better estimate each term.
+			 * Compute the selectivity of the ORed list of clauses covered by
+			 * this statistics object by estimating each in turn and combining
+			 * them using the formula P(A OR B) = P(A) + P(B) - P(A AND B).
+			 * This allows us to use the multivariate MCV stats to better
+			 * estimate the individual terms and their overlap.
 			 *
 			 * Each time we iterate this formula, the clause "A" above is
 			 * equal to all the clauses processed so far, combined with "OR".
@@ -1437,12 +1439,19 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 														overlap_basesel,
 														mcv_totalsel);
 
-				/* Factor these into the overall result */
-				sel += clause_sel - overlap_sel;
-				CLAMP_PROBABILITY(sel);
+				/* Factor these into the result for this statistics object */
+				stat_sel += clause_sel - overlap_sel;
+				CLAMP_PROBABILITY(stat_sel);
 
 				listidx++;
 			}
+
+			/*
+			 * Factor the result for this statistics object into the overall
+			 * result. We just assume that the results from each separate
+			 * statistics object are independent of one another.
+			 */
+			sel = sel + stat_sel - sel * stat_sel;
 		}
 		else					/* Implicitly-ANDed list of clauses */
 		{
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index dbbe9844b2..6e1c4f3edd 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1706,7 +1706,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 A
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
  estimated | actual 
 -----------+--------
-      1714 |   1572
+      1571 |   1572
 (1 row)
 
 DROP TABLE mcv_lists_multi;
-- 
2.26.2

0002-Improve-estimation-of-ANDs-under-ORs-using-extended-.patchapplication/octet-stream; name=0002-Improve-estimation-of-ANDs-under-ORs-using-extended-.patchDownload

From 21634e391305a6e5f029068af62ebf6baf450037 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rasheed@gmail.com>
Date: Mon, 7 Dec 2020 14:59:34 +0000
Subject: [PATCH 2/2] Improve estimation of ANDs under ORs using extended
 statistics.

Formerly, extended statistics only handled clauses that were
RestrictInfos. However, the restrictinfo machinery doesn't create
sub-AND RestrictInfos for AND clauses underneath OR clauses.
Therefore teach extended statistics to handle bare AND clauses,
looking for compatible RestrictInfo clauses underneath them.
---
 src/backend/optimizer/path/clausesel.c  | 21 +++++++++++++++++++++
 src/backend/statistics/extended_stats.c | 25 +++++++++++++++++++++++++
 src/test/regress/expected/stats_ext.out |  4 ++--
 3 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index b88b29ec23..a7e535c27f 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -538,7 +538,28 @@ find_single_rel_for_clauses(PlannerInfo *root, List *clauses)
 		 * However, currently the extended-stats machinery won't do anything
 		 * with non-RestrictInfo clauses anyway, so there's no point in
 		 * spending extra cycles; just fail if that's what we have.
+		 *
+		 * An exception to that rule is if we have a bare BoolExpr AND clause.
+		 * We treat this as a special case because the restrictinfo machinery
+		 * doesn't build RestrictInfos on top of AND clauses.
 		 */
+		if (is_andclause(rinfo))
+		{
+			RelOptInfo *rel;
+
+			rel = find_single_rel_for_clauses(root,
+											  ((BoolExpr *) rinfo)->args);
+
+			if (rel == NULL)
+				return NULL;
+			if (lastrelid == 0)
+				lastrelid = rel->relid;
+			else if (rel->relid != lastrelid)
+				return NULL;
+
+			continue;
+		}
+
 		if (!IsA(rinfo, RestrictInfo))
 			return NULL;
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b6bd12c229..9745a2f248 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1174,6 +1174,31 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
 	RestrictInfo *rinfo = (RestrictInfo *) clause;
 	Oid			userid;
 
+	/*
+	 * Special-case handling for bare BoolExpr AND clauses, because the
+	 * restrictinfo machinery doesn't build RestrictInfos on top of AND
+	 * clauses.
+	 */
+	if (is_andclause(clause))
+	{
+		BoolExpr   *expr = (BoolExpr *) clause;
+		ListCell   *lc;
+
+		/*
+		 * Check that each sub-clause is compatible.  We expect these to be
+		 * RestrictInfos.
+		 */
+		foreach(lc, expr->args)
+		{
+			if (!statext_is_compatible_clause(root, (Node *) lfirst(lc),
+											  relid, attnums))
+				return false;
+		}
+
+		return true;
+	}
+
+	/* Otherwise it must be a RestrictInfo. */
 	if (!IsA(rinfo, RestrictInfo))
 		return false;
 
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 6e1c4f3edd..7bfeaf85f0 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1594,13 +1594,13 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0 AND c = 0) OR (a = 1 AND b = 1 AND c = 1) OR (a = 2 AND b = 2 AND c = 2)');
  estimated | actual 
 -----------+--------
-       300 |    306
+       306 |    306
 (1 row)
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0) OR (a = 0 AND c = 0) OR (b = 0 AND c = 0)');
  estimated | actual 
 -----------+--------
-       306 |    102
+       108 |    102
 (1 row)
 
 DROP TABLE mcv_lists_partial;
-- 
2.26.2

#40

Tomas Vondra

tomas.vondra@enterprisedb.com

about 5 years ago

In reply to: Dean Rasheed (#39)

Re: Additional improvements to extended statistics

On 12/7/20 5:15 PM, Dean Rasheed wrote:

On Wed, 2 Dec 2020 at 15:51, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

The sort of queries I had in mind were things like this:

WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)

However, the new code doesn't apply the extended stats directly using
clauselist_selectivity_or() for this kind of query because there are
no RestrictInfos for the nested AND clauses, so
find_single_rel_for_clauses() (and similarly
statext_is_compatible_clause()) regards those clauses as not
compatible with extended stats. So what ends up happening is that
extended stats are used only when we descend down to the two AND
clauses, and their results are combined using the original "s1 + s2 -
s1 * s2" formula. That actually works OK in this case, because there
is no overlap between the two AND clauses, but it wouldn't work so
well if there was.

I'm pretty sure that can be fixed by teaching
find_single_rel_for_clauses() and statext_is_compatible_clause() to
handle BoolExpr clauses, looking for RestrictInfos underneath them,
but I think that should be left for a follow-in patch.

Attached is a patch doing that, which improves a couple of the
estimates for queries with AND clauses underneath OR clauses, as
expected.

This also revealed a minor bug in the way that the estimates for
multiple statistics objects were combined while processing an OR
clause -- the estimates for the overlaps between clauses only apply
for the current statistics object, so we really have to combine the
estimates for each set of clauses for each statistics object as if
they were independent of one another.

0001 fixes the multiple-extended-stats issue for OR clauses, and 0002
improves the estimates for sub-AND clauses underneath OR clauses.

Cool! Thanks for taking time to investigate and fixing those. Both
patches seem fine to me.

These are both quite small patches, that hopefully won't interfere
with any of the other extended stats patches.

I haven't tried, but it should not interfere with it too much.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company