using extended statistics to improve join estimates

Started by Tomas Vondraalmost 5 years ago33 messages

tomas.vondra@enterprisedb.com

almost 5 years ago

1 attachment(s)

Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1]/messages/by-id/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==================================================================

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
FROM generate_series(1,100000) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==================================================================

The results look like this:

- actual rows: 100000000
- estimated (no stats): 1003638
- estimated (stats, 100): 100247844
- estimated (stats, 10k): 100000000

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lits,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column), but there are dependencies
between that column and the other filters. Or maybe when there are
filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

Q1 Q2
- actual rows: 25000000 0
- estimated (no stats): 62158 60241
- estimated (stats, 100): 25047900 1
- estimated (stats, 10k): 25000000 1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX, describing stuff I
ignored to keep it simple, possible future improvement. And so on.

regards

[1]: /messages/by-id/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru
/messages/by-id/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

extended-stats-joins-poc.patchtext/x-patch; charset=UTF-8; name=extended-stats-joins-poc.patchDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutualy exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 8c75690fce..fec11ad9b5 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -1154,6 +1155,36 @@ has_stats_of_kind(List *stats, char requiredkind)
 	return false;
 }
 
+/*
+ * has_matching_mcv
+ *		Check whether the list contains statistic of a given kind
+ *
+ * XXX Should consider both attnums and expressions. Also should consider
+ * restrictinfos as conditions.
+ */
+StatisticExtInfo *
+find_matching_mcv(List *stats, Bitmapset *attnums)
+{
+	ListCell   *l;
+	StatisticExtInfo *found = NULL;
+elog(WARNING, "find_matching_mcv %d", bms_num_members(attnums));
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		if (!bms_is_subset(attnums, stat->keys))
+			continue;
+
+		if (!found || (bms_num_members(found->keys) > bms_num_members(stat->keys)))
+			found = stat;
+	}
+
+	return found;
+}
+
 /*
  * stat_find_expression
  *		Search for an expression in statistics object's list of expressions.
@@ -2571,3 +2602,642 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+static bool *
+statext_mcv_eval_conditions(PlannerInfo *root, RelOptInfo *rel,
+							StatisticExtInfo *stat, MCVList *mcv,
+							Selectivity *sel)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause has to be covered by the statistics object
+		 *
+		 * FIXME handle expressions properly
+		 */
+		if (!bms_is_subset(attnums, stat->keys))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	/* everything matches by default */
+	*sel = 1.0;
+
+	if (!conditions)
+		return NULL;
+
+	/* what's the selectivity of the conditions alone? */
+	*sel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+
+	return mcv_get_match_bitmap(root, conditions, stat->keys, stat->exprs,
+								mcv, false);
+}
+
+static double
+statext_ndistinct_estimate(PlannerInfo *root, RelOptInfo *rel, List *clauses)
+{
+	ListCell *lc;
+
+	List *exprs = NIL;
+
+	foreach (lc, clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opexpr;
+
+		if (!is_opclause(clause))
+			continue;
+
+		opexpr = (OpExpr *) clause;
+
+		if (list_length(opexpr->args) != 2)
+			continue;
+
+		foreach (lc2, opexpr->args)
+		{
+			Node *expr = (Node *) lfirst(lc2);
+			Bitmapset *varnos = pull_varnos(root, expr);
+
+			if (bms_singleton_member(varnos) == rel->relid)
+				exprs = lappend(exprs, expr);
+		}
+	}
+
+	return estimate_num_groups(root, exprs, rel->rows, NULL, NULL);
+}
+
+/*
+ * statext_compare_mcvs
+ *		Calculte join selectivity using extended statistics, similarly to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing
+ * a conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ */
+Selectivity
+statext_compare_mcvs(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	MCVList *mcv1;
+	MCVList *mcv2;
+	int		i, j;
+	Selectivity s = 0;
+
+	/* items eliminated by conditions (if any) */
+	bool   *conditions1 = NULL,
+		   *conditions2 = NULL;
+
+	double	conditions1_sel = 1.0,
+			conditions2_sel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	double	matchfreq1,
+			unmatchfreq1,
+			matchfreq2,
+			unmatchfreq2,
+			otherfreq1,
+			mcvfreq1,
+			otherfreq2,
+			mcvfreq2;
+
+	double	nd1,
+			nd2;
+
+	double	totalsel1,
+			totalsel2;
+
+	mcv1 = statext_mcv_load(stat1->statOid);
+	mcv2 = statext_mcv_load(stat2->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/* apply baserestrictinfo conditions on the MCV lists */
+
+	conditions1 = statext_mcv_eval_conditions(root, rel1, stat1, mcv1,
+											  &conditions1_sel);
+
+	conditions2 = statext_mcv_eval_conditions(root, rel2, stat2, mcv2,
+											  &conditions2_sel);
+
+	/*
+	 * Match items from the two MCV lits.
+	 *
+	 * We don't know if the matches are 1:1 - we may have overlap on only
+	 * a subset of attributes, e.g. (a,b,c) vs. (b,c,d), so there may be
+	 * multiple matches.
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (conditions1 && !conditions1[i])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			ListCell   *lc;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (conditions2 && !conditions2[j])
+				continue;
+
+			foreach (lc, clauses)
+			{
+				Node *clause = (Node *) lfirst(lc);
+				Bitmapset  *atts1 = NULL;
+				Bitmapset  *atts2 = NULL;
+				Datum		value1, value2;
+				int			index1, index2;
+				AttrNumber	attnum1;
+				AttrNumber	attnum2;
+				bool		match;
+
+				FmgrInfo	opproc;
+				OpExpr	   *expr = (OpExpr *) clause;
+
+				Assert(is_opclause(clause));
+
+				fmgr_info(get_opcode(expr->opno), &opproc);
+
+				/* determine the columns in each statistics object */
+
+				pull_varattnos(clause, rel1->relid, &atts1);
+				attnum1 = bms_singleton_member(atts1) + FirstLowInvalidHeapAttributeNumber;
+				index1 = bms_member_index(stat1->keys, attnum1);
+
+				pull_varattnos(clause, rel2->relid, &atts2);
+				attnum2 = bms_singleton_member(atts2) + FirstLowInvalidHeapAttributeNumber;
+				index2 = bms_member_index(stat2->keys, attnum2);
+
+				/* if either value is null, we're done */
+				if (mcv1->items[i].isnull[index1] || mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * FIXME Might have issues with order of parameters, but for
+					 * same-type equality that should not matter.
+					 * */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   InvalidOid,
+														   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+			}
+
+			if (items_match)
+			{
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+
+				/* XXX Do we need to do something about base frequency? */
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		if (conditions1 && !conditions1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		if (conditions2 && !conditions2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1 - mcvfreq2;
+
+	/* correction for MCV parts eliminated by the conditions */
+	s = s * mcvfreq1 * mcvfreq2 / (matchfreq1 + unmatchfreq1) / (matchfreq2 + unmatchfreq2);
+
+	nd1 = statext_ndistinct_estimate(root, rel1, clauses);
+	nd2 = statext_ndistinct_estimate(root, rel2, clauses);
+
+	/*
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. Moreover, we need to look
+	 * at the conditions. So instead we simply assume the conditions
+	 * affect the distinct groups, and use that.
+	 */
+	nd1 *= conditions1_sel;
+	nd2 *= conditions2_sel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+static bool
+is_supported_join_clause(Node *clause)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+
+	/* XXX Not sure we can rely on only getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* skip clauses that don't link two base relations */
+	if (bms_num_members(rinfo->clause_relids) != 2)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	oprsel = get_oprjoin(((OpExpr *) clause)->opno);
+
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * FIXME More thorought check that it's Var = Var or something like
+	 * that with expressions. Maybe also check that both relations have
+	 * extended statistics (no point in matching without it).
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/* XXX see treat_as_join_clause */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+		listidx++;
+
+		/* skip estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/* is_supported_join_clause ensures we have RestrictInfo */
+		if (!is_supported_join_clause(clause))
+			continue;
+
+		rinfo = (RestrictInfo *) clause;
+
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * Check that at least some of the rels have extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check
+	 * how compatible they are (e.g. that both have MCVs, etc.). Also,
+	 * maybe this should cross-check the exact pairs of rels with a join
+	 * clause between them?
+	 *
+	 * XXX We could also check if there are enough parameters in each rel
+	 * to consider extended stats. If there's just a single attribute, it's
+	 * probably better to use just regular statistics. OTOH we can also
+	 * consider restriction clauses from baserestrictinfo and use them
+	 * to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/* Information about a join between two relations. */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ */
+static JoinPairInfo *
+statext_build_join_pairs(List *clauses, Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc.
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/* is_supported_join_clause ensures it's a restrict info */
+		if (!is_supported_join_clause(clause))
+			continue;
+
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  Bitmapset **attnums, StatisticExtInfo **stat);
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(clauses, *estimatedclauses, &ninfo);
+elog(WARNING, "statext_build_join_pairs = %p", info);
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+		Bitmapset  *attnos1 = NULL;
+		Bitmapset  *attnos2 = NULL;
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		ListCell *lc;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &attnos1, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &attnos2, &stat2);
+
+		/* XXX only handling case with MCV on both sides for now */
+		if (!stat1 || !stat2)
+			continue;
+
+		s *= statext_compare_mcvs(root, rel1, rel2, stat1, stat2, info[i].clauses);
+
+		/*
+		 * Now mark all the clauses for this pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
+
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  Bitmapset **attnums, StatisticExtInfo **stat)
+{
+	int	k;
+	int	relid;
+	RelOptInfo *rel;
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * extract attnums from the clauses, and remove the offset (we don't
+	 * bother with system attributes)
+	 *
+	 * FIXME This is wrong, we need to match the clauses to both attnums
+	 * and expressions to extended statistics objects.
+	 */
+	pull_varattnos((Node *) info->clauses, relid, attnums);
+
+	k = -1;
+	while ((k = bms_next_member(*attnums, k)) >= 0)
+	{
+		*attnums = bms_del_member(*attnums, k);
+		*attnums = bms_add_member(*attnums, k + FirstLowInvalidHeapAttributeNumber);
+	}
+
+	*stat = find_matching_mcv(rel->statlist, *attnums);
+
+	return rel;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 2a00fb4848..5410360653 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1597,7 +1597,7 @@ mcv_match_expression(Node *expr, Bitmapset *keys, List *exprs, Oid *collid)
  * & and |, which should be faster than min/max. The bitmaps are fairly
  * small, though (thanks to the cap on the MCV list size).
  */
-static bool *
+bool *
 mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					 Bitmapset *keys, List *exprs,
 					 MCVList *mcvlist, bool is_or)
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 55cd9252a5..072085365c 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -127,4 +127,8 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern bool *mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
+								  Bitmapset *keys, List *exprs,
+								  MCVList *mcvlist, bool is_or);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 326cf26fea..967b2ff0db 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -120,10 +120,25 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  Bitmapset **estimatedclauses,
 												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
+extern StatisticExtInfo *find_matching_mcv(List *stats, Bitmapset *attnums);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
 												List **clause_exprs,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, int idx);
 
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
+extern Selectivity statext_compare_mcvs(PlannerInfo *root,
+										RelOptInfo *rela, RelOptInfo *relb,
+										StatisticExtInfo *sa, StatisticExtInfo *sb,
+										List *clauses);
+
 #endif							/* STATISTICS_H */

Zhihong Yu

zyu@yugabyte.com

almost 5 years ago

In reply to: Tomas Vondra (#1)

Re: using extended statistics to improve join estimates

Hi,

+ * has_matching_mcv
+ *     Check whether the list contains statistic of a given kind

The method name is find_matching_mcv(). It seems the method initially
returned bool but later the return type was changed.

+ StatisticExtInfo *found = NULL;

found normally is associated with bool return value. Maybe call the
variable matching_mcv or something similar.

+           /* skip items eliminated by restrictions on rel2 */
+           if (conditions2 && !conditions2[j])
+               continue;

Maybe you can add a counter recording the number of non-skipped items for
the inner loop. If this counter is 0 after the completion of one iteration,
we come out of the outer loop directly.

Cheers

On Wed, Mar 31, 2021 at 10:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

Show quoted text

Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==================================================================

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
FROM generate_series(1,100000) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 10000;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==================================================================

The results look like this:

- actual rows: 100000000
- estimated (no stats): 1003638
- estimated (stats, 100): 100247844
- estimated (stats, 10k): 100000000

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lits,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column), but there are dependencies
between that column and the other filters. Or maybe when there are
filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

Q1 Q2
- actual rows: 25000000 0
- estimated (no stats): 62158 60241
- estimated (stats, 100): 25047900 1
- estimated (stats, 10k): 25000000 1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX, describing stuff I
ignored to keep it simple, possible future improvement. And so on.

regards

[1]

/messages/by-id/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Tomas Vondra

tomas.vondra@enterprisedb.com

over 4 years ago

In reply to: Tomas Vondra (#1)

1 attachment(s)

Re: using extended statistics to improve join estimates

Hi,

Here's a slightly improved / cleaned up version of the PoC patch,
removing a bunch of XXX and FIXMEs, adding comments, etc.

The approach is sound in principle, I think, although there's still a
bunch of things to address:

1) statext_compare_mcvs only really deals with equijoins / inner joins
at the moment, as it's based on eqjoinsel_inner. It's probably desirable
to add support for additional join types (inequality and outer joins).

2) Some of the steps are performed multiple times - e.g. matching base
restrictions to statistics, etc. Those probably can be cached somehow,
to reduce the overhead.

3) The logic of picking the statistics to apply is somewhat simplistic,
and maybe could be improved in some way. OTOH the number of candidate
statistics is likely low, so this is not a big issue.

4) statext_compare_mcvs is based on eqjoinsel_inner and makes a bunch of
assumptions similar to the original, but some of those assumptions may
be wrong in multi-column case, particularly when working with a subset
of columns. For example (ndistinct - size(MCV)) may not be the number of
distinct combinations outside the MCV, when ignoring some columns. Same
for nullfract, and so on. I'm not sure we can do much more than pick
some reasonable approximation.

5) There are no regression tests at the moment. Clearly a gap.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

extended-stats-joins-20210614.patchtext/x-patch; charset=UTF-8; name=extended-stats-joins-20210614.patchDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutualy exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b05e818ba9..d4cbbee785 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,16 @@ static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
+
+static List *statext_mcv_get_conditions(PlannerInfo *root,
+										RelOptInfo *rel,
+										StatisticExtInfo *info);
+
+static bool *statext_mcv_eval_conditions(PlannerInfo *root, RelOptInfo *rel,
+										 StatisticExtInfo *stat, MCVList *mcv,
+										 Selectivity *sel);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -1154,6 +1165,89 @@ has_stats_of_kind(List *stats, char requiredkind)
 	return false;
 }
 
+/*
+ * find_matching_mcv
+ *		Search for a MCV covering all the attributes.
+ *
+ * XXX Should consider both attnums and expressions. Also should consider
+ * additional restrictinfos as conditions (but only as optional).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. Maybe we could relax it a bit, and search for MCVs (on both
+ * sides of the join) with the largest overlap. But we don't really expect
+ * many candidate MCVs, so this simple approach seems sufficient.
+ */
+StatisticExtInfo *
+find_matching_mcv(PlannerInfo *root, RelOptInfo *rel, Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep it. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics and we need to pick. We'll
+		 * use two simple heuristics: We prefer smaller statistics (fewer
+		 * columns), on the assumption that a smaller statistics probably
+		 * represents a larger fraction of the data (fewer combinations
+		 * with higher counts). But we also like if the statistics covers
+		 * some additional conditions at the baserel level, because this
+		 * may affect the data distribition. Of course, those two metrics
+		 * are contradictory - smaller stats are less likely to cover as
+		 * many conditions as a larger one.
+		 *
+		 * XXX For now we simply prefer smaller statistics, but maybe it
+		 * should be the other way around.
+		 */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * Now inspect the base restrictinfo conditions too. We need to be
+		 * more careful because we didn't check which of those clauses are
+		 * compatible, so we need to run statext_is_compatible_clause.
+		 */
+		conditions1 = statext_mcv_get_conditions(root, rel, stat);
+		conditions2 = statext_mcv_get_conditions(root, rel, mcv);
+
+		/* if the new statistics covers more conditions, use it */
+		if (list_length(conditions2) > list_length(conditions1))
+			mcv = stat;
+	}
+
+	return mcv;
+}
+
 /*
  * stat_find_expression
  *		Search for an expression in statistics object's list of expressions.
@@ -2603,3 +2697,846 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_mcv_get_conditions
+ *		Get conditions on base relations, to be used as conditions for joins.
+ *
+ * When estimating joins using extended statistics, we can apply conditions
+ * from base relations as conditions. This peeks at the baserestrictinfo
+ * list for a relation and extracts those that are compatible with extended
+ * statistics.
+ */
+static List *
+statext_mcv_get_conditions(PlannerInfo *root, RelOptInfo *rel,
+						   StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this partiular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_mcv_eval_conditions
+ *		Evaluate a list of conditions on the MCV lists.
+ *
+ * This returns a match bitmap for the conditions, which can be used later
+ * to restrict just the "interesting" part of the MCV lists. Also returns
+ * the selectivity of the conditions, or 1.0 if there are no conditions.
+ */
+static bool *
+statext_mcv_eval_conditions(PlannerInfo *root, RelOptInfo *rel,
+							StatisticExtInfo *stat, MCVList *mcv,
+							Selectivity *sel)
+{
+	List   *conditions;
+
+	/* everything matches by default */
+	*sel = 1.0;
+
+	/*
+	 * XXX We've already evaluated this before, when picking the statistics
+	 * object. Maybe we should stash it somewhere, so that we don't have to
+	 * evaluate it again.
+	 */
+	conditions = statext_mcv_get_conditions(root, rel, stat);
+
+	/* If no conditions, we're done. */
+	if (!conditions)
+		return NULL;
+
+	/* what's the selectivity of the conditions alone? */
+	*sel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+
+	return mcv_get_match_bitmap(root, conditions, stat->keys, stat->exprs,
+								mcv, false);
+}
+
+/*
+ * statext_ndistinct_estimate
+ *		Estimate number of distinct values in a list of clauses.
+ *
+ * This is used to extract expressions for a given relation from join clauses,
+ * so that we can estimate the number of distinct values in those expressions.
+ * That is needed for join cardinality estimation, similarly to what eqjoinsel
+ * does for regular coumns.
+ */
+static double
+statext_ndistinct_estimate(PlannerInfo *root, RelOptInfo *rel, List *clauses)
+{
+	ListCell *lc;
+
+	List *exprs = NIL;
+
+	foreach (lc, clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opexpr;
+
+		if (!is_opclause(clause))
+			continue;
+
+		opexpr = (OpExpr *) clause;
+
+		if (list_length(opexpr->args) != 2)
+			continue;
+
+		foreach (lc2, opexpr->args)
+		{
+			Node *expr = (Node *) lfirst(lc2);
+			Bitmapset *varnos = pull_varnos(root, expr);
+
+			if (bms_singleton_member(varnos) == rel->relid)
+				exprs = lappend(exprs, expr);
+		}
+	}
+
+	return estimate_num_groups(root, exprs, rel->rows, NULL, NULL);
+}
+
+/*
+ * statext_compare_mcvs
+ *		Calculte join selectivity using extended statistics, similarly to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing
+ * a conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column
+ * MCV lists it's obvious that the number of distinct values not covered
+ * by the MCV is (ndistinct - size(MCV)). With multi-column MCVs it's not
+ * that simple, particularly when the conditions are on a subset of the
+ * MCV and NULLs are involved. E.g. with MCV (a,b,c) and conditions on
+ * (a,b), it's not clear if the number of (a,b) combinations not covered
+ * by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the rest of the data. So we need to pick something
+ * in between, there's no way to calculate this accurately.
+ */
+static Selectivity
+statext_compare_mcvs(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	MCVList *mcv1;
+	MCVList *mcv2;
+	int		i, j;
+	Selectivity s = 0;
+
+	/* items eliminated by conditions (if any) */
+	bool   *conditions1 = NULL,
+		   *conditions2 = NULL;
+
+	double	conditions1_sel = 1.0,
+			conditions2_sel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	double	matchfreq1,
+			unmatchfreq1,
+			matchfreq2,
+			unmatchfreq2,
+			otherfreq1,
+			mcvfreq1,
+			otherfreq2,
+			mcvfreq2;
+
+	double	nd1,
+			nd2;
+
+	double	totalsel1,
+			totalsel2;
+
+	mcv1 = statext_mcv_load(stat1->statOid);
+	mcv2 = statext_mcv_load(stat2->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/* apply baserestrictinfo conditions on the MCV lists */
+
+	conditions1 = statext_mcv_eval_conditions(root, rel1, stat1, mcv1,
+											  &conditions1_sel);
+
+	conditions2 = statext_mcv_eval_conditions(root, rel2, stat2, mcv2,
+											  &conditions2_sel);
+
+	/*
+	 * Match items from the two MCV lits.
+	 *
+	 * We don't know if the matches are 1:1 - we may have overlap on only
+	 * a subset of attributes, e.g. (a,b,c) vs. (b,c,d), so there may be
+	 * multiple matches.
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (conditions1 && !conditions1[i])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			ListCell   *lc;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (conditions2 && !conditions2[j])
+				continue;
+
+			foreach (lc, clauses)
+			{
+				Node *clause = (Node *) lfirst(lc);
+				Bitmapset  *atts1 = NULL;
+				Bitmapset  *atts2 = NULL;
+				Datum		value1, value2;
+				int			index1, index2;
+				AttrNumber	attnum1;
+				AttrNumber	attnum2;
+				bool		match;
+
+				FmgrInfo	opproc;
+				OpExpr	   *expr = (OpExpr *) clause;
+
+				Assert(is_opclause(clause));
+
+				fmgr_info(get_opcode(expr->opno), &opproc);
+
+				/* determine the columns in each statistics object */
+
+				pull_varattnos(clause, rel1->relid, &atts1);
+				attnum1 = bms_singleton_member(atts1) + FirstLowInvalidHeapAttributeNumber;
+				index1 = bms_member_index(stat1->keys, attnum1);
+
+				pull_varattnos(clause, rel2->relid, &atts2);
+				attnum2 = bms_singleton_member(atts2) + FirstLowInvalidHeapAttributeNumber;
+				index2 = bms_member_index(stat2->keys, attnum2);
+
+				/* if either value is null, we're done */
+				if (mcv1->items[i].isnull[index1] || mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * FIXME Might have issues with order of parameters, but for
+					 * same-type equality that should not matter.
+					 * */
+					match = DatumGetBool(FunctionCall2Coll(&opproc,
+														   InvalidOid,
+														   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+			}
+
+			if (items_match)
+			{
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+
+				/* XXX Do we need to do something about base frequency? */
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		if (conditions1 && !conditions1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		if (conditions2 && !conditions2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1 - mcvfreq2;
+
+	/* correction for MCV parts eliminated by the conditions */
+	s = s * mcvfreq1 * mcvfreq2 / (matchfreq1 + unmatchfreq1) / (matchfreq2 + unmatchfreq2);
+
+	nd1 = statext_ndistinct_estimate(root, rel1, clauses);
+	nd2 = statext_ndistinct_estimate(root, rel2, clauses);
+
+	/*
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. Moreover, we need to look
+	 * at the conditions. So instead we simply assume the conditions
+	 * affect the distinct groups, and use that.
+	 */
+	nd1 *= conditions1_sel;
+	nd2 *= conditions2_sel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which
+ * may be estimated using extended statistics. Each side must reference
+ * just one relation for now.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+		listidx++;
+
+		/* skip estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * Collect relids from all usable clauses.
+		 *
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check
+	 * how compatible they are (e.g. that both have MCVs, etc.). Also,
+	 * maybe this should cross-check the exact pairs of rels with a join
+	 * clause between them? OTOH this is supposed to be a cheap check, so
+	 * maybe better leave that for later.
+	 *
+	 * XXX We could also check if there are enough parameters in each rel
+	 * to consider extended stats. If there's just a single attribute, it's
+	 * probably better to use just regular statistics. OTOH we can also
+	 * consider restriction clauses from baserestrictinfo and use them
+	 * to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about a join between two relations. It tracks relations being
+ * joined and the join clauses.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc.
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics covering matching the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but we're rejecting those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int	k;
+	int	relid;
+	RelOptInfo *rel;
+	ListCell *lc;
+
+	Bitmapset  *attnums = NULL;
+	List	   *exprs = NIL;
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		ListCell *lc;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/* XXX only handling case with MCV on both sides for now */
+		if (!stat1 || !stat2)
+			continue;
+
+		s *= statext_compare_mcvs(root, rel1, rel2, stat1, stat2, info[i].clauses);
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index ef118952c7..7a7d2c8834 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1602,7 +1602,7 @@ mcv_match_expression(Node *expr, Bitmapset *keys, List *exprs, Oid *collid)
  * & and |, which should be faster than min/max. The bitmaps are fairly
  * small, though (thanks to the cap on the MCV list size).
  */
-static bool *
+bool *
 mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 					 Bitmapset *keys, List *exprs,
 					 MCVList *mcvlist, bool is_or)
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 55cd9252a5..072085365c 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -127,4 +127,8 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern bool *mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
+								  Bitmapset *keys, List *exprs,
+								  MCVList *mcvlist, bool is_or);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 326cf26fea..8d890e6ce7 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -120,10 +120,21 @@ extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
 												  Bitmapset **estimatedclauses,
 												  bool is_or);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
+extern StatisticExtInfo *find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
 extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												Bitmapset **clause_attnums,
 												List **clause_exprs,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, int idx);
 
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */

Tomas Vondra

tomas.vondra@enterprisedb.com

over 4 years ago

In reply to: Tomas Vondra (#3)

1 attachment(s)

Re: using extended statistics to improve join estimates

Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

-- CREATE STATISTICS s1 ON (a+1), b FROM t1;
-- CREATE STATISTICS s2 ON (a+1), b FROM t2;

SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may impractical
(e.g. if one side does not have correlated columns it's silly to have to
add it just to make this patch happy).

For example you may have extended stats on the dimension table but not
the fact table, and the patch still can combine those two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have extended statistics MCV, or
when one side has extended MCV and the other side regular MCV. It might
seem silly, but the extended MCV allows considering additional baserel
conditions (if there are any).

examples
========

The table / data is very simple, but hopefully good enough for some
simple examples.

create table t1 (a int, b int, c int);
create table t2 (a int, b int, c int);

insert into t1 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

insert into t2 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=31.00..106.00 rows=400 width=24)
(actual time=5.426..22.678 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=28.50..160.75 rows=10000 width=24)
(actual time=5.325..19.760 rows=10000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=24.50..104.75 rows=4800 width=24)
(actual time=5.618..5.639 rows=0 loops=1)

Now, let's create statistics:

create statistics s1 on a, b, c from t1 ;
create statistics s2 on a, b, c from t2 ;
analyze t1, t2;

and now the same queries again:

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=31.00..106.00 rows=20000 width=24)
(actual time=5.448..22.713 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=28.50..160.75 rows=10000 width=24)
(actual time=5.317..16.680 rows=10000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=24.50..104.75 rows=1 width=24)
(actual time=5.647..5.667 rows=0 loops=1)

Those examples are a bit simplistic, but the improvements are fairly
clear I think.

limitations & open issues
=========================

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make selection of statistics much more
complicated, because we have to pick "matching" statistics on both sides
of the join. So it seems like an overkill, and most joins have very few
conditions anyway.

2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported

t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))

t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))

I'm not entirely sure these restrictions can be relaxed. It's not that
difficult to evaluate these cases when matching items between the MCV
lists, similarly to how we evaluate bitmaps for baserel estimates.

But I'm not sure what to do about the part not covered by the MCV lists.
The eqjoinsel() approach uses ndistinct estimates for that, but that
only works for AND clauses, I think. How would that work for OR?

Similarly, I'm not sure we can do much for non-equality conditions, but
those are currently estimated as default selectivity in selfuncs.c.

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

explain analyze
select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those. I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit
artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Estimate-joins-using-extended-statistics-20211006.patchtext/x-patch; charset=UTF-8; name=0001-Estimate-joins-using-extended-statistics-20211006.patchDownload

From ed7c4612abf4aa209c6d6fcb14e68bedf3cff7e6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Tue, 5 Oct 2021 02:10:27 +0200
Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++++++++++++++++++
 src/backend/statistics/mcv.c                  | 754 ++++++++++++++++
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 168 ++++
 src/test/regress/sql/stats_ext.sql            |  64 ++
 7 files changed, 1885 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..709e92446b 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
@@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutualy exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 69ca52094f..f8cc342b7e 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,8 @@ static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -2608,3 +2611,805 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_find_matching_mcv
+ *		Search for a MCV covering all the attributes and expressions.
+ *
+ * We pick the statistics to use for join estimation. The statistics has to
+ * be an MCV, and we require it to match all the join conditions, because it
+ * makes the estimation simpler.
+ *
+ * If there are multiple candidate statistics (matching all join clauses),
+ * we pick the smallest one, and we also consider additional conditions on
+ * the base relations to restrict the MCV items used for estimation (using
+ * conditional probability).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. We could relax this and and require fewer matches (at least two,
+ * if counting the additional conditions), and we might even apply multiple
+ * statistics etc. But that would require matching statistics on both sides of
+ * the join, while now we simply know the statistics match. We don't really
+ * expect many candidate MCVs, so this simple approach seems sufficient. And
+ * the joins usually use only one or two columns, so there's not much room
+ * for applying multiple statistics anyway.
+ */
+StatisticExtInfo *
+statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+						  Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep this one. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics and we need to decide which one
+		 * to keep. We'll use two simple heuristics:
+		 *
+		 * (a) We prefer smaller statistics (fewer columns), on the assumption
+		 * that it represents a larger fraction of the data (due to having fewer
+		 * combinations with higher counts).
+		 *
+		 * (b) If the statistics covers some additional conditions for the rels,
+		 * that may help with considering additional dependencies between the
+		 * tables.
+		 *
+		 * Of course, those two heuristict are somewhat contradictory - smaller
+		 * stats are less likely to cover as many conditions as a larger one. We
+		 * consider the additional conditions first - if someone created such
+		 * statistics, there probably is a dependency worth considering.
+		 *
+		 * When inspecting the restrictions, we need to be careful - we don't
+		 * know which of them are compatible with extended stats, so we have to
+		 * run them through statext_is_compatible_clause first and then match
+		 * them.to the statistics.
+		 *
+		 * XXX Maybe we shouldn't pick statistics that covers just a single join
+		 * clause, without any additional conditions. In such case we could just
+		 * as well pick regular statistics for the column/expression, but it's
+		 * not clear if that actually exists (so we might reject the stats here
+		 * and then fail to find something simpler/better).
+		 */
+		conditions1 = statext_determine_join_restrictions(root, rel, stat);
+		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+		/* if the new statistics covers more conditions, use it */
+		if (list_length(conditions2) > list_length(conditions1))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/* The statistics seem about equal, so just use the smaller one. */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+		}
+	}
+
+	return mcv;
+}
+
+/*
+ * statext_determine_join_restrictions
+ *		Get restrictions on base relation, covered by the statistics.
+ *
+ * Returns a list of baserel restrictinfos, compatible with extended statistics
+ * and covered by the extended statistics.
+ *
+ * When using extended statistics to estimate joins, we can use conditions
+ * from base relations to calculate conditional probability
+ *
+ *    P(join clauses | baserel restrictions)
+ *
+ * which should be a better estimate than just P(join clauses). We want to pick
+ * the statistics covering the most such conditions.
+ */
+List *
+statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
+									StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this partiular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which may be
+ * estimated using extended statistics. Each side must reference just a single
+ * relation for now.
+ *
+ * Similar to treat_as_join_clause, but we place additional restrictions
+ * on the conditions.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	/* has to be an equality condition */
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/*
+	 * Check if there are any unestimated join clauses, collect relids.
+	 *
+	 * XXX Currently this only allows simple OpExpr equality clauses with each
+	 * argument refering to single relation, AND-ed together. Maybe we could
+	 * relax this in the future, e.g. to allow more complex (deeper) expressions
+	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 *
+	 * Handling more complex expressions seems simple - we already do that for
+	 * baserel estimates by building the match bitmap recursively, and we could
+	 * do something similar for combinations of MCV items (a bit like building
+	 * a single bit in the match bitmap). The challenge is what to do about the
+	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 */
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+
+		/* needs to happen before skipping any clauses */
+		listidx++;
+
+		/* Skip clauses that were already estimated. */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+
+		/* Collect relids from all usable clauses. */
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check how
+	 * compatible they are (e.g. that both have MCVs, etc.). We might also
+	 * cross-check the exact joined pairs of rels, but it's is supposed to be
+	 * a cheap check, so maybe better leave that for later.
+	 *
+	 * XXX We could also check the number of parameters in each rel to consider
+	 * extended stats. If there's just a single attribute, it's pointless to use
+	 * extended statistics. OTOH we can also consider restriction clauses from
+	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about two joined relations, along with the join clauses between.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc.
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics covering the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int			k;
+	int			relid;
+	RelOptInfo *rel;
+	ListCell   *lc;
+	List	   *exprs = NIL;
+
+	Bitmapset  *attnums = NULL;
+
+	Assert((index >= 0) && (index <= 1));
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * get_expression_for_rel
+ *		Extract expression for a given relation from the join clause.
+ *
+ * Given a join clause supported by the extended statistics (currently that
+ * means just OpExpr clauses with each argument referencing single rel),
+ * return either the left or right argument expression for the rel.
+ *
+ * XXX This should probably return a flag identifying whether it's the
+ * left or right argument.
+ */
+static Node *
+get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
+{
+	OpExpr *opexpr;
+	Node   *expr;
+
+	/*
+	 * Strip the RestrictInfo node, get the actual clause.
+	 *
+	 * XXX Not sure if we need to care about removing other node types
+	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+	 * matches this, but maybe we need to relax it?
+	 */
+	if (IsA(clause, RestrictInfo))
+		clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+	opexpr = (OpExpr *) clause;
+
+	/* Make sure we have the expected node type. */
+	Assert(is_opclause(clause));
+	Assert(list_length(opexpr->args) == 2);
+
+	/* FIXME strip relabel etc. the way examine_opclause_args does */
+	expr = linitial(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	expr = lsecond(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	return NULL;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX We might also handle cases with a regular MCV on one side and
+	 * an extended MCV on the other side.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		ListCell *lc;
+
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/*
+		 * We can handle three basic cases:
+		 *
+		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
+		 * can simply combine the two MCVs, possibly with additional conditions
+		 * from the relations.
+		 *
+		 * b) Extended stats on one side, regular MCV on the other side (this
+		 * means there's just one join clause / expression). It also means the
+		 * extended stats likely covers at least one extra condition, otherwise
+		 * we could just use regular statistics. We can combine the stats just
+		 * similarly to (a).
+		 *
+		 * c) No extended stats with MCV. If there are multiple join clauses,
+		 * we can try using ndistinct coefficients and do what esjoinsel does.
+		 *
+		 * If none of these applies, we fallback to the regular selectivity
+		 * estimation in eqjoinsel.
+		 */
+		if (stat1 && stat2)
+		{
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+		}
+		else if (stat1 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel2, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel1, stat1, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else if (stat2 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = (Node *) linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel1, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel2, stat2, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else
+			continue;
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 35b39ece07..ca5383d9bc 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -24,6 +24,7 @@
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
+#include "optimizer/optimizer.h"
 #include "statistics/extended_stats_internal.h"
 #include "statistics/statistics.h"
 #include "utils/array.h"
@@ -2157,3 +2158,756 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	return s;
 }
+
+/*
+ * statext_compare_mcvs
+ *		Calculte join selectivity using extended statistics, similarly to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing a
+ * conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column MCV
+ * lists it's obvious that the number of distinct values not covered by the MCV
+ * is (ndistinct - size(MCV)). With multi-column MCVs it's not that simple,
+ * particularly when the conditions are on a subset of the MCV attributes and/or
+ * NULLs are involved. E.g. with MCV (a,b,c) and conditions on (a,b), it's not
+ * clear if the number of (a,b) combinations not covered by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the part of the data not covered by the MCV. So we need
+ * to pick something in between, there's no way to calculate this accurately.
+ */
+Selectivity
+mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	ListCell   *lc;
+
+	MCVList    *mcv1,
+			   *mcv2;
+	int			idx,
+				i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	List   *conditions1 = NIL,
+		   *conditions2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
+
+	double	csel1 = 1.0,
+			csel2 = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two relations */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo   *opprocs;
+	int		   *indexes1,
+			   *indexes2;
+	bool	   *reverse;
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert((stat1->kind = STATS_EXT_MCV) && (stat2->kind = STATS_EXT_MCV));
+
+	mcv1 = statext_mcv_load(stat1->statOid);
+	mcv2 = statext_mcv_load(stat2->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
+	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions1)
+	{
+		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+										 stat1->keys, stat1->exprs,
+										 mcv1, false);
+		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+	}
+
+	if (conditions2)
+	{
+		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+										 stat2->keys, stat2->exprs,
+										 mcv2, false);
+		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/*
+	 * Initialize information about clauses and how they match to the MCV
+	 * stats we picked. We do this only once before processing the lists,
+	 * so that we don't have to do that for each MCV item or so.
+	 */
+	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
+	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
+	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
+	idx = 0;
+	foreach (lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if ((bms_singleton_member(relids1) == rel1->relid) &&
+			(bms_singleton_member(relids2) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			reverse[idx] = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if ((bms_singleton_member(relids2) == rel1->relid) &&
+				 (bms_singleton_member(relids1) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			reverse[idx] = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((indexes1[idx] >= 0) &&
+			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+
+		Assert((indexes2[idx] >= 0) &&
+			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		idx++;
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset  of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform somthing like
+	 * merge join. Or we might calculate hash from the join columns, and then
+	 * compare this (to eliminate most expensive equality functions).
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		bool	has_nulls;
+
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		/*
+		 * Check if any value in the first MCV item is NULL, because it'll be
+		 * mismatch anyway.
+		 *
+		 * XXX This might not work for some join clauses, e.g. IS NOT DISTINCT
+		 * FROM, but those are currently not considered compatible (we only
+		 * allow OpExpr at the moment).
+		 */
+		has_nulls = false;
+		for (j = 0; j < list_length(clauses); j++)
+			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+
+		if (has_nulls)
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			int			idx;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (cmatches2 && !cmatches2[j])
+				continue;
+
+			/*
+			 * XXX We can't skip based on existing matches2 value, because there
+			 * may be duplicates in the first MCV.
+			 */
+
+			/*
+			 * Evaluate if all the join clauses match between the two MCV items.
+			 *
+			 * XXX We might optimize the order of evaluation, using the costs of
+			 * operator functions for individual columns. It does depend on the
+			 * number of distinct values, etc.
+			 */
+			idx = 0;
+			foreach (lc, clauses)
+			{
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+
+				/* If either value is null, it's a mismatch */
+				if (mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * Careful about order of parameters. For same-type equality
+					 * that should not matter, but easy enough.
+					 *
+					 * FIXME Use appropriate collation.
+					 */
+					if (reverse_args)
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value2, value1));
+					else
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+
+				idx++;
+			}
+
+			if (items_match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches2 && !cmatches2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel1->rows, NULL, NULL);
+	nd2 = estimate_num_groups(root, exprs2, rel2->rows, NULL, NULL);
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel1;
+	nd2 *= csel2;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+
+/*
+ * statext_compare_simple
+ *		Calculte join selectivity using a combination of extended statistics
+ * MCV on one side, and simple per-column MCV on the other.
+ *
+ * Most of the mcv_combine_extended comment applies here too, but we can make
+ * some simplifications because we know the second (per-column) MCV is simpler,
+ * contains no NULL or duplicate values, etc.
+ */
+Selectivity
+mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
+				   AttStatsSlot *sslot, double stanullfrac, double nd,
+				   bool isdefault, Node *clause)
+{
+	MCVList    *mcv;
+	int			i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *conditions = NIL;
+	bool   *cmatches = NULL;
+
+	double	csel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two sides */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo	opproc;
+	int			index;
+	bool		reverse;
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert(stat->kind = STATS_EXT_MCV);
+
+	mcv = statext_mcv_load(stat->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions = statext_determine_join_restrictions(root, rel, stat);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions)
+	{
+		cmatches = mcv_get_match_bitmap(root, conditions,
+										 stat->keys, stat->exprs,
+										 mcv, false);
+		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
+
+	/* Matches for the side with just regular single-column MCV. */
+	matches2 = (bool *) palloc0(sizeof(bool) * sslot->nvalues);
+
+	/*
+	 * Initialize information about the clause and how it matches to the
+	 * extended stats we picked. We do this only once before processing
+	 * the lists, so that we don't have to do that for each item or so.
+	 */
+	{
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opproc);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if (bms_singleton_member(relids1) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
+										 &collid);
+			reverse = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if (bms_singleton_member(relids2) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
+										 &collid);
+			reverse = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((index >= 0) &&
+			   (index < bms_num_members(stat->keys) + list_length(stat->exprs)));
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset  of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform somthing like
+	 * merge join. Or we might calculate hash from the join columns, and then
+	 * compare this (to eliminate most expensive equality functions).
+	 */
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		/*
+		 * We can check mcv1->items[i].isnull[index1] here, because it'll be a
+		 * mismatch anyway (the simple MCV does not contain NULLs).
+		 */
+		if (mcv->items[i].isnull[index])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < sslot->nvalues; j++)
+		{
+			bool	match;
+			Datum	value1 = mcv->items[i].values[index];
+			Datum	value2 = sslot->values[j];
+
+			/*
+			 * Evaluate the join clause between the two MCV lists. We don't
+			 * need to deal with NULL values here - we've already checked for
+			 * NULL in the extended statistics earlier, and the simple MCV
+			 * does not contain NULL values.
+			*
+			 * Careful about order of parameters. For same-type equality
+			 * that should not matter, but easy enough.
+			 *
+			 * FIXME Use appropriate collation.
+			 */
+			if (reverse)
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value2, value1));
+			else
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value1, value2));
+
+			if (match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv->items[i].frequency * sslot->numbers[j];
+
+				/*
+				 * We know there can be just a single match in the regular
+				 * MCV list, so we can abort the inner loop.
+				 */
+				break;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		mcvfreq1 += mcv->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv->items[i].frequency;
+		else
+			unmatchfreq1 += mcv->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < sslot->nvalues; i++)
+	{
+		mcvfreq2 += sslot->numbers[i];
+
+		if (matches2[i])
+			matchfreq2 += sslot->numbers[i];
+		else
+			unmatchfreq2 += sslot->numbers[i];
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel->rows, NULL, NULL);
+	nd2 = nd;
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 55cd9252a5..1e51c54fef 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,6 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
+#include "utils/lsyscache.h"
 #include "utils/sortsupport.h"
 
 typedef struct
@@ -127,4 +128,23 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern Selectivity mcv_combine_simple(PlannerInfo *root,
+									  RelOptInfo *rel,
+									  StatisticExtInfo *stat,
+									  AttStatsSlot *sslot,
+									  double stanullfrac,
+									  double nd, bool isdefault,
+									  Node *clause);
+
+extern Selectivity mcv_combine_extended(PlannerInfo *root,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										StatisticExtInfo *stat1,
+										StatisticExtInfo *stat2,
+										List *clauses);
+
+extern List *statext_determine_join_restrictions(PlannerInfo *root,
+												 RelOptInfo *rel,
+												 StatisticExtInfo *info);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 326cf26fea..4bf27240f6 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -126,4 +126,16 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, int idx);
 
+extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
+
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index c60ba45aba..634f01cc56 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -2974,6 +2974,174 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+       250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+        75 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+      1250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+      1000 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+    100000 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+     30000 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ERROR:  statistics object "join_stats_2" already exists
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 6fb37962a7..42ae750b2d 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1500,6 +1500,70 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.31.1

Zhihong Yu

zyu@yugabyte.com

over 4 years ago

In reply to: Tomas Vondra (#4)

Re: using extended statistics to improve join estimates

On Wed, Oct 6, 2021 at 12:33 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

-- CREATE STATISTICS s1 ON (a+1), b FROM t1;
-- CREATE STATISTICS s2 ON (a+1), b FROM t2;

SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may impractical
(e.g. if one side does not have correlated columns it's silly to have to
add it just to make this patch happy).

For example you may have extended stats on the dimension table but not
the fact table, and the patch still can combine those two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have extended statistics MCV, or
when one side has extended MCV and the other side regular MCV. It might
seem silly, but the extended MCV allows considering additional baserel
conditions (if there are any).

examples
========

The table / data is very simple, but hopefully good enough for some
simple examples.

create table t1 (a int, b int, c int);
create table t2 (a int, b int, c int);

insert into t1 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

insert into t2 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=31.00..106.00 rows=400 width=24)
(actual time=5.426..22.678 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=28.50..160.75 rows=10000 width=24)
(actual time=5.325..19.760 rows=10000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=24.50..104.75 rows=4800 width=24)
(actual time=5.618..5.639 rows=0 loops=1)

Now, let's create statistics:

create statistics s1 on a, b, c from t1 ;
create statistics s2 on a, b, c from t2 ;
analyze t1, t2;

and now the same queries again:

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=31.00..106.00 rows=20000 width=24)
(actual time=5.448..22.713 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=28.50..160.75 rows=10000 width=24)
(actual time=5.317..16.680 rows=10000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=24.50..104.75 rows=1 width=24)
(actual time=5.647..5.667 rows=0 loops=1)

Those examples are a bit simplistic, but the improvements are fairly
clear I think.

limitations & open issues
=========================

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make selection of statistics much more
complicated, because we have to pick "matching" statistics on both sides
of the join. So it seems like an overkill, and most joins have very few
conditions anyway.

2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported

t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))

t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))

I'm not entirely sure these restrictions can be relaxed. It's not that
difficult to evaluate these cases when matching items between the MCV
lists, similarly to how we evaluate bitmaps for baserel estimates.

But I'm not sure what to do about the part not covered by the MCV lists.
The eqjoinsel() approach uses ndistinct estimates for that, but that
only works for AND clauses, I think. How would that work for OR?

Similarly, I'm not sure we can do much for non-equality conditions, but
those are currently estimated as default selectivity in selfuncs.c.

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

explain analyze
select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those. I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit
artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Hi,

+       conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+       /* if the new statistics covers more conditions, use it */
+       if (list_length(conditions2) > list_length(conditions1))
+       {
+           mcv = stat;

It seems conditions2 is calculated using mcv, I wonder why mcv is replaced
by stat (for conditions1 whose length is shorter) ?

Cheers

Tomas Vondra

tomas.vondra@enterprisedb.com

over 4 years ago

In reply to: Zhihong Yu (#5)

Re: using extended statistics to improve join estimates

On 10/6/21 23:03, Zhihong Yu wrote:

Hi,
+       conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+       /* if the new statistics covers more conditions, use it */
+       if (list_length(conditions2) > list_length(conditions1))
+       {
+           mcv = stat;
It seems conditions2 is calculated using mcv, I wonder why mcv is
replaced by stat (for conditions1 whose length is shorter) ?

Yeah, that's wrong - it should be the other way around, i.e.

if (list_length(conditions1) > list_length(conditions2))

There's no test with multiple candidate statistics yet, so this went
unnoticed :-/

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Andy Fan

zhihui.fan1213@gmail.com

about 4 years ago

In reply to: Tomas Vondra (#4)

Re: using extended statistics to improve join estimates

Hi Tomas:

This is the exact patch I want, thanks for the patch!

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

explain analyze
select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those.

Actually I can't understand how this works even for a simpler example.
let's say we query like this (ONLY use t2's column to join t3).

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when comes
to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel, so it
is hard to
work. Here I see your solution is splitting it into (t1, t2) AND (t2,
t3) and estimate
those. But how does this help to estimate the size of JoinRel(t1, t2, t3)?

I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

I guess we can keep the MCVs on joinrel for these matches. Take the above
query I provided for example, and suppose the MCV data as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below

(1, 2, 1, 2) -> 0.1 * 0.2
(1, 3, 1, 3) -> 0.2 * 0.1

And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1)

With this design, the nitems of MCV on joinrel would be less than
either of baserel.

and since we handle the eqjoin as well, we even can record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;

About when we should maintain the JoinRel's MCV data, rather than
maintain this just
after the JoinRel size is estimated, we can only estimate it when it
is needed. for example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3) since no others
need it at all. However
I don't check code too closely to see if it (Lazing computing MVC on
joinrel) is convenient
to do.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit
artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.

Overall, thanks for the feature and I am expecting there are more cases
to handle during discussion. To make the review process more efficient,
I suggest that we split the patch into smaller ones and review/commit them
separately if we have finalized the design roughly . For example:

Patch 1 -- required both sides to have extended statistics.
Patch 2 -- required one side to have extended statistics and the other side had
per-column MCV.
Patch 3 -- handle the case like WHERE t1.a = t2.a and t1.b = Const;
Patch 3 -- handle the case for 3+ table joins.
Patch 4 -- supports the outer join.

I think we can do this if we are sure that each individual patch would work in
some cases and would not make any other case worse. If you agree with this,
I can do that splitting work during my review process.

--
Best Regards
Andy Fan (https://www.aliyun.com/)

Justin Pryzby

pryzby@telsasoft.com

about 4 years ago

In reply to: Tomas Vondra (#4)

Re: using extended statistics to improve join estimates

Your regression tests include two errors, which appear to be accidental, and
fixing the error shows that this case is being estimated poorly.

+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
...
+ERROR:  statistics object "join_stats_2" already exists

--
Justin

Tomas Vondra

tomas.vondra@enterprisedb.com

about 4 years ago

In reply to: Justin Pryzby (#8)

1 attachment(s)

Re: using extended statistics to improve join estimates

On 11/22/21 02:23, Justin Pryzby wrote:

Your regression tests include two errors, which appear to be accidental, and
fixing the error shows that this case is being estimated poorly.
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
...
+ERROR:  statistics object "join_stats_2" already exists

D'oh, what a silly mistake ...

You're right fixing the DROP STATISTICS results in worse estimate, but
that's actually expected for a fairly simple reason. The join condition
has expressions on both sides, and dropping the statistics means we
don't have any MCV for the join_test_2 side. So the optimizer ends up
not using the regular estimates, as if there were no extended stats.

A couple lines later the script creates an extended statistics on that
expression alone, which fixes this. An expression index would do the
trick too.

Attached is a patch fixing the test and also the issue reported by
Zhihong Yu some time ago.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Estimate-joins-using-extended-statistics-20211213.patchtext/x-patch; charset=UTF-8; name=0001-Estimate-joins-using-extended-statistics-20211213.patchDownload

From 616f7f3faa818ea89c4c1cecb9aa50dd9e4fe8e7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++++++++++++++++++
 src/backend/statistics/mcv.c                  | 754 ++++++++++++++++
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 167 ++++
 src/test/regress/sql/stats_ext.sql            |  66 ++
 7 files changed, 1886 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..709e92446b 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
@@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutualy exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 69ca52094f..ce6a62d944 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,8 @@ static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -2608,3 +2611,805 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_find_matching_mcv
+ *		Search for a MCV covering all the attributes and expressions.
+ *
+ * We pick the statistics to use for join estimation. The statistics has to
+ * be an MCV, and we require it to match all the join conditions, because it
+ * makes the estimation simpler.
+ *
+ * If there are multiple candidate statistics (matching all join clauses),
+ * we pick the smallest one, and we also consider additional conditions on
+ * the base relations to restrict the MCV items used for estimation (using
+ * conditional probability).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. We could relax this and and require fewer matches (at least two,
+ * if counting the additional conditions), and we might even apply multiple
+ * statistics etc. But that would require matching statistics on both sides of
+ * the join, while now we simply know the statistics match. We don't really
+ * expect many candidate MCVs, so this simple approach seems sufficient. And
+ * the joins usually use only one or two columns, so there's not much room
+ * for applying multiple statistics anyway.
+ */
+StatisticExtInfo *
+statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+						  Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep this one. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics and we need to decide which one
+		 * to keep. We'll use two simple heuristics:
+		 *
+		 * (a) We prefer smaller statistics (fewer columns), on the assumption
+		 * that it represents a larger fraction of the data (due to having fewer
+		 * combinations with higher counts).
+		 *
+		 * (b) If the statistics covers some additional conditions for the rels,
+		 * that may help with considering additional dependencies between the
+		 * tables.
+		 *
+		 * Of course, those two heuristict are somewhat contradictory - smaller
+		 * stats are less likely to cover as many conditions as a larger one. We
+		 * consider the additional conditions first - if someone created such
+		 * statistics, there probably is a dependency worth considering.
+		 *
+		 * When inspecting the restrictions, we need to be careful - we don't
+		 * know which of them are compatible with extended stats, so we have to
+		 * run them through statext_is_compatible_clause first and then match
+		 * them.to the statistics.
+		 *
+		 * XXX Maybe we shouldn't pick statistics that covers just a single join
+		 * clause, without any additional conditions. In such case we could just
+		 * as well pick regular statistics for the column/expression, but it's
+		 * not clear if that actually exists (so we might reject the stats here
+		 * and then fail to find something simpler/better).
+		 */
+		conditions1 = statext_determine_join_restrictions(root, rel, stat);
+		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+		/* if the new statistics covers more conditions, use it */
+		if (list_length(conditions1) > list_length(conditions2))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/* The statistics seem about equal, so just use the smaller one. */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+		}
+	}
+
+	return mcv;
+}
+
+/*
+ * statext_determine_join_restrictions
+ *		Get restrictions on base relation, covered by the statistics.
+ *
+ * Returns a list of baserel restrictinfos, compatible with extended statistics
+ * and covered by the extended statistics.
+ *
+ * When using extended statistics to estimate joins, we can use conditions
+ * from base relations to calculate conditional probability
+ *
+ *    P(join clauses | baserel restrictions)
+ *
+ * which should be a better estimate than just P(join clauses). We want to pick
+ * the statistics covering the most such conditions.
+ */
+List *
+statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
+									StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this partiular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which may be
+ * estimated using extended statistics. Each side must reference just a single
+ * relation for now.
+ *
+ * Similar to treat_as_join_clause, but we place additional restrictions
+ * on the conditions.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	/* has to be an equality condition */
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/*
+	 * Check if there are any unestimated join clauses, collect relids.
+	 *
+	 * XXX Currently this only allows simple OpExpr equality clauses with each
+	 * argument refering to single relation, AND-ed together. Maybe we could
+	 * relax this in the future, e.g. to allow more complex (deeper) expressions
+	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 *
+	 * Handling more complex expressions seems simple - we already do that for
+	 * baserel estimates by building the match bitmap recursively, and we could
+	 * do something similar for combinations of MCV items (a bit like building
+	 * a single bit in the match bitmap). The challenge is what to do about the
+	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 */
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+
+		/* needs to happen before skipping any clauses */
+		listidx++;
+
+		/* Skip clauses that were already estimated. */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+
+		/* Collect relids from all usable clauses. */
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check how
+	 * compatible they are (e.g. that both have MCVs, etc.). We might also
+	 * cross-check the exact joined pairs of rels, but it's is supposed to be
+	 * a cheap check, so maybe better leave that for later.
+	 *
+	 * XXX We could also check the number of parameters in each rel to consider
+	 * extended stats. If there's just a single attribute, it's pointless to use
+	 * extended statistics. OTOH we can also consider restriction clauses from
+	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about two joined relations, along with the join clauses between.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc.
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics covering the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int			k;
+	int			relid;
+	RelOptInfo *rel;
+	ListCell   *lc;
+	List	   *exprs = NIL;
+
+	Bitmapset  *attnums = NULL;
+
+	Assert((index >= 0) && (index <= 1));
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * get_expression_for_rel
+ *		Extract expression for a given relation from the join clause.
+ *
+ * Given a join clause supported by the extended statistics (currently that
+ * means just OpExpr clauses with each argument referencing single rel),
+ * return either the left or right argument expression for the rel.
+ *
+ * XXX This should probably return a flag identifying whether it's the
+ * left or right argument.
+ */
+static Node *
+get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
+{
+	OpExpr *opexpr;
+	Node   *expr;
+
+	/*
+	 * Strip the RestrictInfo node, get the actual clause.
+	 *
+	 * XXX Not sure if we need to care about removing other node types
+	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+	 * matches this, but maybe we need to relax it?
+	 */
+	if (IsA(clause, RestrictInfo))
+		clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+	opexpr = (OpExpr *) clause;
+
+	/* Make sure we have the expected node type. */
+	Assert(is_opclause(clause));
+	Assert(list_length(opexpr->args) == 2);
+
+	/* FIXME strip relabel etc. the way examine_opclause_args does */
+	expr = linitial(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	expr = lsecond(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	return NULL;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX We might also handle cases with a regular MCV on one side and
+	 * an extended MCV on the other side.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		ListCell *lc;
+
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/*
+		 * We can handle three basic cases:
+		 *
+		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
+		 * can simply combine the two MCVs, possibly with additional conditions
+		 * from the relations.
+		 *
+		 * b) Extended stats on one side, regular MCV on the other side (this
+		 * means there's just one join clause / expression). It also means the
+		 * extended stats likely covers at least one extra condition, otherwise
+		 * we could just use regular statistics. We can combine the stats just
+		 * similarly to (a).
+		 *
+		 * c) No extended stats with MCV. If there are multiple join clauses,
+		 * we can try using ndistinct coefficients and do what esjoinsel does.
+		 *
+		 * If none of these applies, we fallback to the regular selectivity
+		 * estimation in eqjoinsel.
+		 */
+		if (stat1 && stat2)
+		{
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+		}
+		else if (stat1 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel2, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel1, stat1, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else if (stat2 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = (Node *) linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel1, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel2, stat2, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else
+			continue;
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index b350fc5f7b..b0e877c92e 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -24,6 +24,7 @@
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
+#include "optimizer/optimizer.h"
 #include "statistics/extended_stats_internal.h"
 #include "statistics/statistics.h"
 #include "utils/array.h"
@@ -2156,3 +2157,756 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	return s;
 }
+
+/*
+ * statext_compare_mcvs
+ *		Calculte join selectivity using extended statistics, similarly to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing a
+ * conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column MCV
+ * lists it's obvious that the number of distinct values not covered by the MCV
+ * is (ndistinct - size(MCV)). With multi-column MCVs it's not that simple,
+ * particularly when the conditions are on a subset of the MCV attributes and/or
+ * NULLs are involved. E.g. with MCV (a,b,c) and conditions on (a,b), it's not
+ * clear if the number of (a,b) combinations not covered by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the part of the data not covered by the MCV. So we need
+ * to pick something in between, there's no way to calculate this accurately.
+ */
+Selectivity
+mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	ListCell   *lc;
+
+	MCVList    *mcv1,
+			   *mcv2;
+	int			idx,
+				i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	List   *conditions1 = NIL,
+		   *conditions2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
+
+	double	csel1 = 1.0,
+			csel2 = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two relations */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo   *opprocs;
+	int		   *indexes1,
+			   *indexes2;
+	bool	   *reverse;
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert((stat1->kind = STATS_EXT_MCV) && (stat2->kind = STATS_EXT_MCV));
+
+	mcv1 = statext_mcv_load(stat1->statOid);
+	mcv2 = statext_mcv_load(stat2->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
+	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions1)
+	{
+		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+										 stat1->keys, stat1->exprs,
+										 mcv1, false);
+		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+	}
+
+	if (conditions2)
+	{
+		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+										 stat2->keys, stat2->exprs,
+										 mcv2, false);
+		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/*
+	 * Initialize information about clauses and how they match to the MCV
+	 * stats we picked. We do this only once before processing the lists,
+	 * so that we don't have to do that for each MCV item or so.
+	 */
+	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
+	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
+	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
+	idx = 0;
+	foreach (lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if ((bms_singleton_member(relids1) == rel1->relid) &&
+			(bms_singleton_member(relids2) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			reverse[idx] = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if ((bms_singleton_member(relids2) == rel1->relid) &&
+				 (bms_singleton_member(relids1) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			reverse[idx] = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((indexes1[idx] >= 0) &&
+			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+
+		Assert((indexes2[idx] >= 0) &&
+			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		idx++;
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset  of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform somthing like
+	 * merge join. Or we might calculate hash from the join columns, and then
+	 * compare this (to eliminate most expensive equality functions).
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		bool	has_nulls;
+
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		/*
+		 * Check if any value in the first MCV item is NULL, because it'll be
+		 * mismatch anyway.
+		 *
+		 * XXX This might not work for some join clauses, e.g. IS NOT DISTINCT
+		 * FROM, but those are currently not considered compatible (we only
+		 * allow OpExpr at the moment).
+		 */
+		has_nulls = false;
+		for (j = 0; j < list_length(clauses); j++)
+			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+
+		if (has_nulls)
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			int			idx;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (cmatches2 && !cmatches2[j])
+				continue;
+
+			/*
+			 * XXX We can't skip based on existing matches2 value, because there
+			 * may be duplicates in the first MCV.
+			 */
+
+			/*
+			 * Evaluate if all the join clauses match between the two MCV items.
+			 *
+			 * XXX We might optimize the order of evaluation, using the costs of
+			 * operator functions for individual columns. It does depend on the
+			 * number of distinct values, etc.
+			 */
+			idx = 0;
+			foreach (lc, clauses)
+			{
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+
+				/* If either value is null, it's a mismatch */
+				if (mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * Careful about order of parameters. For same-type equality
+					 * that should not matter, but easy enough.
+					 *
+					 * FIXME Use appropriate collation.
+					 */
+					if (reverse_args)
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value2, value1));
+					else
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+
+				idx++;
+			}
+
+			if (items_match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches2 && !cmatches2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel1->rows, NULL, NULL);
+	nd2 = estimate_num_groups(root, exprs2, rel2->rows, NULL, NULL);
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel1;
+	nd2 *= csel2;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+
+/*
+ * statext_compare_simple
+ *		Calculte join selectivity using a combination of extended statistics
+ * MCV on one side, and simple per-column MCV on the other.
+ *
+ * Most of the mcv_combine_extended comment applies here too, but we can make
+ * some simplifications because we know the second (per-column) MCV is simpler,
+ * contains no NULL or duplicate values, etc.
+ */
+Selectivity
+mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
+				   AttStatsSlot *sslot, double stanullfrac, double nd,
+				   bool isdefault, Node *clause)
+{
+	MCVList    *mcv;
+	int			i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *conditions = NIL;
+	bool   *cmatches = NULL;
+
+	double	csel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two sides */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo	opproc;
+	int			index;
+	bool		reverse;
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert(stat->kind = STATS_EXT_MCV);
+
+	mcv = statext_mcv_load(stat->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions = statext_determine_join_restrictions(root, rel, stat);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions)
+	{
+		cmatches = mcv_get_match_bitmap(root, conditions,
+										 stat->keys, stat->exprs,
+										 mcv, false);
+		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
+
+	/* Matches for the side with just regular single-column MCV. */
+	matches2 = (bool *) palloc0(sizeof(bool) * sslot->nvalues);
+
+	/*
+	 * Initialize information about the clause and how it matches to the
+	 * extended stats we picked. We do this only once before processing
+	 * the lists, so that we don't have to do that for each item or so.
+	 */
+	{
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opproc);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if (bms_singleton_member(relids1) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
+										 &collid);
+			reverse = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if (bms_singleton_member(relids2) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
+										 &collid);
+			reverse = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((index >= 0) &&
+			   (index < bms_num_members(stat->keys) + list_length(stat->exprs)));
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset  of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform somthing like
+	 * merge join. Or we might calculate hash from the join columns, and then
+	 * compare this (to eliminate most expensive equality functions).
+	 */
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		/*
+		 * We can check mcv1->items[i].isnull[index1] here, because it'll be a
+		 * mismatch anyway (the simple MCV does not contain NULLs).
+		 */
+		if (mcv->items[i].isnull[index])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < sslot->nvalues; j++)
+		{
+			bool	match;
+			Datum	value1 = mcv->items[i].values[index];
+			Datum	value2 = sslot->values[j];
+
+			/*
+			 * Evaluate the join clause between the two MCV lists. We don't
+			 * need to deal with NULL values here - we've already checked for
+			 * NULL in the extended statistics earlier, and the simple MCV
+			 * does not contain NULL values.
+			*
+			 * Careful about order of parameters. For same-type equality
+			 * that should not matter, but easy enough.
+			 *
+			 * FIXME Use appropriate collation.
+			 */
+			if (reverse)
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value2, value1));
+			else
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value1, value2));
+
+			if (match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv->items[i].frequency * sslot->numbers[j];
+
+				/*
+				 * We know there can be just a single match in the regular
+				 * MCV list, so we can abort the inner loop.
+				 */
+				break;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		mcvfreq1 += mcv->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv->items[i].frequency;
+		else
+			unmatchfreq1 += mcv->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < sslot->nvalues; i++)
+	{
+		mcvfreq2 += sslot->numbers[i];
+
+		if (matches2[i])
+			matchfreq2 += sslot->numbers[i];
+		else
+			unmatchfreq2 += sslot->numbers[i];
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel->rows, NULL, NULL);
+	nd2 = nd;
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 55cd9252a5..1e51c54fef 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,6 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
+#include "utils/lsyscache.h"
 #include "utils/sortsupport.h"
 
 typedef struct
@@ -127,4 +128,23 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern Selectivity mcv_combine_simple(PlannerInfo *root,
+									  RelOptInfo *rel,
+									  StatisticExtInfo *stat,
+									  AttStatsSlot *sslot,
+									  double stanullfrac,
+									  double nd, bool isdefault,
+									  Node *clause);
+
+extern Selectivity mcv_combine_extended(PlannerInfo *root,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										StatisticExtInfo *stat1,
+										StatisticExtInfo *stat2,
+										List *clauses);
+
+extern List *statext_determine_join_restrictions(PlannerInfo *root,
+												 RelOptInfo *rel,
+												 StatisticExtInfo *info);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 326cf26fea..4bf27240f6 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -126,4 +126,16 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, int idx);
 
+extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
+
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index c60ba45aba..8846d55c23 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -2974,6 +2974,173 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+       250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+        75 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+      1250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+      1000 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+    100000 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+     30000 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 6fb37962a7..71e59b5279 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1500,6 +1500,72 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.31.1

#10

Tomas Vondra

tomas.vondra@enterprisedb.com

about 4 years ago

In reply to: Andy Fan (#7)

Re: using extended statistics to improve join estimates

On 11/6/21 11:03, Andy Fan wrote:

Hi Tomas:

This is the exact patch I want, thanks for the patch!

Good to hear.

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

explain analyze
select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those.

Actually I can't understand how this works even for a simpler example.
let's say we query like this (ONLY use t2's column to join t3).

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when comes
to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel, so it
is hard to
work. Here I see your solution is splitting it into (t1, t2) AND (t2,
t3) and estimate
those. But how does this help to estimate the size of JoinRel(t1, t2, t3)?

Yeah, this is really confusing. The crucial thing to keep in mind is
this works with clauses before running setrefs.c, so the clauses
reference the original relations - *not* the join relation. Otherwise
even the regular estimation would not work, because where would it get
the per-column MCV lists etc.

Let's use a simple case with join clauses referencing just a single
attribute for each pair or relations. And let's talk about how many join
pairs it'll extract:

t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but the join clause references just 2 rels, so
1 join pair (t1,t3)

Now a more complicated case, with more complex join clause

t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but this time the join clause references all
three rels, so we have two join pairs (t1,t3) and (t2,t3) and we can use
all the statistics.

I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

I guess we can keep the MCVs on joinrel for these matches. Take the above
query I provided for example, and suppose the MCV data as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below

(1, 2, 1, 2) -> 0.1 * 0.2
(1, 3, 1, 3) -> 0.2 * 0.1

And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1)

Right, that's about the joint distribution I whole join.

With this design, the nitems of MCV on joinrel would be less than
either of baserel.

Actually, I think the number of items can grow, because the matches may
duplicate some items. For example in your example with (t1.a = t2.a) the
first first (1,2) item in t1 matches (1,2) and (1,3) in t2. And same for
(1,3) in t1. So that's 4 combinations. Of course, we could aggregate the
MCV by ignoring columns not used in the query.

and since we handle the eqjoin as well, we even can record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;

About when we should maintain the JoinRel's MCV data, rather than
maintain this just
after the JoinRel size is estimated, we can only estimate it when it
is needed. for example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3) since no others
need it at all. However
I don't check code too closely to see if it (Lazing computing MVC on
joinrel) is convenient
to do.

I'm not sure I understand what you're proposing here.

However, I think that estimating it for pairs has two advantages:

1) Combining MCVs for k relations requires k for loops. Processing 2
relations at a time limits the amount of CPU we need. Of course, this
assumes the joins are independent, which may or may not be true.

2) It seems fairly easy to combine different types of statistics
(regular, extended, ...), and also consider the part not represented by
MCV. It seems much harder when joining more than 2 relations.

I'm also worried about amplification of errors - I suspect attempting to
build the joint MCV for the whole join relation may produce significant
estimation errors.

Furthermore, I think joins with clauses referencing more than just two
relations are fairly uncommon. And we can always improve the feature in
this direction in the future.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit
artificial example.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.

Overall, thanks for the feature and I am expecting there are more cases
to handle during discussion. To make the review process more efficient,
I suggest that we split the patch into smaller ones and review/commit them
separately if we have finalized the design roughly . For example:

Patch 1 -- required both sides to have extended statistics.
Patch 2 -- required one side to have extended statistics and the other side had
per-column MCV.
Patch 3 -- handle the case like WHERE t1.a = t2.a and t1.b = Const;
Patch 3 -- handle the case for 3+ table joins.
Patch 4 -- supports the outer join.

I think we can do this if we are sure that each individual patch would work in
some cases and would not make any other case worse. If you agree with this,
I can do that splitting work during my review process.

I'll consider splitting it like this, but I'm not sure it makes the main
patch that much smaller.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Tomas Vondra

tomas.vondra@enterprisedb.com

about 4 years ago

In reply to: Tomas Vondra (#9)

1 attachment(s)

Re: using extended statistics to improve join estimates

Hi,

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Estimate-joins-using-extended-statistics-20220101.patchtext/x-patch; charset=UTF-8; name=0001-Estimate-joins-using-extended-statistics-20220101.patchDownload

From 15d0fa5b565d9ae8b4f333c1d54745397964110d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++++++++++++++++++
 src/backend/statistics/mcv.c                  | 754 ++++++++++++++++
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 167 ++++
 src/test/regress/sql/stats_ext.sql            |  66 ++
 7 files changed, 1886 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf0827..09f3d246c9d 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
@@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 69ca52094f9..57e951400c5 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,8 @@ static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -2608,3 +2611,805 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_find_matching_mcv
+ *		Search for a MCV covering all the attributes and expressions.
+ *
+ * We pick the statistics to use for join estimation. The statistics object has
+ * to have MCV, and we require it to match all the join conditions, because it
+ * makes the estimation simpler.
+ *
+ * If there are multiple candidate statistics objects (matching all join clauses),
+ * we pick the smallest one, and we also consider additional conditions on
+ * the base relations to restrict the MCV items used for estimation (using
+ * conditional probability).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. We could relax this and and require fewer matches (at least two,
+ * if counting the additional conditions), and we might even apply multiple
+ * statistics etc. But that would require matching statistics on both sides of
+ * the join, while now we simply know the statistics match. We don't really
+ * expect many candidate MCVs, so this simple approach seems sufficient. And
+ * the joins usually use only one or two columns, so there's not much room
+ * for applying multiple statistics anyway.
+ */
+StatisticExtInfo *
+statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+						  Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep this one. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics objects and we need to decide
+		 * which one to keep. We'll use two simple heuristics:
+		 *
+		 * (a) We prefer smaller statistics (fewer columns), on the assumption
+		 * that it represents a larger fraction of the data (due to having fewer
+		 * combinations with higher counts).
+		 *
+		 * (b) If the statistics object covers some additional conditions for the rels,
+		 * that may help with considering additional dependencies between the
+		 * tables.
+		 *
+		 * Of course, those two heuristict are somewhat contradictory - smaller
+		 * stats are less likely to cover as many conditions as a larger one. We
+		 * consider the additional conditions first - if someone created such
+		 * statistics, there probably is a dependency worth considering.
+		 *
+		 * When inspecting the restrictions, we need to be careful - we don't
+		 * know which of them are compatible with extended stats, so we have to
+		 * run them through statext_is_compatible_clause first and then match
+		 * them to the statistics.
+		 *
+		 * XXX Maybe we shouldn't pick statistics that covers just a single join
+		 * clause, without any additional conditions. In such case we could just
+		 * as well pick regular statistics for the column/expression, but it's
+		 * not clear if that actually exists (so we might reject the stats here
+		 * and then fail to find something simpler/better).
+		 */
+		conditions1 = statext_determine_join_restrictions(root, rel, stat);
+		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+		/* if the new statistics object covers more conditions, use it */
+		if (list_length(conditions1) > list_length(conditions2))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/* The statistics seem about equal, so just use the smaller one. */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+		}
+	}
+
+	return mcv;
+}
+
+/*
+ * statext_determine_join_restrictions
+ *		Get restrictions on base relation, covered by the statistics object.
+ *
+ * Returns a list of baserel restrictinfos, compatible with extended statistics
+ * and covered by the extended statistics object.
+ *
+ * When using extended statistics to estimate joins, we can use conditions
+ * from base relations to calculate conditional probability
+ *
+ *    P(join clauses | baserel restrictions)
+ *
+ * which should be a better estimate than just P(join clauses). We want to pick
+ * the statistics object covering the most such conditions.
+ */
+List *
+statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
+									StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this particular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which may be
+ * estimated using extended statistics. Each side must reference just a single
+ * relation for now.
+ *
+ * Similar to treat_as_join_clause, but we place additional restrictions
+ * on the conditions.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	/* has to be an equality condition */
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/*
+	 * Check if there are any unestimated join clauses, collect relids.
+	 *
+	 * XXX Currently this only allows simple OpExpr equality clauses with each
+	 * argument referring to single relation, AND-ed together. Maybe we could
+	 * relax this in the future, e.g. to allow more complex (deeper) expressions
+	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 *
+	 * Handling more complex expressions seems simple - we already do that for
+	 * baserel estimates by building the match bitmap recursively, and we could
+	 * do something similar for combinations of MCV items (a bit like building
+	 * a single bit in the match bitmap). The challenge is what to do about the
+	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 */
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+
+		/* needs to happen before skipping any clauses */
+		listidx++;
+
+		/* Skip clauses that were already estimated. */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+
+		/* Collect relids from all usable clauses. */
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check how
+	 * compatible they are (e.g. that both have MCVs, etc.). We might also
+	 * cross-check the exact joined pairs of rels, but it's supposed to be a
+	 * cheap check, so maybe better leave that for later.
+	 *
+	 * XXX We could also check the number of parameters in each rel to consider
+	 * extended stats. If there's just a single attribute, it's pointless to use
+	 * extended statistics. OTOH we can also consider restriction clauses from
+	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about two joined relations, along with the join clauses between.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc).
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation).
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics object covering the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int			k;
+	int			relid;
+	RelOptInfo *rel;
+	ListCell   *lc;
+	List	   *exprs = NIL;
+
+	Bitmapset  *attnums = NULL;
+
+	Assert((index >= 0) && (index <= 1));
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * get_expression_for_rel
+ *		Extract expression for a given relation from the join clause.
+ *
+ * Given a join clause supported by the extended statistics object (currently
+ * that means just OpExpr clauses with each argument referencing single rel),
+ * return either the left or right argument expression for the rel.
+ *
+ * XXX This should probably return a flag identifying whether it's the
+ * left or right argument.
+ */
+static Node *
+get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
+{
+	OpExpr *opexpr;
+	Node   *expr;
+
+	/*
+	 * Strip the RestrictInfo node, get the actual clause.
+	 *
+	 * XXX Not sure if we need to care about removing other node types
+	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+	 * matches this, but maybe we need to relax it?
+	 */
+	if (IsA(clause, RestrictInfo))
+		clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+	opexpr = (OpExpr *) clause;
+
+	/* Make sure we have the expected node type. */
+	Assert(is_opclause(clause));
+	Assert(list_length(opexpr->args) == 2);
+
+	/* FIXME strip relabel etc. the way examine_opclause_args does */
+	expr = linitial(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	expr = lsecond(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	return NULL;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * a better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX We might also handle cases with a regular MCV on one side and
+	 * an extended MCV on the other side.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		ListCell *lc;
+
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/*
+		 * We can handle three basic cases:
+		 *
+		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
+		 * can simply combine the two MCVs, possibly with additional conditions
+		 * from the relations.
+		 *
+		 * b) Extended stats on one side, regular MCV on the other side (this
+		 * means there's just one join clause / expression). It also means the
+		 * extended stats likely covers at least one extra condition, otherwise
+		 * we could just use regular statistics. We can combine the stats just
+		 * similarly to (a).
+		 *
+		 * c) No extended stats with MCV. If there are multiple join clauses,
+		 * we can try using ndistinct coefficients and do what eqjoinsel does.
+		 *
+		 * If none of these applies, we fallback to the regular selectivity
+		 * estimation in eqjoinsel.
+		 */
+		if (stat1 && stat2)
+		{
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+		}
+		else if (stat1 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel2, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel1, stat1, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else if (stat2 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = (Node *) linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel1, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel2, stat2, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else
+			continue;
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index b350fc5f7b2..779a4e6121a 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -24,6 +24,7 @@
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
+#include "optimizer/optimizer.h"
 #include "statistics/extended_stats_internal.h"
 #include "statistics/statistics.h"
 #include "utils/array.h"
@@ -2156,3 +2157,756 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	return s;
 }
+
+/*
+ * statext_compare_mcvs
+ *		Calculate join selectivity using extended statistics, similar to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing a
+ * conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column MCV
+ * lists it's obvious that the number of distinct values not covered by the MCV
+ * is (ndistinct - size(MCV)). With multi-column MCVs it's not that simple,
+ * particularly when the conditions are on a subset of the MCV attributes and/or
+ * NULLs are involved. E.g. with MCV (a,b,c) and conditions on (a,b), it's not
+ * clear if the number of (a,b) combinations not covered by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the part of the data not covered by the MCV. So we need
+ * to pick something in between, there's no way to calculate this accurately.
+ */
+Selectivity
+mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	ListCell   *lc;
+
+	MCVList    *mcv1,
+			   *mcv2;
+	int			idx,
+				i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	List   *conditions1 = NIL,
+		   *conditions2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
+
+	double	csel1 = 1.0,
+			csel2 = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two relations */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo   *opprocs;
+	int		   *indexes1,
+			   *indexes2;
+	bool	   *reverse;
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert((stat1->kind = STATS_EXT_MCV) && (stat2->kind = STATS_EXT_MCV));
+
+	mcv1 = statext_mcv_load(stat1->statOid);
+	mcv2 = statext_mcv_load(stat2->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
+	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions1)
+	{
+		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+										 stat1->keys, stat1->exprs,
+										 mcv1, false);
+		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+	}
+
+	if (conditions2)
+	{
+		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+										 stat2->keys, stat2->exprs,
+										 mcv2, false);
+		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/*
+	 * Initialize information about clauses and how they match to the MCV
+	 * stats we picked. We do this only once before processing the lists,
+	 * so that we don't have to do that for each MCV item or so.
+	 */
+	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
+	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
+	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
+	idx = 0;
+	foreach (lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if ((bms_singleton_member(relids1) == rel1->relid) &&
+			(bms_singleton_member(relids2) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			reverse[idx] = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if ((bms_singleton_member(relids2) == rel1->relid) &&
+				 (bms_singleton_member(relids1) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			reverse[idx] = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((indexes1[idx] >= 0) &&
+			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+
+		Assert((indexes2[idx] >= 0) &&
+			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		idx++;
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		bool	has_nulls;
+
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		/*
+		 * Check if any value in the first MCV item is NULL, because it'll be
+		 * mismatch anyway.
+		 *
+		 * XXX This might not work for some join clauses, e.g. IS NOT DISTINCT
+		 * FROM, but those are currently not considered compatible (we only
+		 * allow OpExpr at the moment).
+		 */
+		has_nulls = false;
+		for (j = 0; j < list_length(clauses); j++)
+			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+
+		if (has_nulls)
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			int			idx;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (cmatches2 && !cmatches2[j])
+				continue;
+
+			/*
+			 * XXX We can't skip based on existing matches2 value, because there
+			 * may be duplicates in the first MCV.
+			 */
+
+			/*
+			 * Evaluate if all the join clauses match between the two MCV items.
+			 *
+			 * XXX We might optimize the order of evaluation, using the costs of
+			 * operator functions for individual columns. It does depend on the
+			 * number of distinct values, etc.
+			 */
+			idx = 0;
+			foreach (lc, clauses)
+			{
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+
+				/* If either value is null, it's a mismatch */
+				if (mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * Careful about order of parameters. For same-type equality
+					 * that should not matter, but easy enough.
+					 *
+					 * FIXME Use appropriate collation.
+					 */
+					if (reverse_args)
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value2, value1));
+					else
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+
+				idx++;
+			}
+
+			if (items_match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches2 && !cmatches2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel1->rows, NULL, NULL);
+	nd2 = estimate_num_groups(root, exprs2, rel2->rows, NULL, NULL);
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel1;
+	nd2 *= csel2;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+
+/*
+ * statext_compare_simple
+ *		Calculate join selectivity using a combination of extended
+ * statistics MCV on one side, and simple per-column MCV on the other.
+ *
+ * Most of the mcv_combine_extended comment applies here too, but we can make
+ * some simplifications because we know the second (per-column) MCV is simpler,
+ * contains no NULL or duplicate values, etc.
+ */
+Selectivity
+mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
+				   AttStatsSlot *sslot, double stanullfrac, double nd,
+				   bool isdefault, Node *clause)
+{
+	MCVList    *mcv;
+	int			i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *conditions = NIL;
+	bool   *cmatches = NULL;
+
+	double	csel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two sides */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo	opproc;
+	int			index;
+	bool		reverse;
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert(stat->kind = STATS_EXT_MCV);
+
+	mcv = statext_mcv_load(stat->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions = statext_determine_join_restrictions(root, rel, stat);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions)
+	{
+		cmatches = mcv_get_match_bitmap(root, conditions,
+										 stat->keys, stat->exprs,
+										 mcv, false);
+		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
+
+	/* Matches for the side with just regular single-column MCV. */
+	matches2 = (bool *) palloc0(sizeof(bool) * sslot->nvalues);
+
+	/*
+	 * Initialize information about the clause and how it matches to the
+	 * extended stats we picked. We do this only once before processing
+	 * the lists, so that we don't have to do that for each item or so.
+	 */
+	{
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opproc);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if (bms_singleton_member(relids1) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
+										 &collid);
+			reverse = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if (bms_singleton_member(relids2) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
+										 &collid);
+			reverse = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((index >= 0) &&
+			   (index < bms_num_members(stat->keys) + list_length(stat->exprs)));
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		/*
+		 * We can check mcv1->items[i].isnull[index1] here, because it'll be a
+		 * mismatch anyway (the simple MCV does not contain NULLs).
+		 */
+		if (mcv->items[i].isnull[index])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < sslot->nvalues; j++)
+		{
+			bool	match;
+			Datum	value1 = mcv->items[i].values[index];
+			Datum	value2 = sslot->values[j];
+
+			/*
+			 * Evaluate the join clause between the two MCV lists. We don't
+			 * need to deal with NULL values here - we've already checked for
+			 * NULL in the extended statistics earlier, and the simple MCV
+			 * does not contain NULL values.
+			 *
+			 * Careful about order of parameters. For same-type equality
+			 * that should not matter, but easy enough.
+			 *
+			 * FIXME Use appropriate collation.
+			 */
+			if (reverse)
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value2, value1));
+			else
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value1, value2));
+
+			if (match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv->items[i].frequency * sslot->numbers[j];
+
+				/*
+				 * We know there can be just a single match in the regular
+				 * MCV list, so we can abort the inner loop.
+				 */
+				break;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		mcvfreq1 += mcv->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv->items[i].frequency;
+		else
+			unmatchfreq1 += mcv->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < sslot->nvalues; i++)
+	{
+		mcvfreq2 += sslot->numbers[i];
+
+		if (matches2[i])
+			matchfreq2 += sslot->numbers[i];
+		else
+			unmatchfreq2 += sslot->numbers[i];
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel->rows, NULL, NULL);
+	nd2 = nd;
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 55cd9252a55..1e51c54fefb 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,6 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
+#include "utils/lsyscache.h"
 #include "utils/sortsupport.h"
 
 typedef struct
@@ -127,4 +128,23 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern Selectivity mcv_combine_simple(PlannerInfo *root,
+									  RelOptInfo *rel,
+									  StatisticExtInfo *stat,
+									  AttStatsSlot *sslot,
+									  double stanullfrac,
+									  double nd, bool isdefault,
+									  Node *clause);
+
+extern Selectivity mcv_combine_extended(PlannerInfo *root,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										StatisticExtInfo *stat1,
+										StatisticExtInfo *stat2,
+										List *clauses);
+
+extern List *statext_determine_join_restrictions(PlannerInfo *root,
+												 RelOptInfo *rel,
+												 StatisticExtInfo *info);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 326cf26feae..4bf27240f6f 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -126,4 +126,16 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, int idx);
 
+extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
+
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index c60ba45aba8..8846d55c236 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -2974,6 +2974,173 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+       250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+        75 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+      1250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+      1000 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+    100000 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+     30000 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 6fb37962a72..71e59b52798 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1500,6 +1500,72 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.31.1

#12

Andres Freund

andres@anarazel.de

about 4 years ago

In reply to: Tomas Vondra (#11)

Re: using extended statistics to improve join estimates

On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:

https://cirrus-ci.com/task/6067262669979648?logs=gcc_warning#L505
[18:52:15.132] time make -s -j${BUILD_JOBS} world-bin
[18:52:22.697] mcv.c: In function ‘mcv_combine_simple’:
[18:52:22.697] mcv.c:2787:7: error: ‘reverse’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[18:52:22.697] 2787 | if (reverse)
[18:52:22.697] | ^
[18:52:22.697] mcv.c:2766:27: error: ‘index’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[18:52:22.697] 2766 | if (mcv->items[i].isnull[index])
[18:52:22.697] | ^

Greetings,

Andres Freund

#13

Julien Rouhaud

rjuju123@gmail.com

almost 4 years ago

In reply to: Andres Freund (#12)

Re: using extended statistics to improve join estimates

Hi,

On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:

On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:

Also the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3055.log
=== Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
=== applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
patching file src/backend/optimizer/path/clausesel.c
patching file src/backend/statistics/extended_stats.c
Hunk #1 FAILED at 30.
Hunk #2 succeeded at 102 (offset 1 line).
Hunk #3 succeeded at 2619 (offset 9 lines).
1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej

#14

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Julien Rouhaud (#13)

Re: using extended statistics to improve join estimates

On Wed, Jan 19, 2022 at 06:18:09PM +0800, Julien Rouhaud wrote:

On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:

On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:

Here's an updated patch, rebased and fixing a couple typos reported by
Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:

Also the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3055.log
=== Applying patches on top of PostgreSQL commit ID 74527c3e022d3ace648340b79a6ddec3419f6732 ===
=== applying patch ./0001-Estimate-joins-using-extended-statistics-20220101.patch
patching file src/backend/optimizer/path/clausesel.c
patching file src/backend/statistics/extended_stats.c
Hunk #1 FAILED at 30.
Hunk #2 succeeded at 102 (offset 1 line).
Hunk #3 succeeded at 2619 (offset 9 lines).
1 out of 3 hunks FAILED -- saving rejects to file src/backend/statistics/extended_stats.c.rej

Rebased over 269b532ae and muted compiler warnings.

Tomas - is this patch viable for pg15 , or should move to the next CF ?

In case it's useful, I ran this on cirrus with my branch for code coverage.
https://cirrus-ci.com/task/5816731397521408
https://api.cirrus-ci.com/v1/artifact/task/5816731397521408/coverage/coverage/00-index.html

statext_find_matching_mcv() has poor coverage.
statext_clauselist_join_selectivity() has poor coverage for the "stats2" case.

In mcv.c: mcv_combine_extended() and mcv_combine_simple() have poor coverage
for the "else if" cases (does it matter?)

Not related to this patch:
build_attnums_array() isn't being hit.

Same at statext_is_compatible_clause_internal()
1538 0 : *exprs = lappend(*exprs, clause);

statext_mcv_[de]serialize() aren't being hit for cstrings.

--
Justin

#15

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Justin Pryzby (#14)

1 attachment(s)

Re: using extended statistics to improve join estimates

On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:

Rebased over 269b532ae and muted compiler warnings.

And attached.

Attachments:

0001-Estimate-joins-using-extended-statistics.patchtext/x-diff; charset=us-asciiDownload

From 587a5e9fe87c26cdcd9602fc349f092da95cc580 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++++++++++++++++++
 src/backend/statistics/mcv.c                  | 757 ++++++++++++++++
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 167 ++++
 src/test/regress/sql/stats_ext.sql            |  66 ++
 7 files changed, 1889 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 06f836308d0..1b2227321a2 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
@@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index ca48395d5c5..427ba015b73 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -31,6 +31,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "parser/parsetree.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -103,6 +104,8 @@ static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -2611,3 +2614,805 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_find_matching_mcv
+ *		Search for a MCV covering all the attributes and expressions.
+ *
+ * We pick the statistics to use for join estimation. The statistics object has
+ * to have MCV, and we require it to match all the join conditions, because it
+ * makes the estimation simpler.
+ *
+ * If there are multiple candidate statistics objects (matching all join clauses),
+ * we pick the smallest one, and we also consider additional conditions on
+ * the base relations to restrict the MCV items used for estimation (using
+ * conditional probability).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. We could relax this and and require fewer matches (at least two,
+ * if counting the additional conditions), and we might even apply multiple
+ * statistics etc. But that would require matching statistics on both sides of
+ * the join, while now we simply know the statistics match. We don't really
+ * expect many candidate MCVs, so this simple approach seems sufficient. And
+ * the joins usually use only one or two columns, so there's not much room
+ * for applying multiple statistics anyway.
+ */
+StatisticExtInfo *
+statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+						  Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep this one. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics objects and we need to decide
+		 * which one to keep. We'll use two simple heuristics:
+		 *
+		 * (a) We prefer smaller statistics (fewer columns), on the assumption
+		 * that it represents a larger fraction of the data (due to having fewer
+		 * combinations with higher counts).
+		 *
+		 * (b) If the statistics object covers some additional conditions for the rels,
+		 * that may help with considering additional dependencies between the
+		 * tables.
+		 *
+		 * Of course, those two heuristict are somewhat contradictory - smaller
+		 * stats are less likely to cover as many conditions as a larger one. We
+		 * consider the additional conditions first - if someone created such
+		 * statistics, there probably is a dependency worth considering.
+		 *
+		 * When inspecting the restrictions, we need to be careful - we don't
+		 * know which of them are compatible with extended stats, so we have to
+		 * run them through statext_is_compatible_clause first and then match
+		 * them to the statistics.
+		 *
+		 * XXX Maybe we shouldn't pick statistics that covers just a single join
+		 * clause, without any additional conditions. In such case we could just
+		 * as well pick regular statistics for the column/expression, but it's
+		 * not clear if that actually exists (so we might reject the stats here
+		 * and then fail to find something simpler/better).
+		 */
+		conditions1 = statext_determine_join_restrictions(root, rel, stat);
+		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+		/* if the new statistics object covers more conditions, use it */
+		if (list_length(conditions1) > list_length(conditions2))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/* The statistics seem about equal, so just use the smaller one. */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+		}
+	}
+
+	return mcv;
+}
+
+/*
+ * statext_determine_join_restrictions
+ *		Get restrictions on base relation, covered by the statistics object.
+ *
+ * Returns a list of baserel restrictinfos, compatible with extended statistics
+ * and covered by the extended statistics object.
+ *
+ * When using extended statistics to estimate joins, we can use conditions
+ * from base relations to calculate conditional probability
+ *
+ *    P(join clauses | baserel restrictions)
+ *
+ * which should be a better estimate than just P(join clauses). We want to pick
+ * the statistics object covering the most such conditions.
+ */
+List *
+statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
+									StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this particular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which may be
+ * estimated using extended statistics. Each side must reference just a single
+ * relation for now.
+ *
+ * Similar to treat_as_join_clause, but we place additional restrictions
+ * on the conditions.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	/* has to be an equality condition */
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/*
+	 * Check if there are any unestimated join clauses, collect relids.
+	 *
+	 * XXX Currently this only allows simple OpExpr equality clauses with each
+	 * argument referring to single relation, AND-ed together. Maybe we could
+	 * relax this in the future, e.g. to allow more complex (deeper) expressions
+	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 *
+	 * Handling more complex expressions seems simple - we already do that for
+	 * baserel estimates by building the match bitmap recursively, and we could
+	 * do something similar for combinations of MCV items (a bit like building
+	 * a single bit in the match bitmap). The challenge is what to do about the
+	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 */
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+
+		/* needs to happen before skipping any clauses */
+		listidx++;
+
+		/* Skip clauses that were already estimated. */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+
+		/* Collect relids from all usable clauses. */
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check how
+	 * compatible they are (e.g. that both have MCVs, etc.). We might also
+	 * cross-check the exact joined pairs of rels, but it's supposed to be a
+	 * cheap check, so maybe better leave that for later.
+	 *
+	 * XXX We could also check the number of parameters in each rel to consider
+	 * extended stats. If there's just a single attribute, it's pointless to use
+	 * extended statistics. OTOH we can also consider restriction clauses from
+	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about two joined relations, along with the join clauses between.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc).
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation).
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics object covering the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int			k;
+	int			relid;
+	RelOptInfo *rel;
+	ListCell   *lc;
+	List	   *exprs = NIL;
+
+	Bitmapset  *attnums = NULL;
+
+	Assert((index >= 0) && (index <= 1));
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * get_expression_for_rel
+ *		Extract expression for a given relation from the join clause.
+ *
+ * Given a join clause supported by the extended statistics object (currently
+ * that means just OpExpr clauses with each argument referencing single rel),
+ * return either the left or right argument expression for the rel.
+ *
+ * XXX This should probably return a flag identifying whether it's the
+ * left or right argument.
+ */
+static Node *
+get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
+{
+	OpExpr *opexpr;
+	Node   *expr;
+
+	/*
+	 * Strip the RestrictInfo node, get the actual clause.
+	 *
+	 * XXX Not sure if we need to care about removing other node types
+	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+	 * matches this, but maybe we need to relax it?
+	 */
+	if (IsA(clause, RestrictInfo))
+		clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+	opexpr = (OpExpr *) clause;
+
+	/* Make sure we have the expected node type. */
+	Assert(is_opclause(clause));
+	Assert(list_length(opexpr->args) == 2);
+
+	/* FIXME strip relabel etc. the way examine_opclause_args does */
+	expr = linitial(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	expr = lsecond(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	return NULL;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * a better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX We might also handle cases with a regular MCV on one side and
+	 * an extended MCV on the other side.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		ListCell *lc;
+
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/*
+		 * We can handle three basic cases:
+		 *
+		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
+		 * can simply combine the two MCVs, possibly with additional conditions
+		 * from the relations.
+		 *
+		 * b) Extended stats on one side, regular MCV on the other side (this
+		 * means there's just one join clause / expression). It also means the
+		 * extended stats likely covers at least one extra condition, otherwise
+		 * we could just use regular statistics. We can combine the stats just
+		 * similarly to (a).
+		 *
+		 * c) No extended stats with MCV. If there are multiple join clauses,
+		 * we can try using ndistinct coefficients and do what eqjoinsel does.
+		 *
+		 * If none of these applies, we fallback to the regular selectivity
+		 * estimation in eqjoinsel.
+		 */
+		if (stat1 && stat2)
+		{
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+		}
+		else if (stat1 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel2, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel1, stat1, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else if (stat2 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = (Node *) linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel1, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel2, stat2, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else
+			continue;
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 1ef30344285..fbfebf88ac0 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -24,6 +24,7 @@
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
+#include "optimizer/optimizer.h"
 #include "statistics/extended_stats_internal.h"
 #include "statistics/statistics.h"
 #include "utils/array.h"
@@ -2158,3 +2159,759 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	return s;
 }
+
+/*
+ * statext_compare_mcvs
+ *		Calculate join selectivity using extended statistics, similar to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing a
+ * conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column MCV
+ * lists it's obvious that the number of distinct values not covered by the MCV
+ * is (ndistinct - size(MCV)). With multi-column MCVs it's not that simple,
+ * particularly when the conditions are on a subset of the MCV attributes and/or
+ * NULLs are involved. E.g. with MCV (a,b,c) and conditions on (a,b), it's not
+ * clear if the number of (a,b) combinations not covered by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the part of the data not covered by the MCV. So we need
+ * to pick something in between, there's no way to calculate this accurately.
+ */
+Selectivity
+mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	ListCell   *lc;
+
+	MCVList    *mcv1,
+			   *mcv2;
+	int			idx,
+				i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	List   *conditions1 = NIL,
+		   *conditions2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
+
+	double	csel1 = 1.0,
+			csel2 = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two relations */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo   *opprocs;
+	int		   *indexes1,
+			   *indexes2;
+	bool	   *reverse;
+	RangeTblEntry *rte1 = root->simple_rte_array[rel1->relid];
+	RangeTblEntry *rte2 = root->simple_rte_array[rel2->relid];
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert((stat1->kind = STATS_EXT_MCV) && (stat2->kind = STATS_EXT_MCV));
+
+	mcv1 = statext_mcv_load(stat1->statOid, rte1->inh);
+	mcv2 = statext_mcv_load(stat2->statOid, rte2->inh);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
+	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions1)
+	{
+		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+										 stat1->keys, stat1->exprs,
+										 mcv1, false);
+		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+	}
+
+	if (conditions2)
+	{
+		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+										 stat2->keys, stat2->exprs,
+										 mcv2, false);
+		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/*
+	 * Initialize information about clauses and how they match to the MCV
+	 * stats we picked. We do this only once before processing the lists,
+	 * so that we don't have to do that for each MCV item or so.
+	 */
+	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
+	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
+	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
+	idx = 0;
+	foreach (lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if ((bms_singleton_member(relids1) == rel1->relid) &&
+			(bms_singleton_member(relids2) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			reverse[idx] = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if ((bms_singleton_member(relids2) == rel1->relid) &&
+				 (bms_singleton_member(relids1) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			reverse[idx] = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((indexes1[idx] >= 0) &&
+			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+
+		Assert((indexes2[idx] >= 0) &&
+			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		idx++;
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		bool	has_nulls;
+
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		/*
+		 * Check if any value in the first MCV item is NULL, because it'll be
+		 * mismatch anyway.
+		 *
+		 * XXX This might not work for some join clauses, e.g. IS NOT DISTINCT
+		 * FROM, but those are currently not considered compatible (we only
+		 * allow OpExpr at the moment).
+		 */
+		has_nulls = false;
+		for (j = 0; j < list_length(clauses); j++)
+			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+
+		if (has_nulls)
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			int			idx;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (cmatches2 && !cmatches2[j])
+				continue;
+
+			/*
+			 * XXX We can't skip based on existing matches2 value, because there
+			 * may be duplicates in the first MCV.
+			 */
+
+			/*
+			 * Evaluate if all the join clauses match between the two MCV items.
+			 *
+			 * XXX We might optimize the order of evaluation, using the costs of
+			 * operator functions for individual columns. It does depend on the
+			 * number of distinct values, etc.
+			 */
+			idx = 0;
+			foreach (lc, clauses)
+			{
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+
+				/* If either value is null, it's a mismatch */
+				if (mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * Careful about order of parameters. For same-type equality
+					 * that should not matter, but easy enough.
+					 *
+					 * FIXME Use appropriate collation.
+					 */
+					if (reverse_args)
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value2, value1));
+					else
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+
+				idx++;
+			}
+
+			if (items_match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches2 && !cmatches2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel1->rows, NULL, NULL);
+	nd2 = estimate_num_groups(root, exprs2, rel2->rows, NULL, NULL);
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel1;
+	nd2 *= csel2;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+
+/*
+ * statext_compare_simple
+ *		Calculate join selectivity using a combination of extended
+ * statistics MCV on one side, and simple per-column MCV on the other.
+ *
+ * Most of the mcv_combine_extended comment applies here too, but we can make
+ * some simplifications because we know the second (per-column) MCV is simpler,
+ * contains no NULL or duplicate values, etc.
+ */
+Selectivity
+mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
+				   AttStatsSlot *sslot, double stanullfrac, double nd,
+				   bool isdefault, Node *clause)
+{
+	MCVList    *mcv;
+	int			i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *conditions = NIL;
+	bool   *cmatches = NULL;
+
+	double	csel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two sides */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo	opproc;
+	int			index = 0;
+	bool		reverse = false;
+	RangeTblEntry *rte = root->simple_rte_array[rel->relid];
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert(stat->kind = STATS_EXT_MCV);
+
+	mcv = statext_mcv_load(stat->statOid, rte->inh);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions = statext_determine_join_restrictions(root, rel, stat);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions)
+	{
+		cmatches = mcv_get_match_bitmap(root, conditions,
+										 stat->keys, stat->exprs,
+										 mcv, false);
+		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
+
+	/* Matches for the side with just regular single-column MCV. */
+	matches2 = (bool *) palloc0(sizeof(bool) * sslot->nvalues);
+
+	/*
+	 * Initialize information about the clause and how it matches to the
+	 * extended stats we picked. We do this only once before processing
+	 * the lists, so that we don't have to do that for each item or so.
+	 */
+	{
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opproc);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if (bms_singleton_member(relids1) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
+										 &collid);
+			reverse = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if (bms_singleton_member(relids2) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
+										 &collid);
+			reverse = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((index >= 0) &&
+			   (index < bms_num_members(stat->keys) + list_length(stat->exprs)));
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		/*
+		 * We can check mcv1->items[i].isnull[index1] here, because it'll be a
+		 * mismatch anyway (the simple MCV does not contain NULLs).
+		 */
+		if (mcv->items[i].isnull[index])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < sslot->nvalues; j++)
+		{
+			bool	match;
+			Datum	value1 = mcv->items[i].values[index];
+			Datum	value2 = sslot->values[j];
+
+			/*
+			 * Evaluate the join clause between the two MCV lists. We don't
+			 * need to deal with NULL values here - we've already checked for
+			 * NULL in the extended statistics earlier, and the simple MCV
+			 * does not contain NULL values.
+			 *
+			 * Careful about order of parameters. For same-type equality
+			 * that should not matter, but easy enough.
+			 *
+			 * FIXME Use appropriate collation.
+			 */
+			if (reverse)
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value2, value1));
+			else
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value1, value2));
+
+			if (match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv->items[i].frequency * sslot->numbers[j];
+
+				/*
+				 * We know there can be just a single match in the regular
+				 * MCV list, so we can abort the inner loop.
+				 */
+				break;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		mcvfreq1 += mcv->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv->items[i].frequency;
+		else
+			unmatchfreq1 += mcv->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < sslot->nvalues; i++)
+	{
+		mcvfreq2 += sslot->numbers[i];
+
+		if (matches2[i])
+			matchfreq2 += sslot->numbers[i];
+		else
+			unmatchfreq2 += sslot->numbers[i];
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel->rows, NULL, NULL);
+	nd2 = nd;
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 71f852c157b..d115c6c791c 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,6 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
+#include "utils/lsyscache.h"
 #include "utils/sortsupport.h"
 
 typedef struct
@@ -127,4 +128,23 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern Selectivity mcv_combine_simple(PlannerInfo *root,
+									  RelOptInfo *rel,
+									  StatisticExtInfo *stat,
+									  AttStatsSlot *sslot,
+									  double stanullfrac,
+									  double nd, bool isdefault,
+									  Node *clause);
+
+extern Selectivity mcv_combine_extended(PlannerInfo *root,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										StatisticExtInfo *stat1,
+										StatisticExtInfo *stat2,
+										List *clauses);
+
+extern List *statext_determine_join_restrictions(PlannerInfo *root,
+												 RelOptInfo *rel,
+												 StatisticExtInfo *info);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index bb7ef1240e0..c69cadff3a8 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -127,4 +127,16 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
+extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
+
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 042316aeed8..e5577e680f4 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3045,6 +3045,173 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+       250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+        75 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+      1250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+      1000 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+    100000 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+     30000 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 6b954c9e500..4fb2c518d2c 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1533,6 +1533,72 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.17.1

#16

Andy Fan

zhihuifan1213@163.com

almost 2 years ago

In reply to: Justin Pryzby (#15)

8 attachment(s)

Re: using extended statistics to improve join estimates

On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:

Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the path at the code level, I want
to explain my understanding about this patch first.

Before this patch, we already use MCV information for the eqjoinsel, it
works as combine the MCV on the both sides to figure out the mcv_freq
and then treat the rest equally, but this doesn't work for MCV in
extended statistics, this patch fill this gap. Besides that, since
extended statistics means more than 1 columns are involved, if 1+
columns are Const based on RestrictInfo, we can use such information to
filter the MCVs we are interesting, that's really cool.

I did some more testing, all of them are inner join so far, all of them
works amazing and I am suprised this patch didn't draw enough
attention. I will test more after I go though the code.

At for the code level, I reviewed them in the top-down manner and almost
40% completed. Here are some findings just FYI. For efficiency purpose,
I provide each feedback with a individual commit, after all I want to
make sure my comment is practical and coding and testing is a good way
to archive that. I tried to make each of them as small as possible so
that you can reject or accept them convinently.

0001 is your patch, I just rebase them against the current master. 0006
is not much relevant with current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

--
Best Regards
Andy Fan

Attachments:

v1-0001-Estimate-joins-using-extended-statistics.patchtext/x-diffDownload

From daa6c27bc7dd0631607f0f254cc15491633a9ccc Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH v1 1/8] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++++++++++++++++++
 src/backend/statistics/mcv.c                  | 758 +++++++++++++++++
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 167 ++++
 src/test/regress/sql/stats_ext.sql            |  66 ++
 7 files changed, 1890 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..bedf76edae 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -48,6 +48,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -127,12 +130,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
@@ -155,6 +199,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 5d7bdc9d12..183a8af07b 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -29,6 +29,7 @@
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "parser/parsetree.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -100,6 +101,8 @@ static StatsBuildData *make_build_data(Relation rel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -2633,3 +2636,805 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_find_matching_mcv
+ *		Search for a MCV covering all the attributes and expressions.
+ *
+ * We pick the statistics to use for join estimation. The statistics object has
+ * to have MCV, and we require it to match all the join conditions, because it
+ * makes the estimation simpler.
+ *
+ * If there are multiple candidate statistics objects (matching all join clauses),
+ * we pick the smallest one, and we also consider additional conditions on
+ * the base relations to restrict the MCV items used for estimation (using
+ * conditional probability).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. We could relax this and and require fewer matches (at least two,
+ * if counting the additional conditions), and we might even apply multiple
+ * statistics etc. But that would require matching statistics on both sides of
+ * the join, while now we simply know the statistics match. We don't really
+ * expect many candidate MCVs, so this simple approach seems sufficient. And
+ * the joins usually use only one or two columns, so there's not much room
+ * for applying multiple statistics anyway.
+ */
+StatisticExtInfo *
+statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+						  Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep this one. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics objects and we need to decide
+		 * which one to keep. We'll use two simple heuristics:
+		 *
+		 * (a) We prefer smaller statistics (fewer columns), on the assumption
+		 * that it represents a larger fraction of the data (due to having fewer
+		 * combinations with higher counts).
+		 *
+		 * (b) If the statistics object covers some additional conditions for the rels,
+		 * that may help with considering additional dependencies between the
+		 * tables.
+		 *
+		 * Of course, those two heuristict are somewhat contradictory - smaller
+		 * stats are less likely to cover as many conditions as a larger one. We
+		 * consider the additional conditions first - if someone created such
+		 * statistics, there probably is a dependency worth considering.
+		 *
+		 * When inspecting the restrictions, we need to be careful - we don't
+		 * know which of them are compatible with extended stats, so we have to
+		 * run them through statext_is_compatible_clause first and then match
+		 * them to the statistics.
+		 *
+		 * XXX Maybe we shouldn't pick statistics that covers just a single join
+		 * clause, without any additional conditions. In such case we could just
+		 * as well pick regular statistics for the column/expression, but it's
+		 * not clear if that actually exists (so we might reject the stats here
+		 * and then fail to find something simpler/better).
+		 */
+		conditions1 = statext_determine_join_restrictions(root, rel, stat);
+		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+		/* if the new statistics object covers more conditions, use it */
+		if (list_length(conditions1) > list_length(conditions2))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/* The statistics seem about equal, so just use the smaller one. */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+		}
+	}
+
+	return mcv;
+}
+
+/*
+ * statext_determine_join_restrictions
+ *		Get restrictions on base relation, covered by the statistics object.
+ *
+ * Returns a list of baserel restrictinfos, compatible with extended statistics
+ * and covered by the extended statistics object.
+ *
+ * When using extended statistics to estimate joins, we can use conditions
+ * from base relations to calculate conditional probability
+ *
+ *    P(join clauses | baserel restrictions)
+ *
+ * which should be a better estimate than just P(join clauses). We want to pick
+ * the statistics object covering the most such conditions.
+ */
+List *
+statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
+									StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this particular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which may be
+ * estimated using extended statistics. Each side must reference just a single
+ * relation for now.
+ *
+ * Similar to treat_as_join_clause, but we place additional restrictions
+ * on the conditions.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	/* has to be an equality condition */
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/*
+	 * Check if there are any unestimated join clauses, collect relids.
+	 *
+	 * XXX Currently this only allows simple OpExpr equality clauses with each
+	 * argument referring to single relation, AND-ed together. Maybe we could
+	 * relax this in the future, e.g. to allow more complex (deeper) expressions
+	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 *
+	 * Handling more complex expressions seems simple - we already do that for
+	 * baserel estimates by building the match bitmap recursively, and we could
+	 * do something similar for combinations of MCV items (a bit like building
+	 * a single bit in the match bitmap). The challenge is what to do about the
+	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 */
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+
+		/* needs to happen before skipping any clauses */
+		listidx++;
+
+		/* Skip clauses that were already estimated. */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+
+		/* Collect relids from all usable clauses. */
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check how
+	 * compatible they are (e.g. that both have MCVs, etc.). We might also
+	 * cross-check the exact joined pairs of rels, but it's supposed to be a
+	 * cheap check, so maybe better leave that for later.
+	 *
+	 * XXX We could also check the number of parameters in each rel to consider
+	 * extended stats. If there's just a single attribute, it's pointless to use
+	 * extended statistics. OTOH we can also consider restriction clauses from
+	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about two joined relations, along with the join clauses between.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc).
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation).
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics object covering the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int			k;
+	int			relid;
+	RelOptInfo *rel;
+	ListCell   *lc;
+	List	   *exprs = NIL;
+
+	Bitmapset  *attnums = NULL;
+
+	Assert((index >= 0) && (index <= 1));
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * get_expression_for_rel
+ *		Extract expression for a given relation from the join clause.
+ *
+ * Given a join clause supported by the extended statistics object (currently
+ * that means just OpExpr clauses with each argument referencing single rel),
+ * return either the left or right argument expression for the rel.
+ *
+ * XXX This should probably return a flag identifying whether it's the
+ * left or right argument.
+ */
+static Node *
+get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
+{
+	OpExpr *opexpr;
+	Node   *expr;
+
+	/*
+	 * Strip the RestrictInfo node, get the actual clause.
+	 *
+	 * XXX Not sure if we need to care about removing other node types
+	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+	 * matches this, but maybe we need to relax it?
+	 */
+	if (IsA(clause, RestrictInfo))
+		clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+	opexpr = (OpExpr *) clause;
+
+	/* Make sure we have the expected node type. */
+	Assert(is_opclause(clause));
+	Assert(list_length(opexpr->args) == 2);
+
+	/* FIXME strip relabel etc. the way examine_opclause_args does */
+	expr = linitial(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	expr = lsecond(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	return NULL;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * a better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX We might also handle cases with a regular MCV on one side and
+	 * an extended MCV on the other side.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		ListCell *lc;
+
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/*
+		 * We can handle three basic cases:
+		 *
+		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
+		 * can simply combine the two MCVs, possibly with additional conditions
+		 * from the relations.
+		 *
+		 * b) Extended stats on one side, regular MCV on the other side (this
+		 * means there's just one join clause / expression). It also means the
+		 * extended stats likely covers at least one extra condition, otherwise
+		 * we could just use regular statistics. We can combine the stats just
+		 * similarly to (a).
+		 *
+		 * c) No extended stats with MCV. If there are multiple join clauses,
+		 * we can try using ndistinct coefficients and do what eqjoinsel does.
+		 *
+		 * If none of these applies, we fallback to the regular selectivity
+		 * estimation in eqjoinsel.
+		 */
+		if (stat1 && stat2)
+		{
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+		}
+		else if (stat1 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel2, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel1, stat1, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else if (stat2 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = (Node *) linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel1, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel2, stat2, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else
+			continue;
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index b0e9aead84..49299ed907 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -22,6 +22,8 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
+#include "optimizer/optimizer.h"
 #include "statistics/extended_stats_internal.h"
 #include "statistics/statistics.h"
 #include "utils/array.h"
@@ -2173,3 +2175,759 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	return s;
 }
+
+/*
+ * statext_compare_mcvs
+ *		Calculate join selectivity using extended statistics, similar to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing a
+ * conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column MCV
+ * lists it's obvious that the number of distinct values not covered by the MCV
+ * is (ndistinct - size(MCV)). With multi-column MCVs it's not that simple,
+ * particularly when the conditions are on a subset of the MCV attributes and/or
+ * NULLs are involved. E.g. with MCV (a,b,c) and conditions on (a,b), it's not
+ * clear if the number of (a,b) combinations not covered by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the part of the data not covered by the MCV. So we need
+ * to pick something in between, there's no way to calculate this accurately.
+ */
+Selectivity
+mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	ListCell   *lc;
+
+	MCVList    *mcv1,
+			   *mcv2;
+	int			idx,
+				i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	List   *conditions1 = NIL,
+		   *conditions2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
+
+	double	csel1 = 1.0,
+			csel2 = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two relations */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo   *opprocs;
+	int		   *indexes1,
+			   *indexes2;
+	bool	   *reverse;
+	RangeTblEntry *rte1 = root->simple_rte_array[rel1->relid];
+	RangeTblEntry *rte2 = root->simple_rte_array[rel2->relid];
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert((stat1->kind = STATS_EXT_MCV) && (stat2->kind = STATS_EXT_MCV));
+
+	mcv1 = statext_mcv_load(stat1->statOid, rte1->inh);
+	mcv2 = statext_mcv_load(stat2->statOid, rte2->inh);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
+	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions1)
+	{
+		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+										 stat1->keys, stat1->exprs,
+										 mcv1, false);
+		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+	}
+
+	if (conditions2)
+	{
+		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+										 stat2->keys, stat2->exprs,
+										 mcv2, false);
+		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/*
+	 * Initialize information about clauses and how they match to the MCV
+	 * stats we picked. We do this only once before processing the lists,
+	 * so that we don't have to do that for each MCV item or so.
+	 */
+	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
+	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
+	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
+	idx = 0;
+	foreach (lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if ((bms_singleton_member(relids1) == rel1->relid) &&
+			(bms_singleton_member(relids2) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			reverse[idx] = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if ((bms_singleton_member(relids2) == rel1->relid) &&
+				 (bms_singleton_member(relids1) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			reverse[idx] = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((indexes1[idx] >= 0) &&
+			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+
+		Assert((indexes2[idx] >= 0) &&
+			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		idx++;
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		bool	has_nulls;
+
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		/*
+		 * Check if any value in the first MCV item is NULL, because it'll be
+		 * mismatch anyway.
+		 *
+		 * XXX This might not work for some join clauses, e.g. IS NOT DISTINCT
+		 * FROM, but those are currently not considered compatible (we only
+		 * allow OpExpr at the moment).
+		 */
+		has_nulls = false;
+		for (j = 0; j < list_length(clauses); j++)
+			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+
+		if (has_nulls)
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			int			idx;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (cmatches2 && !cmatches2[j])
+				continue;
+
+			/*
+			 * XXX We can't skip based on existing matches2 value, because there
+			 * may be duplicates in the first MCV.
+			 */
+
+			/*
+			 * Evaluate if all the join clauses match between the two MCV items.
+			 *
+			 * XXX We might optimize the order of evaluation, using the costs of
+			 * operator functions for individual columns. It does depend on the
+			 * number of distinct values, etc.
+			 */
+			idx = 0;
+			foreach (lc, clauses)
+			{
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+
+				/* If either value is null, it's a mismatch */
+				if (mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * Careful about order of parameters. For same-type equality
+					 * that should not matter, but easy enough.
+					 *
+					 * FIXME Use appropriate collation.
+					 */
+					if (reverse_args)
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value2, value1));
+					else
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+
+				idx++;
+			}
+
+			if (items_match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches2 && !cmatches2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel1->rows, NULL, NULL);
+	nd2 = estimate_num_groups(root, exprs2, rel2->rows, NULL, NULL);
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel1;
+	nd2 *= csel2;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+
+/*
+ * statext_compare_simple
+ *		Calculate join selectivity using a combination of extended
+ * statistics MCV on one side, and simple per-column MCV on the other.
+ *
+ * Most of the mcv_combine_extended comment applies here too, but we can make
+ * some simplifications because we know the second (per-column) MCV is simpler,
+ * contains no NULL or duplicate values, etc.
+ */
+Selectivity
+mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
+				   AttStatsSlot *sslot, double stanullfrac, double nd,
+				   bool isdefault, Node *clause)
+{
+	MCVList    *mcv;
+	int			i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *conditions = NIL;
+	bool   *cmatches = NULL;
+
+	double	csel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two sides */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo	opproc;
+	int			index = 0;
+	bool		reverse = false;
+	RangeTblEntry *rte = root->simple_rte_array[rel->relid];
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert(stat->kind = STATS_EXT_MCV);
+
+	mcv = statext_mcv_load(stat->statOid, rte->inh);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions = statext_determine_join_restrictions(root, rel, stat);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions)
+	{
+		cmatches = mcv_get_match_bitmap(root, conditions,
+										 stat->keys, stat->exprs,
+										 mcv, false);
+		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
+
+	/* Matches for the side with just regular single-column MCV. */
+	matches2 = (bool *) palloc0(sizeof(bool) * sslot->nvalues);
+
+	/*
+	 * Initialize information about the clause and how it matches to the
+	 * extended stats we picked. We do this only once before processing
+	 * the lists, so that we don't have to do that for each item or so.
+	 */
+	{
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opproc);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if (bms_singleton_member(relids1) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
+										 &collid);
+			reverse = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if (bms_singleton_member(relids2) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
+										 &collid);
+			reverse = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((index >= 0) &&
+			   (index < bms_num_members(stat->keys) + list_length(stat->exprs)));
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		/*
+		 * We can check mcv1->items[i].isnull[index1] here, because it'll be a
+		 * mismatch anyway (the simple MCV does not contain NULLs).
+		 */
+		if (mcv->items[i].isnull[index])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < sslot->nvalues; j++)
+		{
+			bool	match;
+			Datum	value1 = mcv->items[i].values[index];
+			Datum	value2 = sslot->values[j];
+
+			/*
+			 * Evaluate the join clause between the two MCV lists. We don't
+			 * need to deal with NULL values here - we've already checked for
+			 * NULL in the extended statistics earlier, and the simple MCV
+			 * does not contain NULL values.
+			 *
+			 * Careful about order of parameters. For same-type equality
+			 * that should not matter, but easy enough.
+			 *
+			 * FIXME Use appropriate collation.
+			 */
+			if (reverse)
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value2, value1));
+			else
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value1, value2));
+
+			if (match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv->items[i].frequency * sslot->numbers[j];
+
+				/*
+				 * We know there can be just a single match in the regular
+				 * MCV list, so we can abort the inner loop.
+				 */
+				break;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		mcvfreq1 += mcv->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv->items[i].frequency;
+		else
+			unmatchfreq1 += mcv->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < sslot->nvalues; i++)
+	{
+		mcvfreq2 += sslot->numbers[i];
+
+		if (matches2[i])
+			matchfreq2 += sslot->numbers[i];
+		else
+			unmatchfreq2 += sslot->numbers[i];
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel->rows, NULL, NULL);
+	nd2 = nd;
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 8eed9b338d..a85f896d53 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,6 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
+#include "utils/lsyscache.h"
 #include "utils/sortsupport.h"
 
 typedef struct
@@ -127,4 +128,23 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern Selectivity mcv_combine_simple(PlannerInfo *root,
+									  RelOptInfo *rel,
+									  StatisticExtInfo *stat,
+									  AttStatsSlot *sslot,
+									  double stanullfrac,
+									  double nd, bool isdefault,
+									  Node *clause);
+
+extern Selectivity mcv_combine_extended(PlannerInfo *root,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										StatisticExtInfo *stat1,
+										StatisticExtInfo *stat2,
+										List *clauses);
+
+extern List *statext_determine_join_restrictions(PlannerInfo *root,
+												 RelOptInfo *rel,
+												 StatisticExtInfo *info);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..60b222028d 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -127,4 +127,16 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
+extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
+
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 10903bdab0..95246522bb 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3074,6 +3074,173 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+       250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+        75 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+      1250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+      1000 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+    100000 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+     30000 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 42cb7dd97d..c7023620a1 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1547,6 +1547,72 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.34.1

v1-0002-Remove-estimiatedcluases-and-varRelid-arguments.patchtext/x-diffDownload

From c42c404c84e001c7c7f39a1b5afaaeec3bd5912d Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 09:39:17 +0800
Subject: [PATCH v1 2/8] Remove estimiatedcluases and varRelid arguments

comments and Assert around the changes provides more information.
---
 src/backend/optimizer/path/clausesel.c  | 16 ++++++++++------
 src/backend/statistics/extended_stats.c | 24 ++++++++++--------------
 src/include/statistics/statistics.h     |  4 +---
 3 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index bedf76edae..e1683febf6 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -204,14 +204,18 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * do to detect when this makes sense, but we can check that there are
 	 * join clauses, and that at least some of the rels have stats.
 	 *
-	 * XXX Isn't this mutually exclusive with the preceding block which
-	 * calculates estimates for a single relation?
+	 * rel != NULL can't grantee the clause is not a join clause, for example
+	 * t1 left join t2 ON t1.a = 3, but it can grantee we can't use extended
+	 * statistics for estimation since it has only 1 relid.
+	 *
+	 * XXX: so we can grantee estimatedclauses == NULL now, so estimatedclauses
+	 * in statext_try_join_estimates is removed.
 	 */
-	if (use_extended_stats &&
-		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
-						 estimatedclauses))
+	if (use_extended_stats && rel == NULL &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
 	{
-		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+		Assert(varRelid == 0);
+		s1 *= statext_clauselist_join_selectivity(root, clauses,
 												  jointype, sjinfo,
 												  &estimatedclauses);
 	}
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 183a8af07b..519c367dee 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2804,8 +2804,7 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
  * on the conditions.
  */
 static bool
-statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
-								 int varRelid, SpecialJoinInfo *sjinfo)
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInfo *sjinfo)
 {
 	Oid	oprsel;
 	RestrictInfo   *rinfo;
@@ -2817,7 +2816,9 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
 	 *
 	 * XXX See treat_as_join_clause.
 	 */
-	if ((varRelid != 0) || (sjinfo == NULL))
+
+	/* duplicated with statext_try_join_estimates */
+	if (sjinfo == NULL)
 		return false;
 
 	/* XXX Can we rely on always getting RestrictInfo here? */
@@ -2901,8 +2902,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
  */
 bool
 statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
-						   JoinType jointype, SpecialJoinInfo *sjinfo,
-						   Bitmapset *estimatedclauses)
+						   JoinType jointype, SpecialJoinInfo *sjinfo)
 {
 	int			listidx;
 	int			k;
@@ -2939,15 +2939,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 		/* needs to happen before skipping any clauses */
 		listidx++;
 
-		/* Skip clauses that were already estimated. */
-		if (bms_is_member(listidx, estimatedclauses))
-			continue;
-
 		/*
 		 * Skip clauses that are not join clauses or that we don't know
 		 * how to handle estimate using extended statistics.
 		 */
-		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause, sjinfo))
 			continue;
 
 		/*
@@ -3017,7 +3013,7 @@ typedef struct JoinPairInfo
  * with F_EQJOINSEL selectivity function at the moment).
  */
 static JoinPairInfo *
-statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+statext_build_join_pairs(PlannerInfo *root, List *clauses,
 						 JoinType jointype, SpecialJoinInfo *sjinfo,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
@@ -3053,7 +3049,7 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
 		 * the moment we support just (Expr op Expr) clauses with each
 		 * side referencing just a single relation).
 		 */
-		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause, sjinfo))
 			continue;
 
 		/* statext_is_supported_join_clause guarantees RestrictInfo */
@@ -3241,7 +3237,7 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  * statistics in that case yet).
  */
 Selectivity
-statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 									JoinType jointype, SpecialJoinInfo *sjinfo,
 									Bitmapset **estimatedclauses)
 {
@@ -3256,7 +3252,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+	info = statext_build_join_pairs(root, clauses, jointype, sjinfo,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 60b222028d..4f70034983 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -131,11 +131,9 @@ extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo
 										   Bitmapset *attnums, List *exprs);
 
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
-									   JoinType jointype, SpecialJoinInfo *sjinfo,
-									   Bitmapset *estimatedclauses);
+									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
 extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   int varRelid,
 													   JoinType jointype, SpecialJoinInfo *sjinfo,
 													   Bitmapset **estimatedclauses);
 
-- 
2.34.1

v1-0003-Remove-SpecialJoinInfo-sjinfo-argument.patchtext/x-diffDownload

From 30105dfd04e7817413766c1910797129f1ae5b30 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 09:58:18 +0800
Subject: [PATCH v1 3/8] Remove SpecialJoinInfo *sjinfo argument

It was passed down to statext_is_supported_join_clause where it is
used for checking if it is NULL.  However it has been checked before
in statext_try_join_estimates.
---
 src/backend/optimizer/path/clausesel.c  |  3 ++-
 src/backend/statistics/extended_stats.c | 16 ++++++----------
 src/include/statistics/statistics.h     |  3 +--
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index e1683febf6..ca550e6c0c 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -215,8 +215,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
 	{
 		Assert(varRelid == 0);
+		Assert(sjinfo != NULL);
 		s1 *= statext_clauselist_join_selectivity(root, clauses,
-												  jointype, sjinfo,
+												  jointype,
 												  &estimatedclauses);
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 519c367dee..516428873e 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2804,7 +2804,7 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
  * on the conditions.
  */
 static bool
-statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInfo *sjinfo)
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 {
 	Oid	oprsel;
 	RestrictInfo   *rinfo;
@@ -2817,10 +2817,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInf
 	 * XXX See treat_as_join_clause.
 	 */
 
-	/* duplicated with statext_try_join_estimates */
-	if (sjinfo == NULL)
-		return false;
-
 	/* XXX Can we rely on always getting RestrictInfo here? */
 	if (!IsA(clause, RestrictInfo))
 		return false;
@@ -2943,7 +2939,7 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 		 * Skip clauses that are not join clauses or that we don't know
 		 * how to handle estimate using extended statistics.
 		 */
-		if (!statext_is_supported_join_clause(root, clause, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause))
 			continue;
 
 		/*
@@ -3014,7 +3010,7 @@ typedef struct JoinPairInfo
  */
 static JoinPairInfo *
 statext_build_join_pairs(PlannerInfo *root, List *clauses,
-						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 JoinType jointype,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
 	int				cnt;
@@ -3049,7 +3045,7 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 		 * the moment we support just (Expr op Expr) clauses with each
 		 * side referencing just a single relation).
 		 */
-		if (!statext_is_supported_join_clause(root, clause, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause))
 			continue;
 
 		/* statext_is_supported_join_clause guarantees RestrictInfo */
@@ -3238,7 +3234,7 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-									JoinType jointype, SpecialJoinInfo *sjinfo,
+									JoinType jointype,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
@@ -3252,7 +3248,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, jointype, sjinfo,
+	info = statext_build_join_pairs(root, clauses, jointype,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 4f70034983..28d9e72e54 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -134,7 +134,6 @@ extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int var
 									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
 extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   JoinType jointype, SpecialJoinInfo *sjinfo,
-													   Bitmapset **estimatedclauses);
+													   JoinType jointype, Bitmapset **estimatedclauses);
 
 #endif							/* STATISTICS_H */
-- 
2.34.1

v1-0004-Remove-joinType-argument.patchtext/x-diffDownload

From a208eb82a399b7c10d798e4005b075060d7acff1 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 10:07:00 +0800
Subject: [PATCH v1 4/8] Remove joinType argument.

---
 src/backend/optimizer/path/clausesel.c  | 1 -
 src/backend/statistics/extended_stats.c | 4 +---
 src/include/statistics/statistics.h     | 3 +--
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index ca550e6c0c..50210ec2ca 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -217,7 +217,6 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		Assert(varRelid == 0);
 		Assert(sjinfo != NULL);
 		s1 *= statext_clauselist_join_selectivity(root, clauses,
-												  jointype,
 												  &estimatedclauses);
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 516428873e..4e6f604273 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3010,7 +3010,6 @@ typedef struct JoinPairInfo
  */
 static JoinPairInfo *
 statext_build_join_pairs(PlannerInfo *root, List *clauses,
-						 JoinType jointype,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
 	int				cnt;
@@ -3234,7 +3233,6 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-									JoinType jointype,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
@@ -3248,7 +3246,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, jointype,
+	info = statext_build_join_pairs(root, clauses,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 28d9e72e54..97a217af1e 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -133,7 +133,6 @@ extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
-extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   JoinType jointype, Bitmapset **estimatedclauses);
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, Bitmapset **estimatedclauses);
 
 #endif							/* STATISTICS_H */
-- 
2.34.1

v1-0005-use-the-pre-calculated-RestrictInfo-left-right_re.patchtext/x-diffDownload

From 92e03900de9015ce30bf7bced45a1070c0380957 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 10:39:03 +0800
Subject: [PATCH v1 5/8] use the pre-calculated RestrictInfo->left|right_relids

It should has better performance than pull_varnos and easier to
understand.
---
 src/backend/statistics/extended_stats.c | 35 +++++--------------------
 1 file changed, 6 insertions(+), 29 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 4e6f604273..0deb0c3c55 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2809,7 +2809,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	Oid	oprsel;
 	RestrictInfo   *rinfo;
 	OpExpr		   *opclause;
-	ListCell	   *lc;
+	int				left_relid, right_relid;
 
 	/*
 	 * evaluation as a restriction clause, either at scan node or forced
@@ -2825,10 +2825,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	rinfo = (RestrictInfo *) clause;
 	clause = (Node *) rinfo->clause;
 
-	/* is it referencing multiple relations? */
-	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
-		return false;
-
 	/* we only support simple operator clauses for now */
 	if (!is_opclause(clause))
 		return false;
@@ -2851,8 +2847,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	 * which is still technically an opclause, but we can't match it to
 	 * extended statistics in a simple way.
 	 *
-	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
-	 *
 	 * XXX Also check it's not expression on system attributes, which we
 	 * don't allow in extended statistics.
 	 *
@@ -2861,30 +2855,13 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	 * or something like that. We could do "cartesian product" of the MCV
 	 * stats and restrict it using this condition.
 	 */
-	foreach (lc, opclause->args)
-	{
-		Bitmapset *varnos = NULL;
-		Node *expr = (Node *) lfirst(lc);
-
-		varnos = pull_varnos(root, expr);
 
-		/*
-		 * No argument should reference more than just one relation.
-		 *
-		 * This effectively means each side references just two relations.
-		 * If there's no relation on one side, it's a Const, and the other
-		 * side has to be either Const or Expr with a single rel. In which
-		 * case it can't be a join clause.
-		 */
-		if (bms_num_members(varnos) > 1)
-			return false;
+	if (!bms_get_singleton_member(rinfo->left_relids, &left_relid) ||
+		!bms_get_singleton_member(rinfo->right_relids, &right_relid))
+		return false;
 
-		/*
-		 * XXX Maybe check that both relations have extended statistics
-		 * (no point in considering the clause as useful without it). But
-		 * we'll do that check later anyway, so keep this cheap.
-		 */
-	}
+	if (left_relid == right_relid)
+		return false;
 
 	return true;
 }
-- 
2.34.1

v1-0006-Fast-path-for-general-clauselist_selectivity.patchtext/x-diffDownload

From 245021fa69be1d78a567e8c7348a41695d0a315a Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 14:47:54 +0800
Subject: [PATCH v1 6/8] Fast path for general clauselist_selectivity

It should be common in the most queries like

SELECT * FROM t1, t2 WHERE t1.a = t2.a AND t1.a > 3;

clauses == NULL at the scan level of t2.
---
 src/backend/optimizer/path/clausesel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 50210ec2ca..c4f5fae9d7 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -132,6 +132,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	int			listidx;
 	bool		single_clause_optimization = true;
 
+	if (clauses == NULL)
+		return 1.0;
+
 	/*
 	 * The optimization of skipping to clause_selectivity_ext for single
 	 * clauses means we can't improve join estimates with a single join
-- 
2.34.1

v1-0007-bms_is_empty-is-more-effective-than-bms_num_membe.patchtext/x-diffDownload

From 849a1078161cc14495f9b25f0afd31f433d91f41 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 14:53:30 +0800
Subject: [PATCH v1 7/8] bms_is_empty is more effective than bms_num_members(b)
 == 0.

---
 src/backend/statistics/extended_stats.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 0deb0c3c55..109fe5a04a 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2930,7 +2930,7 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 	}
 
 	/* no join clauses found, don't try applying extended stats */
-	if (bms_num_members(relids) == 0)
+	if (bms_is_empty(relids))
 		return false;
 
 	/*
-- 
2.34.1

v1-0008-a-branch-of-updates-around-JoinPairInfo.patchtext/x-diffDownload

From 95db6c962ff7ab29987379748d6c61c6aa799d7c Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 16:01:05 +0800
Subject: [PATCH v1 8/8] a branch of updates around JoinPairInfo

1. rename rels to relids while the "rels" may reference to list of
RelOptInfo or Relids. but the later one reference to Relids all the
time.

2. Store RestrictInfo to JoinPairInfo.clauses so that we can reuse
the left_relids, right_relids which will save us from calling
pull_varnos.

3. create bms_nth_member function in bitmapset.c and use it
extract_relation_info, the function name is self-documented.

4. pfree the JoinPairInfo array when we are done with that.
---
 src/backend/nodes/bitmapset.c           | 18 ++++++++++++
 src/backend/statistics/extended_stats.c | 37 ++++++++++++-------------
 src/backend/statistics/mcv.c            | 34 ++++++-----------------
 src/include/nodes/bitmapset.h           |  1 +
 4 files changed, 44 insertions(+), 46 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index cd05c642b0..7c1291ae64 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -772,6 +772,24 @@ bms_num_members(const Bitmapset *a)
 	return result;
 }
 
+/*
+ * bms_nth_member - return the nth member, index starts with 0.
+ */
+int
+bms_nth_member(const Bitmapset *a, int i)
+{
+	int idx, res = -1;
+
+	for (idx = 0; idx <= i; idx++)
+	{
+		res = bms_next_member(a, res);
+
+		if (res < 0)
+			elog(ERROR, "no enough members for %d", i);
+	}
+	return res;
+}
+
 /*
  * bms_membership - does a set have zero, one, or multiple members?
  *
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 109fe5a04a..8d17fd91f0 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2965,11 +2965,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 }
 
 /*
- * Information about two joined relations, along with the join clauses between.
+ * Information about two joined relations, group by clauses by relids.
  */
 typedef struct JoinPairInfo
 {
-	Bitmapset  *rels;
+	Bitmapset  *relids;
 	List	   *clauses;
 } JoinPairInfo;
 
@@ -3032,9 +3032,9 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 		found = false;
 		for (i = 0; i < cnt; i++)
 		{
-			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			if (bms_is_subset(rinfo->clause_relids, info[i].relids))
 			{
-				info[i].clauses = lappend(info[i].clauses, clause);
+				info[i].clauses = lappend(info[i].clauses, rinfo);
 				found = true;
 				break;
 			}
@@ -3042,14 +3042,17 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 
 		if (!found)
 		{
-			info[cnt].rels = rinfo->clause_relids;
-			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			info[cnt].relids = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, rinfo);
 			cnt++;
 		}
 	}
 
 	if (cnt == 0)
+	{
+		pfree(info);
 		return NULL;
+	}
 
 	*npairs = cnt;
 	return info;
@@ -3069,7 +3072,6 @@ static RelOptInfo *
 extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 					  StatisticExtInfo **stat)
 {
-	int			k;
 	int			relid;
 	RelOptInfo *rel;
 	ListCell   *lc;
@@ -3079,16 +3081,7 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 
 	Assert((index >= 0) && (index <= 1));
 
-	k = -1;
-	while (index >= 0)
-	{
-		k = bms_next_member(info->rels, k);
-		if (k < 0)
-			elog(ERROR, "failed to extract relid");
-
-		relid = k;
-		index--;
-	}
+	relid = bms_nth_member(info->relids, index);
 
 	rel = find_base_rel(root, relid);
 
@@ -3100,7 +3093,8 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 	foreach (lc, info->clauses)
 	{
 		ListCell *lc2;
-		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Node *clause = (Node *) rinfo->clause;
 		OpExpr *opclause = (OpExpr *) clause;
 
 		/* only opclauses supported for now */
@@ -3140,7 +3134,8 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 			 * compatible because we already checked it when building the
 			 * join pairs.
 			 */
-			varnos = pull_varnos(root, arg);
+			varnos = list_cell_number(opclause->args, lc2) == 0 ?
+				rinfo->left_relids : rinfo->right_relids;
 
 			if (relid == bms_singleton_member(varnos))
 				exprs = lappend(exprs, arg);
@@ -3381,7 +3376,8 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		 */
 		foreach (lc, info->clauses)
 		{
-			Node *clause = (Node *) lfirst(lc);
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+			Node *clause = (Node *) rinfo->clause;
 			ListCell *lc2;
 
 			listidx = -1;
@@ -3403,5 +3399,6 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		}
 	}
 
+	pfree(info);
 	return s;
 }
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 49299ed907..53b481a291 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2214,8 +2214,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	MCVList    *mcv1,
 			   *mcv2;
-	int			idx,
-				i,
+	int			i,
 				j;
 	Selectivity s = 0;
 
@@ -2306,25 +2305,14 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
 
-	idx = 0;
 	foreach (lc, clauses)
 	{
-		Node	   *clause = (Node *) lfirst(lc);
+		RestrictInfo	*rinfo = (RestrictInfo *) lfirst(lc);
+		Node	   *clause = (Node *) rinfo->clause;
 		OpExpr	   *opexpr;
 		Node	   *expr1,
 				   *expr2;
-		Bitmapset  *relids1,
-				   *relids2;
-
-		/*
-		 * Strip the RestrictInfo node, get the actual clause.
-		 *
-		 * XXX Not sure if we need to care about removing other node types
-		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
-		 * matches this, but maybe we need to relax it?
-		 */
-		if (IsA(clause, RestrictInfo))
-			clause = (Node *) ((RestrictInfo *) clause)->clause;
+		int		idx = list_cell_number(clauses, lc);
 
 		opexpr = (OpExpr *) clause;
 
@@ -2338,12 +2326,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
 
-		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
-		relids1 = pull_varnos(root, expr1);
-		relids2 = pull_varnos(root, expr2);
-
-		if ((bms_singleton_member(relids1) == rel1->relid) &&
-			(bms_singleton_member(relids2) == rel2->relid))
+		if ((bms_singleton_member(rinfo->left_relids) == rel1->relid) &&
+			(bms_singleton_member(rinfo->right_relids) == rel2->relid))
 		{
 			Oid		collid;
 
@@ -2358,8 +2342,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
 		}
-		else if ((bms_singleton_member(relids2) == rel1->relid) &&
-				 (bms_singleton_member(relids1) == rel2->relid))
+		else if ((bms_singleton_member(rinfo->right_relids) == rel1->relid) &&
+				 (bms_singleton_member(rinfo->left_relids) == rel2->relid))
 		{
 			Oid		collid;
 
@@ -2383,8 +2367,6 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		Assert((indexes2[idx] >= 0) &&
 			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
-
-		idx++;
 	}
 
 	/*
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 283bea5ea9..8d32e7a244 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -110,6 +110,7 @@ extern bool bms_nonempty_difference(const Bitmapset *a, const Bitmapset *b);
 extern int	bms_singleton_member(const Bitmapset *a);
 extern bool bms_get_singleton_member(const Bitmapset *a, int *member);
 extern int	bms_num_members(const Bitmapset *a);
+extern int  bms_nth_member(const Bitmapset *a, int i);
 
 /* optimized tests when we don't need to know exact membership count: */
 extern BMS_Membership bms_membership(const Bitmapset *a);
-- 
2.34.1

#17

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 2 years ago

In reply to: Andy Fan (#16)

Re: using extended statistics to improve join estimates

On 4/2/24 10:23, Andy Fan wrote:

On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:

Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the path at the code level, I want
to explain my understanding about this patch first.

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

Before this patch, we already use MCV information for the eqjoinsel, it
works as combine the MCV on the both sides to figure out the mcv_freq
and then treat the rest equally, but this doesn't work for MCV in
extended statistics, this patch fill this gap. Besides that, since
extended statistics means more than 1 columns are involved, if 1+
columns are Const based on RestrictInfo, we can use such information to
filter the MCVs we are interesting, that's really cool.

Yes, I think that's an accurate description of what the patch does.

I did some more testing, all of them are inner join so far, all of them
works amazing and I am suprised this patch didn't draw enough
attention. I will test more after I go though the code.

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and
forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible
solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.

As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've been
thinking mostly about "start join" queries, where a big "fact" table
joins to small dimensions. And in that case the fact table may have a
MCV, but the dimensions certainly don't have any (because the join
happens on a PK).

But maybe that's a wrong way to think about it - it was clearly useful
to consider the case with (per-attribute) MCVs on both sides as worth
special handling. So why not to do that for multi-column MCVs, right?

At for the code level, I reviewed them in the top-down manner and almost
40% completed. Here are some findings just FYI. For efficiency purpose,
I provide each feedback with a individual commit, after all I want to
make sure my comment is practical and coding and testing is a good way
to archive that. I tried to make each of them as small as possible so
that you can reject or accept them convinently.

0001 is your patch, I just rebase them against the current master. 0006
is not much relevant with current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18

Andy Fan

zhihuifan1213@163.com

almost 2 years ago

In reply to: Tomas Vondra (#17)

Re: using extended statistics to improve join estimates

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 4/2/24 10:23, Andy Fan wrote:

On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:

Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the path at the code level, I want
to explain my understanding about this patch first.

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

OK, I'd try to moving it forward.

Before this patch, we already use MCV information for the eqjoinsel, it
works as combine the MCV on the both sides to figure out the mcv_freq
and then treat the rest equally, but this doesn't work for MCV in
extended statistics, this patch fill this gap. Besides that, since
extended statistics means more than 1 columns are involved, if 1+
columns are Const based on RestrictInfo, we can use such information to
filter the MCVs we are interesting, that's really cool.

Yes, I think that's an accurate description of what the patch does.

Great to know that:)

I did some more testing, all of them are inner join so far, all of them
works amazing and I am suprised this patch didn't draw enough
attention.

I think it didn't go forward for a bunch of reasons:

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.

As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've been
thinking mostly about "start join" queries, where a big "fact" table
joins to small dimensions. And in that case the fact table may have a
MCV, but the dimensions certainly don't have any (because the join
happens on a PK).

But maybe that's a wrong way to think about it - it was clearly useful
to consider the case with (per-attribute) MCVs on both sides as worth
special handling. So why not to do that for multi-column MCVs, right?

Yes, that's what my current understanding is.

There are some cases where there are 2+ clauses between two tables AND
the rows estimiation is bad AND the plan is not the best one. In such
sisuations, I'd think this patch probably be helpful. The current case
in hand is PG11, there is no MCV information for extended statistics, so
I even can't verify the patch here is useful or not manually. When I see
them next time in a newer version of PG, I can verity it manually to see
if the rows estimation can be better.

At for the code level, I reviewed them in the top-down manner and almost
40% completed. Here are some findings just FYI. For efficiency purpose,
I provide each feedback with a individual commit, after all I want to
make sure my comment is practical and coding and testing is a good way
to archive that. I tried to make each of them as small as possible so
that you can reject or accept them convinently.

0001 is your patch, I just rebase them against the current master. 0006
is not much relevant with current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before that.

--
Best Regards
Andy Fan

#19

Justin Pryzby

pryzby@telsasoft.com

almost 2 years ago

In reply to: Andy Fan (#16)

Re: using extended statistics to improve join estimates

On Tue, Apr 02, 2024 at 04:23:45PM +0800, Andy Fan wrote:

0001 is your patch, I just rebase them against the current master. 0006
is not much relevant with current patch, and I think it can be committed
individually if you are OK with that.

Your 002 should also remove listidx to avoid warning
../src/backend/statistics/extended_stats.c:2879:8: error: variable 'listidx' set but not used [-Werror,-Wunused-but-set-variable]

Subject: [PATCH v1 2/8] Remove estimiatedcluases and varRelid arguments

@@ -2939,15 +2939,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
/* needs to happen before skipping any clauses */
listidx++;

- /* Skip clauses that were already estimated. */
- if (bms_is_member(listidx, estimatedclauses))
- continue;
-

Your 007 could instead test if relids == NULL:

Subject: [PATCH v1 7/8] bms_is_empty is more effective than bms_num_members(b)
-       if (bms_num_members(relids) == 0)
+       if (bms_is_empty(relids))

typos:
001: s/heuristict/heuristics/
002: s/grantee/guarantee/
002: s/estimiatedcluases/estimatedclauses/

It'd be nice to fix/silence these warnings from 001:

|../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be used uninitialized [-Wmaybe-uninitialized]
| 3151 | if (var->varno != relid)
| | ^
|../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was declared here
| 3104 | int relid;
| | ^~~~~
|[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
|../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
|../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ shadows a previous local [-Wshadow=compatible-local]

FYI, I also ran the patch with a $large number of reports without
observing any errors or crashes.

I'll try to look harder at the next patch revision.

--
Justin

#20

Andy Fan

zhihuifan1213@163.com

over 1 year ago

In reply to: Andy Fan (#18)

1 attachment(s)

Re: using extended statistics to improve join estimates

Hello Tomas!

At for the code level, I reviewed them in the top-down manner and almost
40% completed. Here are some findings just FYI. For efficiency purpose,
I provide each feedback with a individual commit, after all I want to
make sure my comment is practical and coding and testing is a good way
to archive that. I tried to make each of them as small as possible so
that you can reject or accept them convinently.

0001 is your patch, I just rebase them against the current master. 0006
is not much relevant with current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before that.

I have completed my code level review and modification. These individual
commits and message probably be helpful for discussion.

--
Best Regards
Andy Fan

Attachments:

ext_stats_on_join.tarapplication/x-tarDownload

v1-0001-Estimate-joins-using-extended-statistics.patch0000644000175000017500000020713514613326575023457 0ustar  yizhi.fzhyizhi.fzhFrom 20d9524a2e33505d1bd0d851a31058dbee6005c1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH v1 01/22] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++++++++++++++++++
 src/backend/statistics/mcv.c                  | 758 +++++++++++++++++
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 167 ++++
 src/test/regress/sql/stats_ext.sql            |  66 ++
 7 files changed, 1890 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..bedf76edae 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -48,6 +48,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -127,12 +130,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
@@ -155,6 +199,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 99fdf208db..80872cc7da 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -29,6 +29,7 @@
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "parser/parsetree.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -100,6 +101,8 @@ static StatsBuildData *make_build_data(Relation rel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -2635,3 +2638,805 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_find_matching_mcv
+ *		Search for a MCV covering all the attributes and expressions.
+ *
+ * We pick the statistics to use for join estimation. The statistics object has
+ * to have MCV, and we require it to match all the join conditions, because it
+ * makes the estimation simpler.
+ *
+ * If there are multiple candidate statistics objects (matching all join clauses),
+ * we pick the smallest one, and we also consider additional conditions on
+ * the base relations to restrict the MCV items used for estimation (using
+ * conditional probability).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. We could relax this and and require fewer matches (at least two,
+ * if counting the additional conditions), and we might even apply multiple
+ * statistics etc. But that would require matching statistics on both sides of
+ * the join, while now we simply know the statistics match. We don't really
+ * expect many candidate MCVs, so this simple approach seems sufficient. And
+ * the joins usually use only one or two columns, so there's not much room
+ * for applying multiple statistics anyway.
+ */
+StatisticExtInfo *
+statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+						  Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep this one. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics objects and we need to decide
+		 * which one to keep. We'll use two simple heuristics:
+		 *
+		 * (a) We prefer smaller statistics (fewer columns), on the assumption
+		 * that it represents a larger fraction of the data (due to having fewer
+		 * combinations with higher counts).
+		 *
+		 * (b) If the statistics object covers some additional conditions for the rels,
+		 * that may help with considering additional dependencies between the
+		 * tables.
+		 *
+		 * Of course, those two heuristict are somewhat contradictory - smaller
+		 * stats are less likely to cover as many conditions as a larger one. We
+		 * consider the additional conditions first - if someone created such
+		 * statistics, there probably is a dependency worth considering.
+		 *
+		 * When inspecting the restrictions, we need to be careful - we don't
+		 * know which of them are compatible with extended stats, so we have to
+		 * run them through statext_is_compatible_clause first and then match
+		 * them to the statistics.
+		 *
+		 * XXX Maybe we shouldn't pick statistics that covers just a single join
+		 * clause, without any additional conditions. In such case we could just
+		 * as well pick regular statistics for the column/expression, but it's
+		 * not clear if that actually exists (so we might reject the stats here
+		 * and then fail to find something simpler/better).
+		 */
+		conditions1 = statext_determine_join_restrictions(root, rel, stat);
+		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+		/* if the new statistics object covers more conditions, use it */
+		if (list_length(conditions1) > list_length(conditions2))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/* The statistics seem about equal, so just use the smaller one. */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+		}
+	}
+
+	return mcv;
+}
+
+/*
+ * statext_determine_join_restrictions
+ *		Get restrictions on base relation, covered by the statistics object.
+ *
+ * Returns a list of baserel restrictinfos, compatible with extended statistics
+ * and covered by the extended statistics object.
+ *
+ * When using extended statistics to estimate joins, we can use conditions
+ * from base relations to calculate conditional probability
+ *
+ *    P(join clauses | baserel restrictions)
+ *
+ * which should be a better estimate than just P(join clauses). We want to pick
+ * the statistics object covering the most such conditions.
+ */
+List *
+statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
+									StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this particular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which may be
+ * estimated using extended statistics. Each side must reference just a single
+ * relation for now.
+ *
+ * Similar to treat_as_join_clause, but we place additional restrictions
+ * on the conditions.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	/* has to be an equality condition */
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/*
+	 * Check if there are any unestimated join clauses, collect relids.
+	 *
+	 * XXX Currently this only allows simple OpExpr equality clauses with each
+	 * argument referring to single relation, AND-ed together. Maybe we could
+	 * relax this in the future, e.g. to allow more complex (deeper) expressions
+	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 *
+	 * Handling more complex expressions seems simple - we already do that for
+	 * baserel estimates by building the match bitmap recursively, and we could
+	 * do something similar for combinations of MCV items (a bit like building
+	 * a single bit in the match bitmap). The challenge is what to do about the
+	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 */
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+
+		/* needs to happen before skipping any clauses */
+		listidx++;
+
+		/* Skip clauses that were already estimated. */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+
+		/* Collect relids from all usable clauses. */
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check how
+	 * compatible they are (e.g. that both have MCVs, etc.). We might also
+	 * cross-check the exact joined pairs of rels, but it's supposed to be a
+	 * cheap check, so maybe better leave that for later.
+	 *
+	 * XXX We could also check the number of parameters in each rel to consider
+	 * extended stats. If there's just a single attribute, it's pointless to use
+	 * extended statistics. OTOH we can also consider restriction clauses from
+	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about two joined relations, along with the join clauses between.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc).
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation).
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics object covering the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int			k;
+	int			relid;
+	RelOptInfo *rel;
+	ListCell   *lc;
+	List	   *exprs = NIL;
+
+	Bitmapset  *attnums = NULL;
+
+	Assert((index >= 0) && (index <= 1));
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * get_expression_for_rel
+ *		Extract expression for a given relation from the join clause.
+ *
+ * Given a join clause supported by the extended statistics object (currently
+ * that means just OpExpr clauses with each argument referencing single rel),
+ * return either the left or right argument expression for the rel.
+ *
+ * XXX This should probably return a flag identifying whether it's the
+ * left or right argument.
+ */
+static Node *
+get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
+{
+	OpExpr *opexpr;
+	Node   *expr;
+
+	/*
+	 * Strip the RestrictInfo node, get the actual clause.
+	 *
+	 * XXX Not sure if we need to care about removing other node types
+	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+	 * matches this, but maybe we need to relax it?
+	 */
+	if (IsA(clause, RestrictInfo))
+		clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+	opexpr = (OpExpr *) clause;
+
+	/* Make sure we have the expected node type. */
+	Assert(is_opclause(clause));
+	Assert(list_length(opexpr->args) == 2);
+
+	/* FIXME strip relabel etc. the way examine_opclause_args does */
+	expr = linitial(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	expr = lsecond(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	return NULL;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * a better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX We might also handle cases with a regular MCV on one side and
+	 * an extended MCV on the other side.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		ListCell *lc;
+
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/*
+		 * We can handle three basic cases:
+		 *
+		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
+		 * can simply combine the two MCVs, possibly with additional conditions
+		 * from the relations.
+		 *
+		 * b) Extended stats on one side, regular MCV on the other side (this
+		 * means there's just one join clause / expression). It also means the
+		 * extended stats likely covers at least one extra condition, otherwise
+		 * we could just use regular statistics. We can combine the stats just
+		 * similarly to (a).
+		 *
+		 * c) No extended stats with MCV. If there are multiple join clauses,
+		 * we can try using ndistinct coefficients and do what eqjoinsel does.
+		 *
+		 * If none of these applies, we fallback to the regular selectivity
+		 * estimation in eqjoinsel.
+		 */
+		if (stat1 && stat2)
+		{
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+		}
+		else if (stat1 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel2, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel1, stat1, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else if (stat2 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = (Node *) linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel1, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel2, stat2, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else
+			continue;
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index b0e9aead84..49299ed907 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -22,6 +22,8 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
+#include "optimizer/optimizer.h"
 #include "statistics/extended_stats_internal.h"
 #include "statistics/statistics.h"
 #include "utils/array.h"
@@ -2173,3 +2175,759 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	return s;
 }
+
+/*
+ * statext_compare_mcvs
+ *		Calculate join selectivity using extended statistics, similar to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing a
+ * conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column MCV
+ * lists it's obvious that the number of distinct values not covered by the MCV
+ * is (ndistinct - size(MCV)). With multi-column MCVs it's not that simple,
+ * particularly when the conditions are on a subset of the MCV attributes and/or
+ * NULLs are involved. E.g. with MCV (a,b,c) and conditions on (a,b), it's not
+ * clear if the number of (a,b) combinations not covered by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the part of the data not covered by the MCV. So we need
+ * to pick something in between, there's no way to calculate this accurately.
+ */
+Selectivity
+mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	ListCell   *lc;
+
+	MCVList    *mcv1,
+			   *mcv2;
+	int			idx,
+				i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	List   *conditions1 = NIL,
+		   *conditions2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
+
+	double	csel1 = 1.0,
+			csel2 = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two relations */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo   *opprocs;
+	int		   *indexes1,
+			   *indexes2;
+	bool	   *reverse;
+	RangeTblEntry *rte1 = root->simple_rte_array[rel1->relid];
+	RangeTblEntry *rte2 = root->simple_rte_array[rel2->relid];
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert((stat1->kind = STATS_EXT_MCV) && (stat2->kind = STATS_EXT_MCV));
+
+	mcv1 = statext_mcv_load(stat1->statOid, rte1->inh);
+	mcv2 = statext_mcv_load(stat2->statOid, rte2->inh);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
+	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions1)
+	{
+		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+										 stat1->keys, stat1->exprs,
+										 mcv1, false);
+		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+	}
+
+	if (conditions2)
+	{
+		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+										 stat2->keys, stat2->exprs,
+										 mcv2, false);
+		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/*
+	 * Initialize information about clauses and how they match to the MCV
+	 * stats we picked. We do this only once before processing the lists,
+	 * so that we don't have to do that for each MCV item or so.
+	 */
+	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
+	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
+	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
+	idx = 0;
+	foreach (lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if ((bms_singleton_member(relids1) == rel1->relid) &&
+			(bms_singleton_member(relids2) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			reverse[idx] = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if ((bms_singleton_member(relids2) == rel1->relid) &&
+				 (bms_singleton_member(relids1) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			reverse[idx] = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((indexes1[idx] >= 0) &&
+			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+
+		Assert((indexes2[idx] >= 0) &&
+			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		idx++;
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		bool	has_nulls;
+
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		/*
+		 * Check if any value in the first MCV item is NULL, because it'll be
+		 * mismatch anyway.
+		 *
+		 * XXX This might not work for some join clauses, e.g. IS NOT DISTINCT
+		 * FROM, but those are currently not considered compatible (we only
+		 * allow OpExpr at the moment).
+		 */
+		has_nulls = false;
+		for (j = 0; j < list_length(clauses); j++)
+			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+
+		if (has_nulls)
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			int			idx;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (cmatches2 && !cmatches2[j])
+				continue;
+
+			/*
+			 * XXX We can't skip based on existing matches2 value, because there
+			 * may be duplicates in the first MCV.
+			 */
+
+			/*
+			 * Evaluate if all the join clauses match between the two MCV items.
+			 *
+			 * XXX We might optimize the order of evaluation, using the costs of
+			 * operator functions for individual columns. It does depend on the
+			 * number of distinct values, etc.
+			 */
+			idx = 0;
+			foreach (lc, clauses)
+			{
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+
+				/* If either value is null, it's a mismatch */
+				if (mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * Careful about order of parameters. For same-type equality
+					 * that should not matter, but easy enough.
+					 *
+					 * FIXME Use appropriate collation.
+					 */
+					if (reverse_args)
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value2, value1));
+					else
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+
+				idx++;
+			}
+
+			if (items_match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches2 && !cmatches2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel1->rows, NULL, NULL);
+	nd2 = estimate_num_groups(root, exprs2, rel2->rows, NULL, NULL);
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel1;
+	nd2 *= csel2;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+
+/*
+ * statext_compare_simple
+ *		Calculate join selectivity using a combination of extended
+ * statistics MCV on one side, and simple per-column MCV on the other.
+ *
+ * Most of the mcv_combine_extended comment applies here too, but we can make
+ * some simplifications because we know the second (per-column) MCV is simpler,
+ * contains no NULL or duplicate values, etc.
+ */
+Selectivity
+mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
+				   AttStatsSlot *sslot, double stanullfrac, double nd,
+				   bool isdefault, Node *clause)
+{
+	MCVList    *mcv;
+	int			i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *conditions = NIL;
+	bool   *cmatches = NULL;
+
+	double	csel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two sides */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo	opproc;
+	int			index = 0;
+	bool		reverse = false;
+	RangeTblEntry *rte = root->simple_rte_array[rel->relid];
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert(stat->kind = STATS_EXT_MCV);
+
+	mcv = statext_mcv_load(stat->statOid, rte->inh);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions = statext_determine_join_restrictions(root, rel, stat);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions)
+	{
+		cmatches = mcv_get_match_bitmap(root, conditions,
+										 stat->keys, stat->exprs,
+										 mcv, false);
+		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
+
+	/* Matches for the side with just regular single-column MCV. */
+	matches2 = (bool *) palloc0(sizeof(bool) * sslot->nvalues);
+
+	/*
+	 * Initialize information about the clause and how it matches to the
+	 * extended stats we picked. We do this only once before processing
+	 * the lists, so that we don't have to do that for each item or so.
+	 */
+	{
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opproc);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if (bms_singleton_member(relids1) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
+										 &collid);
+			reverse = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if (bms_singleton_member(relids2) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
+										 &collid);
+			reverse = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((index >= 0) &&
+			   (index < bms_num_members(stat->keys) + list_length(stat->exprs)));
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		/*
+		 * We can check mcv1->items[i].isnull[index1] here, because it'll be a
+		 * mismatch anyway (the simple MCV does not contain NULLs).
+		 */
+		if (mcv->items[i].isnull[index])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < sslot->nvalues; j++)
+		{
+			bool	match;
+			Datum	value1 = mcv->items[i].values[index];
+			Datum	value2 = sslot->values[j];
+
+			/*
+			 * Evaluate the join clause between the two MCV lists. We don't
+			 * need to deal with NULL values here - we've already checked for
+			 * NULL in the extended statistics earlier, and the simple MCV
+			 * does not contain NULL values.
+			 *
+			 * Careful about order of parameters. For same-type equality
+			 * that should not matter, but easy enough.
+			 *
+			 * FIXME Use appropriate collation.
+			 */
+			if (reverse)
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value2, value1));
+			else
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value1, value2));
+
+			if (match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv->items[i].frequency * sslot->numbers[j];
+
+				/*
+				 * We know there can be just a single match in the regular
+				 * MCV list, so we can abort the inner loop.
+				 */
+				break;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		mcvfreq1 += mcv->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv->items[i].frequency;
+		else
+			unmatchfreq1 += mcv->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < sslot->nvalues; i++)
+	{
+		mcvfreq2 += sslot->numbers[i];
+
+		if (matches2[i])
+			matchfreq2 += sslot->numbers[i];
+		else
+			unmatchfreq2 += sslot->numbers[i];
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel->rows, NULL, NULL);
+	nd2 = nd;
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 8eed9b338d..a85f896d53 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,6 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
+#include "utils/lsyscache.h"
 #include "utils/sortsupport.h"
 
 typedef struct
@@ -127,4 +128,23 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern Selectivity mcv_combine_simple(PlannerInfo *root,
+									  RelOptInfo *rel,
+									  StatisticExtInfo *stat,
+									  AttStatsSlot *sslot,
+									  double stanullfrac,
+									  double nd, bool isdefault,
+									  Node *clause);
+
+extern Selectivity mcv_combine_extended(PlannerInfo *root,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										StatisticExtInfo *stat1,
+										StatisticExtInfo *stat2,
+										List *clauses);
+
+extern List *statext_determine_join_restrictions(PlannerInfo *root,
+												 RelOptInfo *rel,
+												 StatisticExtInfo *info);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..60b222028d 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -127,4 +127,16 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
+extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
+
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 10903bdab0..95246522bb 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3074,6 +3074,173 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+       250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+        75 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+      1250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+      1000 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+    100000 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+     30000 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 42cb7dd97d..c7023620a1 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1547,6 +1547,72 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.34.1

v1-0002-Remove-estimiatedcluases-and-varRelid-arguments.patch0000644000175000017500000001414414613326575024724 0ustar  yizhi.fzhyizhi.fzhFrom 6596c932545ef25b856b07f44a23639ec1210ccf Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 09:39:17 +0800
Subject: [PATCH v1 02/22] Remove estimiatedcluases and varRelid arguments

comments and Assert around the changes provides more information.
---
 src/backend/optimizer/path/clausesel.c  | 16 ++++++++++------
 src/backend/statistics/extended_stats.c | 24 ++++++++++--------------
 src/include/statistics/statistics.h     |  4 +---
 3 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index bedf76edae..e1683febf6 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -204,14 +204,18 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * do to detect when this makes sense, but we can check that there are
 	 * join clauses, and that at least some of the rels have stats.
 	 *
-	 * XXX Isn't this mutually exclusive with the preceding block which
-	 * calculates estimates for a single relation?
+	 * rel != NULL can't grantee the clause is not a join clause, for example
+	 * t1 left join t2 ON t1.a = 3, but it can grantee we can't use extended
+	 * statistics for estimation since it has only 1 relid.
+	 *
+	 * XXX: so we can grantee estimatedclauses == NULL now, so estimatedclauses
+	 * in statext_try_join_estimates is removed.
 	 */
-	if (use_extended_stats &&
-		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
-						 estimatedclauses))
+	if (use_extended_stats && rel == NULL &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
 	{
-		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+		Assert(varRelid == 0);
+		s1 *= statext_clauselist_join_selectivity(root, clauses,
 												  jointype, sjinfo,
 												  &estimatedclauses);
 	}
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 80872cc7da..5ed3c5e332 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2806,8 +2806,7 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
  * on the conditions.
  */
 static bool
-statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
-								 int varRelid, SpecialJoinInfo *sjinfo)
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInfo *sjinfo)
 {
 	Oid	oprsel;
 	RestrictInfo   *rinfo;
@@ -2819,7 +2818,9 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
 	 *
 	 * XXX See treat_as_join_clause.
 	 */
-	if ((varRelid != 0) || (sjinfo == NULL))
+
+	/* duplicated with statext_try_join_estimates */
+	if (sjinfo == NULL)
 		return false;
 
 	/* XXX Can we rely on always getting RestrictInfo here? */
@@ -2903,8 +2904,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
  */
 bool
 statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
-						   JoinType jointype, SpecialJoinInfo *sjinfo,
-						   Bitmapset *estimatedclauses)
+						   JoinType jointype, SpecialJoinInfo *sjinfo)
 {
 	int			listidx;
 	int			k;
@@ -2941,15 +2941,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 		/* needs to happen before skipping any clauses */
 		listidx++;
 
-		/* Skip clauses that were already estimated. */
-		if (bms_is_member(listidx, estimatedclauses))
-			continue;
-
 		/*
 		 * Skip clauses that are not join clauses or that we don't know
 		 * how to handle estimate using extended statistics.
 		 */
-		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause, sjinfo))
 			continue;
 
 		/*
@@ -3019,7 +3015,7 @@ typedef struct JoinPairInfo
  * with F_EQJOINSEL selectivity function at the moment).
  */
 static JoinPairInfo *
-statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+statext_build_join_pairs(PlannerInfo *root, List *clauses,
 						 JoinType jointype, SpecialJoinInfo *sjinfo,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
@@ -3055,7 +3051,7 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
 		 * the moment we support just (Expr op Expr) clauses with each
 		 * side referencing just a single relation).
 		 */
-		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause, sjinfo))
 			continue;
 
 		/* statext_is_supported_join_clause guarantees RestrictInfo */
@@ -3243,7 +3239,7 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  * statistics in that case yet).
  */
 Selectivity
-statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 									JoinType jointype, SpecialJoinInfo *sjinfo,
 									Bitmapset **estimatedclauses)
 {
@@ -3258,7 +3254,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+	info = statext_build_join_pairs(root, clauses, jointype, sjinfo,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 60b222028d..4f70034983 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -131,11 +131,9 @@ extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo
 										   Bitmapset *attnums, List *exprs);
 
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
-									   JoinType jointype, SpecialJoinInfo *sjinfo,
-									   Bitmapset *estimatedclauses);
+									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
 extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   int varRelid,
 													   JoinType jointype, SpecialJoinInfo *sjinfo,
 													   Bitmapset **estimatedclauses);
 
-- 
2.34.1

v1-0003-Remove-SpecialJoinInfo-sjinfo-argument.patch0000644000175000017500000001065214613326575023026 0ustar  yizhi.fzhyizhi.fzhFrom 51aedd185e9c3e2369a328f285c9c9e5988ea772 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 09:58:18 +0800
Subject: [PATCH v1 03/22] Remove SpecialJoinInfo *sjinfo argument

It was passed down to statext_is_supported_join_clause where it is
used for checking if it is NULL.  However it has been checked before
in statext_try_join_estimates.
---
 src/backend/optimizer/path/clausesel.c  |  3 ++-
 src/backend/statistics/extended_stats.c | 16 ++++++----------
 src/include/statistics/statistics.h     |  3 +--
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index e1683febf6..ca550e6c0c 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -215,8 +215,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
 	{
 		Assert(varRelid == 0);
+		Assert(sjinfo != NULL);
 		s1 *= statext_clauselist_join_selectivity(root, clauses,
-												  jointype, sjinfo,
+												  jointype,
 												  &estimatedclauses);
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 5ed3c5e332..ca604306e7 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2806,7 +2806,7 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
  * on the conditions.
  */
 static bool
-statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInfo *sjinfo)
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 {
 	Oid	oprsel;
 	RestrictInfo   *rinfo;
@@ -2819,10 +2819,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInf
 	 * XXX See treat_as_join_clause.
 	 */
 
-	/* duplicated with statext_try_join_estimates */
-	if (sjinfo == NULL)
-		return false;
-
 	/* XXX Can we rely on always getting RestrictInfo here? */
 	if (!IsA(clause, RestrictInfo))
 		return false;
@@ -2945,7 +2941,7 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 		 * Skip clauses that are not join clauses or that we don't know
 		 * how to handle estimate using extended statistics.
 		 */
-		if (!statext_is_supported_join_clause(root, clause, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause))
 			continue;
 
 		/*
@@ -3016,7 +3012,7 @@ typedef struct JoinPairInfo
  */
 static JoinPairInfo *
 statext_build_join_pairs(PlannerInfo *root, List *clauses,
-						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 JoinType jointype,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
 	int				cnt;
@@ -3051,7 +3047,7 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 		 * the moment we support just (Expr op Expr) clauses with each
 		 * side referencing just a single relation).
 		 */
-		if (!statext_is_supported_join_clause(root, clause, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause))
 			continue;
 
 		/* statext_is_supported_join_clause guarantees RestrictInfo */
@@ -3240,7 +3236,7 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-									JoinType jointype, SpecialJoinInfo *sjinfo,
+									JoinType jointype,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
@@ -3254,7 +3250,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, jointype, sjinfo,
+	info = statext_build_join_pairs(root, clauses, jointype,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 4f70034983..28d9e72e54 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -134,7 +134,6 @@ extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int var
 									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
 extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   JoinType jointype, SpecialJoinInfo *sjinfo,
-													   Bitmapset **estimatedclauses);
+													   JoinType jointype, Bitmapset **estimatedclauses);
 
 #endif							/* STATISTICS_H */
-- 
2.34.1

v1-0004-Remove-joinType-argument.patch0000644000175000017500000000525014613326575020324 0ustar  yizhi.fzhyizhi.fzhFrom 66e26391dcedad0e100b984dff5a5cf3a0309380 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 10:07:00 +0800
Subject: [PATCH v1 04/22] Remove joinType argument.

---
 src/backend/optimizer/path/clausesel.c  | 1 -
 src/backend/statistics/extended_stats.c | 4 +---
 src/include/statistics/statistics.h     | 3 +--
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index ca550e6c0c..50210ec2ca 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -217,7 +217,6 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		Assert(varRelid == 0);
 		Assert(sjinfo != NULL);
 		s1 *= statext_clauselist_join_selectivity(root, clauses,
-												  jointype,
 												  &estimatedclauses);
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index ca604306e7..9247aef0b7 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3012,7 +3012,6 @@ typedef struct JoinPairInfo
  */
 static JoinPairInfo *
 statext_build_join_pairs(PlannerInfo *root, List *clauses,
-						 JoinType jointype,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
 	int				cnt;
@@ -3236,7 +3235,6 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-									JoinType jointype,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
@@ -3250,7 +3248,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, jointype,
+	info = statext_build_join_pairs(root, clauses,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 28d9e72e54..97a217af1e 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -133,7 +133,6 @@ extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
-extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   JoinType jointype, Bitmapset **estimatedclauses);
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, Bitmapset **estimatedclauses);
 
 #endif							/* STATISTICS_H */
-- 
2.34.1

v1-0005-use-the-pre-calculated-RestrictInfo-left-right_re.patch0000644000175000017500000000555314613326575025056 0ustar  yizhi.fzhyizhi.fzhFrom f76a4f150af93eadc6bf55d2c9484b88fd0d8122 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 10:39:03 +0800
Subject: [PATCH v1 05/22] use the pre-calculated
 RestrictInfo->left|right_relids

It should has better performance than pull_varnos and easier to
understand.
---
 src/backend/statistics/extended_stats.c | 35 +++++--------------------
 1 file changed, 6 insertions(+), 29 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 9247aef0b7..42a03a8803 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2811,7 +2811,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	Oid	oprsel;
 	RestrictInfo   *rinfo;
 	OpExpr		   *opclause;
-	ListCell	   *lc;
+	int				left_relid, right_relid;
 
 	/*
 	 * evaluation as a restriction clause, either at scan node or forced
@@ -2827,10 +2827,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	rinfo = (RestrictInfo *) clause;
 	clause = (Node *) rinfo->clause;
 
-	/* is it referencing multiple relations? */
-	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
-		return false;
-
 	/* we only support simple operator clauses for now */
 	if (!is_opclause(clause))
 		return false;
@@ -2853,8 +2849,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	 * which is still technically an opclause, but we can't match it to
 	 * extended statistics in a simple way.
 	 *
-	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
-	 *
 	 * XXX Also check it's not expression on system attributes, which we
 	 * don't allow in extended statistics.
 	 *
@@ -2863,30 +2857,13 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	 * or something like that. We could do "cartesian product" of the MCV
 	 * stats and restrict it using this condition.
 	 */
-	foreach (lc, opclause->args)
-	{
-		Bitmapset *varnos = NULL;
-		Node *expr = (Node *) lfirst(lc);
-
-		varnos = pull_varnos(root, expr);
 
-		/*
-		 * No argument should reference more than just one relation.
-		 *
-		 * This effectively means each side references just two relations.
-		 * If there's no relation on one side, it's a Const, and the other
-		 * side has to be either Const or Expr with a single rel. In which
-		 * case it can't be a join clause.
-		 */
-		if (bms_num_members(varnos) > 1)
-			return false;
+	if (!bms_get_singleton_member(rinfo->left_relids, &left_relid) ||
+		!bms_get_singleton_member(rinfo->right_relids, &right_relid))
+		return false;
 
-		/*
-		 * XXX Maybe check that both relations have extended statistics
-		 * (no point in considering the clause as useful without it). But
-		 * we'll do that check later anyway, so keep this cheap.
-		 */
-	}
+	if (left_relid == right_relid)
+		return false;
 
 	return true;
 }
-- 
2.34.1

v1-0006-Fast-path-for-general-clauselist_selectivity.patch0000644000175000017500000000174614613326575024300 0ustar  yizhi.fzhyizhi.fzhFrom b2c50a0254c536ad9612f13031a8d74452bb8dff Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 14:47:54 +0800
Subject: [PATCH v1 06/22] Fast path for general clauselist_selectivity

It should be common in the most queries like

SELECT * FROM t1, t2 WHERE t1.a = t2.a AND t1.a > 3;

clauses == NULL at the scan level of t2.
---
 src/backend/optimizer/path/clausesel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 50210ec2ca..c4f5fae9d7 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -132,6 +132,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	int			listidx;
 	bool		single_clause_optimization = true;
 
+	if (clauses == NULL)
+		return 1.0;
+
 	/*
 	 * The optimization of skipping to clause_selectivity_ext for single
 	 * clauses means we can't improve join estimates with a single join
-- 
2.34.1

v1-0007-bms_is_empty-is-more-effective-than-bms_num_membe.patch0000644000175000017500000000150214613326575025203 0ustar  yizhi.fzhyizhi.fzhFrom bed0777069c09ecb0d49ee607960a3275127ef6f Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 14:53:30 +0800
Subject: [PATCH v1 07/22] bms_is_empty is more effective than
 bms_num_members(b) == 0.

---
 src/backend/statistics/extended_stats.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 42a03a8803..b05be9578c 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2932,7 +2932,7 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 	}
 
 	/* no join clauses found, don't try applying extended stats */
-	if (bms_num_members(relids) == 0)
+	if (bms_is_empty(relids))
 		return false;
 
 	/*
-- 
2.34.1

v1-0008-a-branch-of-updates-around-JoinPairInfo.patch0000644000175000017500000001775114613326575023020 0ustar  yizhi.fzhyizhi.fzhFrom b0baef8918200ccaa8890b6f6c79f72ae3d74c06 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 16:01:05 +0800
Subject: [PATCH v1 08/22] a branch of updates around JoinPairInfo

1. rename rels to relids while the "rels" may reference to list of
RelOptInfo or Relids. but the later one reference to Relids all the
time.

2. Store RestrictInfo to JoinPairInfo.clauses so that we can reuse
the left_relids, right_relids which will save us from calling
pull_varnos.

3. create bms_nth_member function in bitmapset.c and use it
extract_relation_info, the function name is self-documented.

4. pfree the JoinPairInfo array when we are done with that.
---
 src/backend/nodes/bitmapset.c           | 18 ++++++++++++
 src/backend/statistics/extended_stats.c | 37 ++++++++++++-------------
 src/backend/statistics/mcv.c            | 34 ++++++-----------------
 src/include/nodes/bitmapset.h           |  1 +
 4 files changed, 44 insertions(+), 46 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index cd05c642b0..7c1291ae64 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -772,6 +772,24 @@ bms_num_members(const Bitmapset *a)
 	return result;
 }
 
+/*
+ * bms_nth_member - return the nth member, index starts with 0.
+ */
+int
+bms_nth_member(const Bitmapset *a, int i)
+{
+	int idx, res = -1;
+
+	for (idx = 0; idx <= i; idx++)
+	{
+		res = bms_next_member(a, res);
+
+		if (res < 0)
+			elog(ERROR, "no enough members for %d", i);
+	}
+	return res;
+}
+
 /*
  * bms_membership - does a set have zero, one, or multiple members?
  *
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b05be9578c..dc10598cb1 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2967,11 +2967,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 }
 
 /*
- * Information about two joined relations, along with the join clauses between.
+ * Information about two joined relations, group by clauses by relids.
  */
 typedef struct JoinPairInfo
 {
-	Bitmapset  *rels;
+	Bitmapset  *relids;
 	List	   *clauses;
 } JoinPairInfo;
 
@@ -3034,9 +3034,9 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 		found = false;
 		for (i = 0; i < cnt; i++)
 		{
-			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			if (bms_is_subset(rinfo->clause_relids, info[i].relids))
 			{
-				info[i].clauses = lappend(info[i].clauses, clause);
+				info[i].clauses = lappend(info[i].clauses, rinfo);
 				found = true;
 				break;
 			}
@@ -3044,14 +3044,17 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 
 		if (!found)
 		{
-			info[cnt].rels = rinfo->clause_relids;
-			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			info[cnt].relids = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, rinfo);
 			cnt++;
 		}
 	}
 
 	if (cnt == 0)
+	{
+		pfree(info);
 		return NULL;
+	}
 
 	*npairs = cnt;
 	return info;
@@ -3071,7 +3074,6 @@ static RelOptInfo *
 extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 					  StatisticExtInfo **stat)
 {
-	int			k;
 	int			relid;
 	RelOptInfo *rel;
 	ListCell   *lc;
@@ -3081,16 +3083,7 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 
 	Assert((index >= 0) && (index <= 1));
 
-	k = -1;
-	while (index >= 0)
-	{
-		k = bms_next_member(info->rels, k);
-		if (k < 0)
-			elog(ERROR, "failed to extract relid");
-
-		relid = k;
-		index--;
-	}
+	relid = bms_nth_member(info->relids, index);
 
 	rel = find_base_rel(root, relid);
 
@@ -3102,7 +3095,8 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 	foreach (lc, info->clauses)
 	{
 		ListCell *lc2;
-		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Node *clause = (Node *) rinfo->clause;
 		OpExpr *opclause = (OpExpr *) clause;
 
 		/* only opclauses supported for now */
@@ -3142,7 +3136,8 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 			 * compatible because we already checked it when building the
 			 * join pairs.
 			 */
-			varnos = pull_varnos(root, arg);
+			varnos = list_cell_number(opclause->args, lc2) == 0 ?
+				rinfo->left_relids : rinfo->right_relids;
 
 			if (relid == bms_singleton_member(varnos))
 				exprs = lappend(exprs, arg);
@@ -3383,7 +3378,8 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		 */
 		foreach (lc, info->clauses)
 		{
-			Node *clause = (Node *) lfirst(lc);
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+			Node *clause = (Node *) rinfo->clause;
 			ListCell *lc2;
 
 			listidx = -1;
@@ -3405,5 +3401,6 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		}
 	}
 
+	pfree(info);
 	return s;
 }
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 49299ed907..53b481a291 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2214,8 +2214,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	MCVList    *mcv1,
 			   *mcv2;
-	int			idx,
-				i,
+	int			i,
 				j;
 	Selectivity s = 0;
 
@@ -2306,25 +2305,14 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
 
-	idx = 0;
 	foreach (lc, clauses)
 	{
-		Node	   *clause = (Node *) lfirst(lc);
+		RestrictInfo	*rinfo = (RestrictInfo *) lfirst(lc);
+		Node	   *clause = (Node *) rinfo->clause;
 		OpExpr	   *opexpr;
 		Node	   *expr1,
 				   *expr2;
-		Bitmapset  *relids1,
-				   *relids2;
-
-		/*
-		 * Strip the RestrictInfo node, get the actual clause.
-		 *
-		 * XXX Not sure if we need to care about removing other node types
-		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
-		 * matches this, but maybe we need to relax it?
-		 */
-		if (IsA(clause, RestrictInfo))
-			clause = (Node *) ((RestrictInfo *) clause)->clause;
+		int		idx = list_cell_number(clauses, lc);
 
 		opexpr = (OpExpr *) clause;
 
@@ -2338,12 +2326,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
 
-		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
-		relids1 = pull_varnos(root, expr1);
-		relids2 = pull_varnos(root, expr2);
-
-		if ((bms_singleton_member(relids1) == rel1->relid) &&
-			(bms_singleton_member(relids2) == rel2->relid))
+		if ((bms_singleton_member(rinfo->left_relids) == rel1->relid) &&
+			(bms_singleton_member(rinfo->right_relids) == rel2->relid))
 		{
 			Oid		collid;
 
@@ -2358,8 +2342,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
 		}
-		else if ((bms_singleton_member(relids2) == rel1->relid) &&
-				 (bms_singleton_member(relids1) == rel2->relid))
+		else if ((bms_singleton_member(rinfo->right_relids) == rel1->relid) &&
+				 (bms_singleton_member(rinfo->left_relids) == rel2->relid))
 		{
 			Oid		collid;
 
@@ -2383,8 +2367,6 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		Assert((indexes2[idx] >= 0) &&
 			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
-
-		idx++;
 	}
 
 	/*
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 283bea5ea9..8d32e7a244 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -110,6 +110,7 @@ extern bool bms_nonempty_difference(const Bitmapset *a, const Bitmapset *b);
 extern int	bms_singleton_member(const Bitmapset *a);
 extern bool bms_get_singleton_member(const Bitmapset *a, int *member);
 extern int	bms_num_members(const Bitmapset *a);
+extern int  bms_nth_member(const Bitmapset *a, int i);
 
 /* optimized tests when we don't need to know exact membership count: */
 extern BMS_Membership bms_membership(const Bitmapset *a);
-- 
2.34.1

v1-0009-Cache-the-result-of-statext_determine_join_restri.patch0000644000175000017500000001762714613326575025324 0ustar  yizhi.fzhyizhi.fzhFrom a40edd6ed2cc780122a8fc791e2abc0762d61a77 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Wed, 3 Apr 2024 15:01:49 +0800
Subject: [PATCH v1 09/22] Cache the result of
 statext_determine_join_restrictions.

It is firstly needed when choosing statext_find_matching_mcv and then it
is needed when mcv_combine_extended, so caching the result to save some
cycles.
---
 src/backend/statistics/extended_stats.c       | 34 ++++++++++++++-----
 src/backend/statistics/mcv.c                  | 19 ++++-------
 .../statistics/extended_stats_internal.h      |  2 ++
 src/include/statistics/statistics.h           |  3 +-
 4 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index dc10598cb1..cb157872fa 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2641,7 +2641,8 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 /*
  * statext_find_matching_mcv
- *		Search for a MCV covering all the attributes and expressions.
+ *		Search for a MCV covering all the attributes and expressions and set
+ * the conditions to calculate conditional probability.
  *
  * We pick the statistics to use for join estimation. The statistics object has
  * to have MCV, and we require it to match all the join conditions, because it
@@ -2663,7 +2664,8 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
  */
 StatisticExtInfo *
 statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
-						  Bitmapset *attnums, List *exprs)
+						  Bitmapset *attnums, List *exprs,
+						  List **base_conditions)
 {
 	ListCell   *l;
 	StatisticExtInfo *mcv = NULL;
@@ -2693,6 +2695,7 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		if (!mcv)
 		{
 			mcv = stat;
+			*base_conditions = statext_determine_join_restrictions(root, rel, mcv);
 			continue;
 		}
 
@@ -2731,14 +2734,24 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		if (list_length(conditions1) > list_length(conditions2))
 		{
 			mcv = stat;
+			*base_conditions = conditions1;
 			continue;
 		}
+		else
+		{
+			*base_conditions = conditions2;
+		}
 
 		/* The statistics seem about equal, so just use the smaller one. */
 		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
 			bms_num_members(stat->keys) + list_length(stat->exprs))
 		{
 			mcv = stat;
+			*base_conditions = conditions1;
+		}
+		else
+		{
+			*base_conditions = conditions2;
 		}
 	}
 
@@ -2753,7 +2766,7 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
  * and covered by the extended statistics object.
  *
  * When using extended statistics to estimate joins, we can use conditions
- * from base relations to calculate conditional probability
+ * from base relations to calculate conditional probability.
  *
  *    P(join clauses | baserel restrictions)
  *
@@ -3072,7 +3085,7 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
  */
 static RelOptInfo *
 extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
-					  StatisticExtInfo **stat)
+					  StatisticExtInfo **stat, List **base_conditions)
 {
 	int			relid;
 	RelOptInfo *rel;
@@ -3144,7 +3157,7 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 		}
 	}
 
-	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs, base_conditions);
 
 	return rel;
 }
@@ -3258,11 +3271,14 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		StatisticExtInfo *stat1;
 		StatisticExtInfo *stat2;
 
+		List	*base_condition1 = NULL,
+				*base_condition2 = NULL;
+
 		/* extract info about the first relation */
-		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1, &base_condition1);
 
 		/* extract info about the second relation */
-		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2, &base_condition2);
 
 		/*
 		 * We can handle three basic cases:
@@ -3285,7 +3301,9 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		 */
 		if (stat1 && stat2)
 		{
-			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2,
+									  base_condition1, base_condition2,
+									  info[i].clauses);
 		}
 		else if (stat1 && (list_length(info[i].clauses) == 1))
 		{
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 53b481a291..27f31a079f 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2208,6 +2208,7 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 Selectivity
 mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *base_cond1, List *base_cond2,
 					 List *clauses)
 {
 	ListCell   *lc;
@@ -2221,8 +2222,6 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	/* match bitmaps and selectivity for baserel conditions (if any) */
 	List   *exprs1 = NIL,
 		   *exprs2 = NIL;
-	List   *conditions1 = NIL,
-		   *conditions2 = NIL;
 	bool   *cmatches1 = NULL,
 		   *cmatches2 = NULL;
 
@@ -2264,28 +2263,24 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	/* should only get here with MCV on both sides */
 	Assert(mcv1 && mcv2);
 
-	/* Determine which baserel clauses to use for conditional probability. */
-	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
-	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
-
 	/*
 	 * Calculate match bitmaps for restrictions on either side of the join
 	 * (there may be none, in which case this will be NULL).
 	 */
-	if (conditions1)
+	if (base_cond1)
 	{
-		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+		cmatches1 = mcv_get_match_bitmap(root, base_cond1,
 										 stat1->keys, stat1->exprs,
 										 mcv1, false);
-		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+		csel1 = clauselist_selectivity(root, base_cond1, rel1->relid, 0, NULL);
 	}
 
-	if (conditions2)
+	if (base_cond2)
 	{
-		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+		cmatches2 = mcv_get_match_bitmap(root, base_cond2,
 										 stat2->keys, stat2->exprs,
 										 mcv2, false);
-		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+		csel2 = clauselist_selectivity(root, base_cond2, rel2->relid, 0, NULL);
 	}
 
 	/*
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index a85f896d53..47f1258d81 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -141,6 +141,8 @@ extern Selectivity mcv_combine_extended(PlannerInfo *root,
 										RelOptInfo *rel2,
 										StatisticExtInfo *stat1,
 										StatisticExtInfo *stat2,
+										List	*base_cond1,
+										List	*base_cond2,
 										List *clauses);
 
 extern List *statext_determine_join_restrictions(PlannerInfo *root,
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 97a217af1e..d1368a0583 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -128,7 +128,8 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
 extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
-										   Bitmapset *attnums, List *exprs);
+												   Bitmapset *attnums, List *exprs,
+												   List **base_conditions);
 
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 									   JoinType jointype, SpecialJoinInfo *sjinfo);
-- 
2.34.1

v1-0010-Simplify-code-by-using-list_cell_number.patch0000644000175000017500000000736314613326575023233 0ustar  yizhi.fzhyizhi.fzhFrom 8d5816e74098486913f8c81affffff4cb3f6d64e Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Wed, 3 Apr 2024 15:09:36 +0800
Subject: [PATCH v1 10/22] Simplify code by using list_cell_number

instead of maintaining it manually.

and remove the below lines from statext_clauselist_join_selectivity.

	if (!clauses)
		return 1.0;

since it has been handled in clauselist_selectivity_ext.
---
 src/backend/statistics/extended_stats.c | 30 +++++++------------------
 1 file changed, 8 insertions(+), 22 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index cb157872fa..b07ea248b9 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2826,13 +2826,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	OpExpr		   *opclause;
 	int				left_relid, right_relid;
 
-	/*
-	 * evaluation as a restriction clause, either at scan node or forced
-	 *
-	 * XXX See treat_as_join_clause.
-	 */
-
-	/* XXX Can we rely on always getting RestrictInfo here? */
 	if (!IsA(clause, RestrictInfo))
 		return false;
 
@@ -2875,6 +2868,12 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 		!bms_get_singleton_member(rinfo->right_relids, &right_relid))
 		return false;
 
+	/*
+	 * XXX:
+	 * Join two columns in the same relation is uncommon and
+	 * extract_relation_info requires 2 different relids, so no bother to
+	 * handle it.
+	 */
 	if (left_relid == right_relid)
 		return false;
 
@@ -2892,7 +2891,6 @@ bool
 statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo)
 {
-	int			listidx;
 	int			k;
 	ListCell   *lc;
 	Bitmapset  *relids = NULL;
@@ -2918,15 +2916,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 	 * a single bit in the match bitmap). The challenge is what to do about the
 	 * part not represented by MCV, which is now based on ndistinct estimates.
 	 */
-	listidx = -1;
 	foreach (lc, clauses)
 	{
 		Node *clause = (Node *) lfirst(lc);
 		RestrictInfo *rinfo;
 
-		/* needs to happen before skipping any clauses */
-		listidx++;
-
 		/*
 		 * Skip clauses that are not join clauses or that we don't know
 		 * how to handle estimate using extended statistics.
@@ -3005,7 +2999,6 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
 	int				cnt;
-	int				listidx;
 	JoinPairInfo   *info;
 	ListCell	   *lc;
 
@@ -3017,15 +3010,13 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
 	cnt = 0;
 
-	listidx = -1;
 	foreach(lc, clauses)
 	{
 		int				i;
 		bool			found;
 		Node		   *clause = (Node *) lfirst(lc);
 		RestrictInfo   *rinfo;
-
-		listidx++;
+		int				listidx = list_cell_number(clauses, lc);
 
 		/* skip already estimated clauses */
 		if (bms_is_member(listidx, estimatedclauses))
@@ -3223,15 +3214,11 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
-	int			listidx;
 	Selectivity	s = 1.0;
 
 	JoinPairInfo *info;
 	int				ninfo;
 
-	if (!clauses)
-		return 1.0;
-
 	/* extract pairs of joined relations from the list of clauses */
 	info = statext_build_join_pairs(root, clauses,
 									*estimatedclauses, &ninfo);
@@ -3400,11 +3387,10 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 			Node *clause = (Node *) rinfo->clause;
 			ListCell *lc2;
 
-			listidx = -1;
 			foreach (lc2, clauses)
 			{
 				Node *clause2 = (Node *) lfirst(lc2);
-				listidx++;
+				int listidx = list_cell_number(clauses, lc2);
 
 				Assert(IsA(clause2, RestrictInfo));
 
-- 
2.34.1

v1-0011-Handle-the-RelableType.patch0000644000175000017500000000221014613326575017616 0ustar  yizhi.fzhyizhi.fzhFrom 391f6f18b5b43943225507d8eb52e39624d81d3a Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sun, 7 Apr 2024 13:24:59 +0800
Subject: [PATCH v1 11/22] Handle the RelableType.

---
 src/backend/statistics/mcv.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 27f31a079f..edd02825c4 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2317,10 +2317,16 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
 
-		/* FIXME strip relabel etc. the way examine_opclause_args does */
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
 
+		/* strip RelabelType from either side of the expression */
+		if (IsA(expr1, RelabelType))
+			expr1 = (Node *) ((RelabelType *) expr1)->arg;
+
+		if (IsA(expr2, RelabelType))
+			expr2 = (Node *) ((RelabelType *) expr2)->arg;
+
 		if ((bms_singleton_member(rinfo->left_relids) == rel1->relid) &&
 			(bms_singleton_member(rinfo->right_relids) == rel2->relid))
 		{
-- 
2.34.1

v1-0012-Use-FunctionCallInvoke-instead-of-FunctionCall2Co.patch0000644000175000017500000000757714613326575024726 0ustar  yizhi.fzhyizhi.fzhFrom be1b30cc669900cfb377c89939550794165b4f63 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sun, 7 Apr 2024 13:51:37 +0800
Subject: [PATCH v1 12/22] Use FunctionCallInvoke instead of FunctionCall2Coll

Some stack variables allocation and setup are saved.

A lesson learnt:

FunctionCallInfo  opprocs;

opprocs = (FunctionCallInfo) palloc(SizeForFunctionCallInfo(2) * list_length(clauses));

opprocs[1] points to a opprocs[0].args, which is caused by flexible
array in FunctionCallInfoBaseData. So the above line is pretty error
prone.
---
 src/backend/statistics/mcv.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index edd02825c4..f578c8b86f 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2247,7 +2247,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
-	FmgrInfo   *opprocs;
+	FmgrInfo   *finfo;
+	FunctionCallInfo  *opprocs;
 	int		   *indexes1,
 			   *indexes2;
 	bool	   *reverse;
@@ -2295,7 +2296,9 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * stats we picked. We do this only once before processing the lists,
 	 * so that we don't have to do that for each MCV item or so.
 	 */
-	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	finfo = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	// opprocs = (FunctionCallInfo *) palloc(SizeForFunctionCallInfo(2) * list_length(clauses));
+	opprocs = (FunctionCallInfo *) palloc(sizeof(FunctionCallInfo *) * list_length(clauses));
 	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
@@ -2308,6 +2311,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		Node	   *expr1,
 				   *expr2;
 		int		idx = list_cell_number(clauses, lc);
+		FunctionCallInfo fcinfo = palloc(SizeForFunctionCallInfo(2));
 
 		opexpr = (OpExpr *) clause;
 
@@ -2315,7 +2319,14 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		Assert(is_opclause(clause));
 		Assert(list_length(opexpr->args) == 2);
 
-		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+		fmgr_info(get_opcode(opexpr->opno), &finfo[idx]);
+
+		InitFunctionCallInfoData(*fcinfo, &finfo[idx],
+								 2, opexpr->inputcollid,
+								 NULL, NULL);
+		fcinfo->args[0].isnull = false;
+		fcinfo->args[1].isnull = false;
+		opprocs[idx] = fcinfo;
 
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
@@ -2439,6 +2450,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				Datum	value1,
 						value2;
 				bool	reverse_args = reverse[idx];
+				FunctionCallInfo	fcinfo = opprocs[idx];
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
@@ -2451,17 +2463,18 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 					/*
 					 * Careful about order of parameters. For same-type equality
 					 * that should not matter, but easy enough.
-					 *
-					 * FIXME Use appropriate collation.
 					 */
 					if (reverse_args)
-						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
-															   InvalidOid,
-															   value2, value1));
+					{
+						fcinfo->args[0].value = value2;
+						fcinfo->args[1].value = value1;
+					}
 					else
-						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
-															   InvalidOid,
-															   value1, value2));
+					{
+						fcinfo->args[0].value = value1;
+						fcinfo->args[1].value = value2;
+					}
+					match = DatumGetBool(FunctionCallInvoke(fcinfo));
 				}
 
 				items_match &= match;
-- 
2.34.1

v1-0013-optimize-the-order-of-mcv-equal-function-evaluati.patch0000644000175000017500000001620214613326575025123 0ustar  yizhi.fzhyizhi.fzhFrom 2cc743b33ea378446691e9477cd4c8917608fc7f Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Mon, 8 Apr 2024 13:21:29 +0800
Subject: [PATCH v1 13/22] optimize the order of mcv equal function evaluation

using n_dinstinct values.  See the test in ext_sort_mcv_proc.sql
which should not be committed since it is just a manual test.
---
 src/backend/statistics/mcv.c               | 78 +++++++++++++++++-----
 src/test/regress/sql/ext_sort_mcv_proc.sql | 30 +++++++++
 2 files changed, 90 insertions(+), 18 deletions(-)
 create mode 100644 src/test/regress/sql/ext_sort_mcv_proc.sql

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index f578c8b86f..df28ed929f 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -72,6 +72,33 @@
 	 ((ndims) * sizeof(DimensionInfo)) + \
 	 ((nitems) * ITEM_SIZE(ndims)))
 
+// #define  DEBUG_MCV  1 /* should be removed after review. */
+
+typedef struct
+{
+	FmgrInfo	fmgrinfo;
+	FunctionCallInfo fcinfo;
+	double	n_distinct;
+#ifdef DEBUG_MCV
+	int		idx;
+#endif
+} McvProc;
+
+static int
+cmp_mcv_proc(const void *a, const void *b)
+{
+	/* sort the McvProc reversely based on n_distinct value. */
+	McvProc *m1 = (McvProc *) a;
+	McvProc *m2 = (McvProc *) b;
+
+	if (m1->n_distinct > m2->n_distinct)
+		return -1;
+	else if (m1->n_distinct == m2->n_distinct)
+		return 0;
+	else
+		return 1;
+}
+
 static MultiSortSupport build_mss(StatsBuildData *data);
 
 static SortItem *build_distinct_groups(int numrows, SortItem *items,
@@ -2247,8 +2274,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
-	FmgrInfo   *finfo;
-	FunctionCallInfo  *opprocs;
+	McvProc		*mcvProc;
 	int		   *indexes1,
 			   *indexes2;
 	bool	   *reverse;
@@ -2296,9 +2322,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * stats we picked. We do this only once before processing the lists,
 	 * so that we don't have to do that for each MCV item or so.
 	 */
-	finfo = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
-	// opprocs = (FunctionCallInfo *) palloc(SizeForFunctionCallInfo(2) * list_length(clauses));
-	opprocs = (FunctionCallInfo *) palloc(sizeof(FunctionCallInfo *) * list_length(clauses));
+	mcvProc = (McvProc *) palloc(sizeof(McvProc) * list_length(clauses));
 	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
@@ -2312,6 +2336,9 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				   *expr2;
 		int		idx = list_cell_number(clauses, lc);
 		FunctionCallInfo fcinfo = palloc(SizeForFunctionCallInfo(2));
+		VariableStatData	vardata;
+		bool	isdefault;
+		Node	*left_expr;
 
 		opexpr = (OpExpr *) clause;
 
@@ -2319,14 +2346,17 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		Assert(is_opclause(clause));
 		Assert(list_length(opexpr->args) == 2);
 
-		fmgr_info(get_opcode(opexpr->opno), &finfo[idx]);
-
-		InitFunctionCallInfoData(*fcinfo, &finfo[idx],
+		fmgr_info(get_opcode(opexpr->opno), &mcvProc[idx].fmgrinfo);
+		mcvProc[idx].fcinfo = fcinfo;
+#ifdef DEBUG_MCV
+		mcvProc[idx].idx = idx;
+#endif
+		InitFunctionCallInfoData(*mcvProc[idx].fcinfo,
+								 &mcvProc[idx].fmgrinfo,
 								 2, opexpr->inputcollid,
 								 NULL, NULL);
 		fcinfo->args[0].isnull = false;
 		fcinfo->args[1].isnull = false;
-		opprocs[idx] = fcinfo;
 
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
@@ -2353,6 +2383,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
+
+			left_expr = expr1;
 		}
 		else if ((bms_singleton_member(rinfo->right_relids) == rel1->relid) &&
 				 (bms_singleton_member(rinfo->left_relids) == rel2->relid))
@@ -2369,6 +2401,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 			exprs1 = lappend(exprs1, expr2);
 			exprs2 = lappend(exprs2, expr1);
+
+			left_expr = expr2;
 		}
 		else
 			/* should never happen */
@@ -2379,7 +2413,22 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		Assert((indexes2[idx] >= 0) &&
 			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		examine_variable(root, left_expr, rel1->relid, &vardata);
+		mcvProc[idx].n_distinct = get_variable_numdistinct(&vardata, &isdefault);
+		// elog(INFO, "n_distinct = %f", mcvProc[idx].n_distinct);
+		ReleaseVariableStats(vardata);
+	}
+
+	/* order the McvProc */
+	pg_qsort(mcvProc, list_length(clauses), sizeof(McvProc), cmp_mcv_proc);
+
+#ifdef DEBUG_MCV
+	for (i = 0; i < list_length(clauses); i++)
+	{
+		elog(INFO, "%d", mcvProc[i].idx);
 	}
+#endif
 
 	/*
 	 * Match items between the two MCV lists.
@@ -2436,13 +2485,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 			/*
 			 * Evaluate if all the join clauses match between the two MCV items.
-			 *
-			 * XXX We might optimize the order of evaluation, using the costs of
-			 * operator functions for individual columns. It does depend on the
-			 * number of distinct values, etc.
 			 */
-			idx = 0;
-			foreach (lc, clauses)
+			for(idx = 0; idx < list_length(clauses); idx++)
 			{
 				bool	match;
 				int		index1 = indexes1[idx],
@@ -2450,7 +2494,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				Datum	value1,
 						value2;
 				bool	reverse_args = reverse[idx];
-				FunctionCallInfo	fcinfo = opprocs[idx];
+				FunctionCallInfo	fcinfo = mcvProc[idx].fcinfo;
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
@@ -2481,8 +2525,6 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 				if (!items_match)
 					break;
-
-				idx++;
 			}
 
 			if (items_match)
diff --git a/src/test/regress/sql/ext_sort_mcv_proc.sql b/src/test/regress/sql/ext_sort_mcv_proc.sql
new file mode 100644
index 0000000000..09360d5b23
--- /dev/null
+++ b/src/test/regress/sql/ext_sort_mcv_proc.sql
@@ -0,0 +1,30 @@
+create table t(level_1 text, level_2 text, level_3 text);
+
+insert into t
+values
+('l11', 'l21', 'l31'),
+('l11', 'l21', 'l32'),
+('l11', 'l21', 'l33'),
+('l11', 'l22', 'l34'),
+('l11', 'l22', 'l35'),
+('l11', 'l22', 'l36');
+
+create statistics on level_1, level_2, level_3 from t;
+
+analyze t;
+
+explain select * from t t1 join t t2 using(level_1, level_2, level_3);
+INFO:  n_distinct = 1.000000
+INFO:  n_distinct = 2.000000
+INFO:  n_distinct = 6.000000
+INFO:  2
+INFO:  1
+INFO:  0
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Hash Join  (cost=1.17..2.32 rows=6 width=12)
+   Hash Cond: ((t1.level_1 = t2.level_1) AND (t1.level_2 = t2.level_2) AND (t1.level_3 = t2.level_3))
+   ->  Seq Scan on t t1  (cost=0.00..1.06 rows=6 width=12)
+   ->  Hash  (cost=1.06..1.06 rows=6 width=12)
+         ->  Seq Scan on t t2  (cost=0.00..1.06 rows=6 width=12)
+(5 rows)
-- 
2.34.1

v1-0014-Merge-3-palloc-into-1-palloc.patch0000644000175000017500000001137214613326575020477 0ustar  yizhi.fzhyizhi.fzhFrom eda501926e4bf494e41a74ebeee4b9cd7a2c74c9 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Mon, 8 Apr 2024 14:20:20 +0800
Subject: [PATCH v1 14/22] Merge 3 palloc into 1 palloc

1. Merge 3 palloc into 1 palloc to save 2 palloc calls.

2. A question from me, search: "From Andy".
---
 src/backend/statistics/mcv.c | 45 ++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index df28ed929f..02a74a0ec7 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -84,6 +84,13 @@ typedef struct
 #endif
 } McvProc;
 
+typedef struct
+{
+	int 	index1;
+	int 	index2;
+	bool 	reverse;
+} McvClauseInfo;
+
 static int
 cmp_mcv_proc(const void *a, const void *b)
 {
@@ -2275,9 +2282,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	/* info about clauses and how they match to MCV stats */
 	McvProc		*mcvProc;
-	int		   *indexes1,
-			   *indexes2;
-	bool	   *reverse;
+	McvClauseInfo	*cinfo;
 	RangeTblEntry *rte1 = root->simple_rte_array[rel1->relid];
 	RangeTblEntry *rte2 = root->simple_rte_array[rel2->relid];
 
@@ -2323,9 +2328,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * so that we don't have to do that for each MCV item or so.
 	 */
 	mcvProc = (McvProc *) palloc(sizeof(McvProc) * list_length(clauses));
-	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
-	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
-	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+	cinfo = (McvClauseInfo *) palloc(sizeof(McvClauseInfo) * list_length(clauses));
 
 	foreach (lc, clauses)
 	{
@@ -2373,13 +2376,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		{
 			Oid		collid;
 
-			indexes1[idx] = mcv_match_expression(expr1,
+			cinfo[idx].index1 = mcv_match_expression(expr1,
 												 stat1->keys, stat1->exprs,
 												 &collid);
-			indexes2[idx] = mcv_match_expression(expr2,
+			cinfo[idx].index2 = mcv_match_expression(expr2,
 												 stat2->keys, stat2->exprs,
 												 &collid);
-			reverse[idx] = false;
+			cinfo[idx].reverse = false;
 
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
@@ -2391,13 +2394,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		{
 			Oid		collid;
 
-			indexes1[idx] = mcv_match_expression(expr2,
+			cinfo[idx].index1 = mcv_match_expression(expr2,
 												 stat2->keys, stat2->exprs,
 												 &collid);
-			indexes2[idx] = mcv_match_expression(expr1,
+			cinfo[idx].index2 = mcv_match_expression(expr1,
 												 stat1->keys, stat1->exprs,
 												 &collid);
-			reverse[idx] = true;
+			cinfo[idx].reverse = true;
 
 			exprs1 = lappend(exprs1, expr2);
 			exprs2 = lappend(exprs2, expr1);
@@ -2408,11 +2411,11 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			/* should never happen */
 			Assert(false);
 
-		Assert((indexes1[idx] >= 0) &&
-			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+		Assert((cinfo[idx].index1 >= 0) &&
+			   (cinfo[idx].index1 < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
 
-		Assert((indexes2[idx] >= 0) &&
-			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+		Assert((cinfo[idx].index2 >= 0) &&
+			   (cinfo[idx].index2 < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
 
 		examine_variable(root, left_expr, rel1->relid, &vardata);
 		mcvProc[idx].n_distinct = get_variable_numdistinct(&vardata, &isdefault);
@@ -2463,7 +2466,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		 */
 		has_nulls = false;
 		for (j = 0; j < list_length(clauses); j++)
-			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+			has_nulls |= mcv1->items[i].isnull[cinfo[j].index1];
 
 		if (has_nulls)
 			continue;
@@ -2481,6 +2484,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			/*
 			 * XXX We can't skip based on existing matches2 value, because there
 			 * may be duplicates in the first MCV.
+			 *
+			 * From Andy: what does this mean?
 			 */
 
 			/*
@@ -2489,11 +2494,11 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			for(idx = 0; idx < list_length(clauses); idx++)
 			{
 				bool	match;
-				int		index1 = indexes1[idx],
-						index2 = indexes2[idx];
+				int		index1 = cinfo[idx].index1,
+						index2 = cinfo[idx].index2;
 				Datum	value1,
 						value2;
-				bool	reverse_args = reverse[idx];
+				bool	reverse_args = cinfo[idx].reverse;
 				FunctionCallInfo	fcinfo = mcvProc[idx].fcinfo;
 
 				/* If either value is null, it's a mismatch */
-- 
2.34.1

v1-0015-Remove-2-pull_varnos-calls-with-rinfo-left-right_.patch0000644000175000017500000000337414613326575024777 0ustar  yizhi.fzhyizhi.fzhFrom 28cc5f771df6290e8c0401d00f0ea2c7aeffc969 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Mon, 8 Apr 2024 17:02:29 +0800
Subject: [PATCH v1 15/22] Remove 2 pull_varnos calls with
 rinfo->left|right_relids.

---
 src/backend/statistics/extended_stats.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b07ea248b9..e3251b5aaa 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3169,6 +3169,9 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 {
 	OpExpr *opexpr;
 	Node   *expr;
+	RestrictInfo *rinfo = (RestrictInfo *) clause;
+
+	Assert(IsA(clause, RestrictInfo));
 
 	/*
 	 * Strip the RestrictInfo node, get the actual clause.
@@ -3177,8 +3180,7 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
 	 * matches this, but maybe we need to relax it?
 	 */
-	if (IsA(clause, RestrictInfo))
-		clause = (Node *) ((RestrictInfo *) clause)->clause;
+	clause = (Node *) rinfo->clause;
 
 	opexpr = (OpExpr *) clause;
 
@@ -3188,11 +3190,11 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 
 	/* FIXME strip relabel etc. the way examine_opclause_args does */
 	expr = linitial(opexpr->args);
-	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+	if (bms_singleton_member(rinfo->left_relids) == rel->relid)
 		return expr;
 
 	expr = lsecond(opexpr->args);
-	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+	if (bms_singleton_member(rinfo->right_relids) == rel->relid)
 		return expr;
 
 	return NULL;
-- 
2.34.1

v1-0016-add-the-statistic_proc_security_check-check.patch0000644000175000017500000000400414613326575024144 0ustar  yizhi.fzhyizhi.fzhFrom f8283ad1a8638bae51ed6e10bb4d1e1bbe1a5491 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Mon, 8 Apr 2024 17:12:54 +0800
Subject: [PATCH v1 16/22] add the statistic_proc_security_check check.

---
 src/backend/statistics/extended_stats.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index e3251b5aaa..a2801d0e94 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3316,10 +3316,10 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 			{
 				/* note we allow use of nullfrac regardless of security check */
 				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
-				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
-				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
-											 STATISTIC_KIND_MCV, InvalidOid,
-											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+				if (statistic_proc_security_check(&vardata, F_EQJOINSEL))
+					have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+												 STATISTIC_KIND_MCV, InvalidOid,
+												 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
 			}
 
 			if (have_mcvs)
@@ -3356,10 +3356,10 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 			{
 				/* note we allow use of nullfrac regardless of security check */
 				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
-				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
-				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
-											 STATISTIC_KIND_MCV, InvalidOid,
-											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+				if (statistic_proc_security_check(&vardata, F_EQJOINSEL))
+					have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+												 STATISTIC_KIND_MCV, InvalidOid,
+												 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
 			}
 
 			if (have_mcvs)
-- 
2.34.1

v1-0017-some-code-refactor-as-before.patch0000644000175000017500000001016414613326575020773 0ustar  yizhi.fzhyizhi.fzhFrom 17679d58c915c0769b92420ee2a3164072d1ee53 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Mon, 8 Apr 2024 17:37:52 +0800
Subject: [PATCH v1 17/22] some code refactor as before.

1. use rinfo->left|rigth_relids instead of pull_varnos.
2. use FunctionCallInvoke instead of FunctionCall2Coll.
3. strip RelableType.
---
 src/backend/statistics/mcv.c | 47 +++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 20 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 02a74a0ec7..5d20ea2ae1 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2698,6 +2698,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 
 	/* info about clauses and how they match to MCV stats */
 	FmgrInfo	opproc;
+	LOCAL_FCINFO(fcinfo, 2);
 	int			index = 0;
 	bool		reverse = false;
 	RangeTblEntry *rte = root->simple_rte_array[rel->relid];
@@ -2720,8 +2721,8 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 	if (conditions)
 	{
 		cmatches = mcv_get_match_bitmap(root, conditions,
-										 stat->keys, stat->exprs,
-										 mcv, false);
+										stat->keys, stat->exprs,
+										mcv, false);
 		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
 	}
 
@@ -2743,8 +2744,9 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		OpExpr	   *opexpr;
 		Node	   *expr1,
 				   *expr2;
-		Bitmapset  *relids1,
-				   *relids2;
+		RestrictInfo *rinfo = (RestrictInfo *) clause;
+
+		Assert(IsA(clause, RestrictInfo));
 
 		/*
 		 * Strip the RestrictInfo node, get the actual clause.
@@ -2753,9 +2755,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
 		 * matches this, but maybe we need to relax it?
 		 */
-		if (IsA(clause, RestrictInfo))
-			clause = (Node *) ((RestrictInfo *) clause)->clause;
-
+		clause = (Node *) rinfo->clause;
 		opexpr = (OpExpr *) clause;
 
 		/* Make sure we have the expected node type. */
@@ -2763,16 +2763,21 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		Assert(list_length(opexpr->args) == 2);
 
 		fmgr_info(get_opcode(opexpr->opno), &opproc);
+		InitFunctionCallInfoData(*fcinfo, &opproc, 2, opexpr->inputcollid, NULL, NULL);
+		fcinfo->args[0].isnull = false;
+		fcinfo->args[1].isnull = false;
 
-		/* FIXME strip relabel etc. the way examine_opclause_args does */
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
 
-		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
-		relids1 = pull_varnos(root, expr1);
-		relids2 = pull_varnos(root, expr2);
+		/* strip RelabelType from either side of the expression */
+		if (IsA(expr1, RelabelType))
+			expr1 = (Node *) ((RelabelType *) expr1)->arg;
 
-		if (bms_singleton_member(relids1) == rel->relid)
+		if (IsA(expr2, RelabelType))
+			expr2 = (Node *) ((RelabelType *) expr2)->arg;
+
+		if (bms_singleton_member(rinfo->left_relids) == rel->relid)
 		{
 			Oid		collid;
 
@@ -2783,7 +2788,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
 		}
-		else if (bms_singleton_member(relids2) == rel->relid)
+		else if (bms_singleton_member(rinfo->right_relids) == rel->relid)
 		{
 			Oid		collid;
 
@@ -2849,14 +2854,16 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 			 * FIXME Use appropriate collation.
 			 */
 			if (reverse)
-				match = DatumGetBool(FunctionCall2Coll(&opproc,
-													   InvalidOid,
-													   value2, value1));
+			{
+				fcinfo->args[0].value = value2;
+				fcinfo->args[1].value = value1;
+			}
 			else
-				match = DatumGetBool(FunctionCall2Coll(&opproc,
-													   InvalidOid,
-													   value1, value2));
-
+			{
+				fcinfo->args[0].value = value1;
+				fcinfo->args[1].value = value2;
+			}
+			match = DatumGetBool(FunctionCallInvoke(fcinfo));
 			if (match)
 			{
 				/* XXX Do we need to do something about base frequency? */
-- 
2.34.1

v1-0018-Fix-error-unexpected-system-attribute-when-join-w.patch0000644000175000017500000000766714613326575025175 0ustar  yizhi.fzhyizhi.fzhFrom 321a31d142dbd58df3e57c642e11ce17b4eb5033 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 9 Apr 2024 16:48:26 +0800
Subject: [PATCH v1 18/22] Fix error "unexpected system attribute" when join
 with system attr

We can't just change 'elog(ERROR, "unexpected system attribute");' to
'continue' in extract_relation_info since after we extract the
StatisticExtInfo, and stat is not NULL, we grantee the expression in
JoinPairInfo.clause has a matched expression with mcv_match_expression,
however this is not true for system attribute. so fix it at the first
place when populate the clause into JoinPairInfo.clauses which is the
statext_is_supported_join_clause function. Expression contains a system
attribute is OK since due to the implementation of mcv_match_expression
so only Var need to be handled there.
---
 src/backend/statistics/extended_stats.c | 21 +++++++++++++++++++--
 src/test/regress/sql/stats_ext.sql      |  6 ++++++
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index a2801d0e94..824cd9d279 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2825,6 +2825,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	RestrictInfo   *rinfo;
 	OpExpr		   *opclause;
 	int				left_relid, right_relid;
+	Var			   *var;
 
 	if (!IsA(clause, RestrictInfo))
 		return false;
@@ -2869,14 +2870,21 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 		return false;
 
 	/*
-	 * XXX:
-	 * Join two columns in the same relation is uncommon and
+	 * XXX: Join two columns in the same relation is uncommon and
 	 * extract_relation_info requires 2 different relids, so no bother to
 	 * handle it.
 	 */
 	if (left_relid == right_relid)
 		return false;
 
+	var = (Var *) linitial(opclause->args);
+	if (IsA(var, Var) && var->varattno < 0)
+		return false;
+
+	var = (Var *) lsecond(opclause->args);
+	if (IsA(var, Var) && var->varattno < 0)
+		return false;
+
 	return true;
 }
 
@@ -3148,6 +3156,15 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 		}
 	}
 
+	/*
+	 * Find a stat which covers *all* the attnums and exprs for simplification.
+	 *
+	 * To overcome above limitation, statext_find_matching_mcv has to smart enough to
+	 * decide which expression to discard as the first step. and later the other
+	 * side of join has to use a stats which match or superset of expression here.
+	 * at last mcv_combine_extended should be improved to handle the not-exactly-same
+	 * mcv.
+	 */
 	*stat = statext_find_matching_mcv(root, rel, attnums, exprs, base_conditions);
 
 	return rel;
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index c7023620a1..85e19ef04c 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1595,6 +1595,12 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
 
+-- test join with system column var, but the ext statistics can't be built in system attribute AND extended statistics
+-- must covers all the join columns, so the following 2 statements can use extended statistics for join.
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin = j2.cmin');
+-- Join with system column expression.
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin::text::int4 = j2.cmin::text::int4');
+
 -- try combining with single-column (and single-expression) statistics
 DROP STATISTICS join_stats_2;
 
-- 
2.34.1

v1-0019-Fix-the-incorrect-comment-on-extended-stats.patch0000644000175000017500000001400214613326575023747 0ustar  yizhi.fzhyizhi.fzhFrom 55251adc429a8b9fcc70409a7b83968584ffca4c Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 9 Apr 2024 17:14:01 +0800
Subject: [PATCH v1 19/22] Fix the incorrect comment on extended stats.

Comments (either extended_stats.c or stats_ext.sql) says we must needs
multiple join clauses, but it has been handled in
clauselist_selectivity_ext already with the below code.

single_clause_optimization
	= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
---
 src/backend/statistics/extended_stats.c | 10 ++--------
 src/test/regress/expected/stats_ext.out | 19 +++++++++++++++----
 src/test/regress/sql/stats_ext.sql      |  4 ----
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 824cd9d279..ca9dcdd556 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3219,14 +3219,8 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 
 /*
  * statext_clauselist_join_selectivity
- *		Use extended stats to estimate join clauses.
- *
- * XXX In principle, we should not restrict this to cases with multiple
- * join clauses - we should consider dependencies with conditions at the
- * base relations, i.e. calculate P(join clause | base restrictions).
- * But currently that does not happen, because clauselist_selectivity_ext
- * treats a single clause as a special case (and we don't apply extended
- * statistics in that case yet).
+ *		Use extended stats to estimate join clauses. the limitation is the
+ * extended statistics must covers all the join clauses.
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 95246522bb..2ec28263be 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3111,8 +3111,6 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
        100 |      0
 (1 row)
 
--- can't be improved due to the optimization in clauselist_selectivity_ext,
--- which skips cases with a single (join) clause
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
  estimated | actual 
 -----------+--------
@@ -3178,8 +3176,6 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
          1 |      0
 (1 row)
 
--- can't be improved due to the optimization in clauselist_selectivity_ext,
--- which skips cases with a single (join) clause
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
  estimated | actual 
 -----------+--------
@@ -3210,6 +3206,21 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
      50000 |  50000
 (1 row)
 
+-- test join with system column var, but the ext statistics can't be built in system attribute AND extended statistics
+-- must covers all the join columns, so the following 2 statements can use extended statistics for join.
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin = j2.cmin');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+-- Join with system column expression.
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin::text::int4 = j2.cmin::text::int4');
+ estimated | actual 
+-----------+--------
+        50 | 100000
+(1 row)
+
 -- try combining with single-column (and single-expression) statistics
 DROP STATISTICS join_stats_2;
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 85e19ef04c..ef3484eb92 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1564,8 +1564,6 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
 
--- can't be improved due to the optimization in clauselist_selectivity_ext,
--- which skips cases with a single (join) clause
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
@@ -1586,8 +1584,6 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
 
--- can't be improved due to the optimization in clauselist_selectivity_ext,
--- which skips cases with a single (join) clause
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
-- 
2.34.1

v1-0020-Add-fastpath-when-combine-the-2-MCV-like-eqjoinse.patch0000644000175000017500000000267214613326575024421 0ustar  yizhi.fzhyizhi.fzhFrom a24e8a3cff363feffe42a05c1359706bbc41585e Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Thu, 25 Apr 2024 13:36:22 +0800
Subject: [PATCH v1 20/22] Add fastpath when combine the 2 MCV like
 eqjoinsel_inner.

when MCV2 exactly matches clauses.
---
 src/backend/statistics/mcv.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5d20ea2ae1..2eebd32e2c 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2481,13 +2481,6 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			if (cmatches2 && !cmatches2[j])
 				continue;
 
-			/*
-			 * XXX We can't skip based on existing matches2 value, because there
-			 * may be duplicates in the first MCV.
-			 *
-			 * From Andy: what does this mean?
-			 */
-
 			/*
 			 * Evaluate if all the join clauses match between the two MCV items.
 			 */
@@ -2537,6 +2530,14 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				/* XXX Do we need to do something about base frequency? */
 				matches1[i] = matches2[j] = true;
 				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+				nmatches += 1;
+
+				if (mcv2->ndimensions == list_length(clauses))
+					/*
+					 * no more items in mcv2 could match mcv1[i] in this case,
+					 * so break fast.
+					 */
+					break;
 			}
 		}
 	}
-- 
2.34.1

v1-0021-When-mcv-ndimensions-list_length-clauses-handle-i.patch0000644000175000017500000000527414613326575025113 0ustar  yizhi.fzhyizhi.fzhFrom 8e9e80cc7b667daf395ef86e402bce7b062a4080 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sun, 28 Apr 2024 08:50:09 +0800
Subject: [PATCH v1 21/22] When mcv->ndimensions == list_length(clauses),
 handle it same as

eqjoinsel_inner, but more testing doesn't show me any benefits from
it. just this commit is just FYI.
---
 src/backend/statistics/mcv.c | 45 +++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 2eebd32e2c..e6ec230045 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2250,7 +2250,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	MCVList    *mcv1,
 			   *mcv2;
 	int			i,
-				j;
+		j,
+		nmatches = 0;
 	Selectivity s = 0;
 
 	/* match bitmaps and selectivity for baserel conditions (if any) */
@@ -2273,7 +2274,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			nd1,
 			totalsel1;
 
-	double 	matchfreq2,
+	double	matchfreq2,
 			unmatchfreq2,
 			otherfreq2,
 			mcvfreq2,
@@ -2626,24 +2627,36 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	nd2 *= csel2;
 
 	totalsel1 = s;
-	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
-	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
 
-//	if (nd2 > mcvb->nitems)
-//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
-//	if (nd2 > nmatches)
-//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
-//			(nd2 - nmatches);
+	if (mcv2->ndimensions == list_length(clauses))
+	{
+		if (nd2 > mcv2->nitems)
+			totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcv2->nitems);
+		if (nd2 > nmatches)
+			totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+				(nd2 - nmatches);
+	}
+	else
+	{
+		totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+	}
 
 	totalsel2 = s;
-	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
-	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
 
-//	if (nd1 > mcva->nitems)
-//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
-//	if (nd1 > nmatches)
-//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
-//			(nd1 - nmatches);
+	if (mcv1->ndimensions == list_length(clauses))
+	{
+		if (nd1 > mcv1->nitems)
+			totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcv1->nitems);
+		if (nd1 > nmatches)
+			totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+				(nd1 - nmatches);
+	}
+	else
+	{
+		totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+	}
 
 	s = Min(totalsel1, totalsel2);
 
-- 
2.34.1

v1-0022-Fix-typo-error-s-grantee-guarantee.patch0000644000175000017500000000255714613326575022161 0ustar  yizhi.fzhyizhi.fzhFrom fe3c24f9bf18229bfd77fe9110b6d3aaa746d264 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sun, 28 Apr 2024 09:35:57 +0800
Subject: [PATCH v1 22/22] Fix typo error, s/grantee/guarantee/.

---
 src/backend/optimizer/path/clausesel.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index c4f5fae9d7..af6c8abf84 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -207,11 +207,11 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * do to detect when this makes sense, but we can check that there are
 	 * join clauses, and that at least some of the rels have stats.
 	 *
-	 * rel != NULL can't grantee the clause is not a join clause, for example
-	 * t1 left join t2 ON t1.a = 3, but it can grantee we can't use extended
+	 * rel != NULL can't guarantee the clause is not a join clause, for example
+	 * t1 left join t2 ON t1.a = 3, but it can guarantee we can't use extended
 	 * statistics for estimation since it has only 1 relid.
 	 *
-	 * XXX: so we can grantee estimatedclauses == NULL now, so estimatedclauses
+	 * XXX: so we can guarantee estimatedclauses == NULL now, so estimatedclauses
 	 * in statext_try_join_estimates is removed.
 	 */
 	if (use_extended_stats && rel == NULL &&
-- 
2.34.1

#21

Andy Fan

zhihuifan1213@163.com

over 1 year ago

In reply to: Justin Pryzby (#19)

Re: using extended statistics to improve join estimates

Hello Justin!

Justin Pryzby <pryzby@telsasoft.com> writes:

|../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be used uninitialized [-Wmaybe-uninitialized]
| 3151 | if (var->varno != relid)
| | ^
|../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was declared here
| 3104 | int relid;
| | ^~~~~
|[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
|../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
|../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ shadows a previous local [-Wshadow=compatible-local]

Thanks for the feedback, the warnning should be fixed in the lastest
revision and 's/estimiatedcluases/estimatedclauses/' typo error in the
commit message is not fixed since I have to regenerate all the commits
to fix that. We are still in dicussion stage and I think these impact is
pretty limited on dicussion.

FYI, I also ran the patch with a $large number of reports without
observing any errors or crashes.

Good to know that.

I'll try to look harder at the next patch revision.

Thank you!

--
Best Regards
Andy Fan

#22

Justin Pryzby

pryzby@telsasoft.com

over 1 year ago

In reply to: Andy Fan (#21)

Re: using extended statistics to improve join estimates

On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:

's/estimiatedcluases/estimatedclauses/' typo error in the
commit message is not fixed since I have to regenerate all the commits

Maybe you know this, but some of these patches need to be squashed.
Regenerating the patches to address feedback is the usual process.
When they're not squished, it makes it hard to review the content of the
patches.

For example:
[PATCH v1 18/22] Fix error "unexpected system attribute" when join with system attr
..adds .sql regression tests, but the expected .out isn't updated until
[PATCH v1 19/22] Fix the incorrect comment on extended stats.

That fixes an elog() in Tomas' original commit, so it should probably be
002 or 003. It might make sense to keep the first commit separate for
now, since it's nice to keep Tomas' original patch "pristine" to make
more apparent the changes you're proposing.

Another:
[PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
..doesn't compile without
[PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same as

Your 022 patch fixes a typo in your 002 patch, which means that first
one reads a patch with a typo, and then later, a 10 line long patch
reflowing the comment with a typo fixed.

A good guideline is that each patch should be self-contained, compiling
and passing tests. Which is more difficult with a long stack of
patches.

--
Justin

#23

Andy Fan

zhihuifan1213@163.com

over 1 year ago

In reply to: Justin Pryzby (#22)

Re: using extended statistics to improve join estimates

Hello Justin,

Thanks for showing interest on this!

On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:

's/estimiatedcluases/estimatedclauses/' typo error in the
commit message is not fixed since I have to regenerate all the commits

Maybe you know this, but some of these patches need to be squashed.
Regenerating the patches to address feedback is the usual process.
When they're not squished, it makes it hard to review the content of the
patches.

You might overlooked the fact that the each individual commit is just to
make the communication effectively (easy to review) and all of them
will be merged into 1 commit at the last / during the process of review.

Even though if something make it hard to review, I am pretty happy to
regenerate the patches, but does 's/estimiatedcluases/estimatedclauses/'
belongs to this category? I'm pretty sure that is not the only typo
error or inapproprate word, if we need to regenerate the 22 patches
because of that, we have to regenerate that pretty often.

Do you mind to provide more feedback once and I can merge all of them in
one modification or you think the typo error has blocked the review
process?

For example:
[PATCH v1 18/22] Fix error "unexpected system attribute" when join with system attr
..adds .sql regression tests, but the expected .out isn't updated until
[PATCH v1 19/22] Fix the incorrect comment on extended stats.

That fixes an elog() in Tomas' original commit, so it should probably be
002 or 003.

Which elog are you talking about?

It might make sense to keep the first commit separate for
now, since it's nice to keep Tomas' original patch "pristine" to make
more apparent the changes you're proposing.

This is my goal as well, did you find anything I did which break this
rule, that's absoluately not my intention.

Another:
[PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
..doesn't compile without
[PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same as

Your 022 patch fixes a typo in your 002 patch, which means that first
one reads a patch with a typo, and then later, a 10 line long patch
reflowing the comment with a typo fixed.

I would like to regenerate the 22 patches if you think the typo error
make the reivew process hard. I can do such things but not willing to
do that often.

A good guideline is that each patch should be self-contained, compiling
and passing tests. Which is more difficult with a long stack of
patches.

I agree.

--
Best Regards
Andy Fan

#24

Andrei Lepikhov

a.lepikhov@postgrespro.ru

over 1 year ago

In reply to: Tomas Vondra (#17)

1 attachment(s)

Re: using extended statistics to improve join estimates

On 4/3/24 01:22, Tomas Vondra wrote:

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

I'm looking at your patch now - an excellent start to an eagerly awaited
feature!
A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the
number of distinct values. Do you forget it?
2. Can we add a clauselist selectivity hook into the core (something
similar the code in attachment)? It can allow the development and
testing of multicolumn join estimations without patching the core.

--
regards,
Andrei Lepikhov
Postgres Professional

Attachments:

clauselist_selectivity_hook.difftext/x-patch; charset=UTF-8; name=clauselist_selectivity_hook.diffDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..271d36a522 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -128,6 +128,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+										 sjinfo, &estimatedclauses);
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..ff98fda08c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..ee28d3ba9b 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,15 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 										   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef bool (*clauselist_selectivity_hook_type) (PlannerInfo *root,
+												  List *clauses,
+												  int varRelid,
+												  JoinType jointype,
+												  SpecialJoinInfo *sjinfo,
+												  Bitmapset **estimatedclauses);
+extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;
+
 /* Functions in selfuncs.c */
 
 extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,

#25

Andy Fan

zhihuifan1213@163.com

over 1 year ago

In reply to: Andrei Lepikhov (#24)

Re: using extended statistics to improve join estimates

Hi Andrei,

On 4/3/24 01:22, Tomas Vondra wrote:

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

I'm looking at your patch now - an excellent start to an eagerly awaited
feature!
A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the
number of distinct values. Do you forget it?

What do you mean the "strategy 'c'"?

2. Can we add a clauselist selectivity hook into the core (something
similar the code in attachment)? It can allow the development and
testing of multicolumn join estimations without patching the core.

The idea LGTM. But do you want

+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+

rather than

+	if (clauselist_selectivity_hook)
+		*return* clauselist_selectivity_hook(root, clauses, ..)

--
Best Regards
Andy Fan

#26

Andrei Lepikhov

a.lepikhov@postgrespro.ru

over 1 year ago

In reply to: Andy Fan (#25)

Re: using extended statistics to improve join estimates

On 20/5/2024 15:52, Andy Fan wrote:

Hi Andrei,

On 4/3/24 01:22, Tomas Vondra wrote:

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

I'm looking at your patch now - an excellent start to an eagerly awaited
feature!
A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the
number of distinct values. Do you forget it?

What do you mean the "strategy 'c'"?

As described in 0001-* patch:
* c) No extended stats with MCV. If there are multiple join clauses,
* we can try using ndistinct coefficients and do what eqjoinsel does.

2. Can we add a clauselist selectivity hook into the core (something
similar the code in attachment)? It can allow the development and
testing of multicolumn join estimations without patching the core.

The idea LGTM. But do you want
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+
rather than
+	if (clauselist_selectivity_hook)
+		*return* clauselist_selectivity_hook(root, clauses, ..)

Of course - library may estimate not all the clauses - it is a reason,
why I added input/output parameter 'estimatedclauses' by analogy with
statext_clauselist_selectivity.

--
regards,
Andrei Lepikhov
Postgres Professional

#27

Andrei Lepikhov

a.lepikhov@postgrespro.ru

over 1 year ago

In reply to: Andrei Lepikhov (#26)

1 attachment(s)

Re: using extended statistics to improve join estimates

On 5/20/24 16:40, Andrei Lepikhov wrote:

On 20/5/2024 15:52, Andy Fan wrote:
+    if (clauselist_selectivity_hook)
+        *return* clauselist_selectivity_hook(root, clauses, ..)
Of course - library may estimate not all the clauses - it is a reason,
why I added input/output parameter 'estimatedclauses' by analogy with
statext_clauselist_selectivity.

Here is a polished and a bit modified version of the hook proposed.
Additionally, I propose exporting the statext_mcv_clauselist_selectivity
routine, likewise dependencies_clauselist_selectivity. This could
potentially enhance the functionality of the PostgreSQL estimation code.

To clarify the purpose, I want an optional, loaded as a library, more
conservative estimation based on distinct statistics. Let's provide (a
bit degenerate) example:

CREATE TABLE is_test(x1 integer, x2 integer, x3 integer, x4 integer);
INSERT INTO is_test (x1,x2,x3,x4)
SELECT x%5,x%7,x%11,x%13 FROM generate_series(1,1E3) AS x;
INSERT INTO is_test (x1,x2,x3,x4)
SELECT 14,14,14,14 FROM generate_series(1,100) AS x;
CREATE STATISTICS ist_stat (dependencies,ndistinct)
ON x1,x2,x3,x4 FROM is_test;
ANALYZE is_test;
EXPLAIN (ANALYZE, COSTS ON, SUMMARY OFF, TIMING OFF)
SELECT * FROM is_test WHERE x1=14 AND x2=14 AND x3=14 AND x4=14;
DROP TABLE is_test CASCADE;

I see:
(cost=0.00..15.17 rows=3 width=16) (actual rows=100 loops=1)

Dependency works great if it is the same for all the data in the
columns. But we get underestimations if we have different laws for
subsets of rows. So, if we don't have MCV statistics, sometimes we need
to pass over dependency statistics and use ndistinct instead.

--
regards,
Andrei Lepikhov
Postgres Professional

Attachments:

clauselist_selectivity_hook.difftext/x-patch; charset=UTF-8; name=clauselist_selectivity_hook.diffDownload

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..1508a1beea 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -128,6 +128,11 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+										 sjinfo, &estimatedclauses,
+										 use_extended_stats);
+
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 99fdf208db..b1722f5a60 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1712,7 +1712,7 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
  */
-static Selectivity
+Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
 								   RelOptInfo *rel, Bitmapset **estimatedclauses,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..ff98fda08c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..436f30bdde 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -104,6 +104,14 @@ extern void BuildRelationExtStatistics(Relation onerel, bool inh, double totalro
 extern int	ComputeExtStatisticsRows(Relation onerel,
 									 int natts, VacAttrStats **vacattrstats);
 extern bool statext_is_kind_built(HeapTuple htup, char type);
+extern Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root,
+													  List *clauses,
+													  int varRelid,
+													  JoinType jointype,
+													  SpecialJoinInfo *sjinfo,
+													   RelOptInfo *rel,
+													   Bitmapset **estimatedclauses,
+													   bool is_or);
 extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
 													   List *clauses,
 													   int varRelid,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..253f584c65 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,17 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 										   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef Selectivity (*clauselist_selectivity_hook_type) (
+												PlannerInfo *root,
+												List *clauses,
+												int varRelid,
+												JoinType jointype,
+												SpecialJoinInfo *sjinfo,
+												Bitmapset **estimatedclauses,
+												bool use_extended_stats);
+extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;
+
 /* Functions in selfuncs.c */
 
 extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,

#28

Andy Fan

zhihuifan1213@163.com

over 1 year ago

In reply to: Andrei Lepikhov (#26)

Re: using extended statistics to improve join estimates

Andrei Lepikhov <a.lepikhov@postgrespro.ru> writes:

On 20/5/2024 15:52, Andy Fan wrote:

Hi Andrei,

On 4/3/24 01:22, Tomas Vondra wrote:

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

I'm looking at your patch now - an excellent start to an eagerly awaited
feature!
A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the
number of distinct values. Do you forget it?

What do you mean the "strategy 'c'"?

As described in 0001-* patch:
* c) No extended stats with MCV. If there are multiple join clauses,
* we can try using ndistinct coefficients and do what eqjoinsel does.

OK, I didn't pay enough attention to this comment before. and yes, I get
the same conclusion as you - there is no implementation of this.

and if so, I think we should remove the comments and do the
implementation in the next patch.

2. Can we add a clauselist selectivity hook into the core (something
similar the code in attachment)? It can allow the development and
testing of multicolumn join estimations without patching the core.
The idea LGTM. But do you want
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+
rather than
+	if (clauselist_selectivity_hook)
+		*return* clauselist_selectivity_hook(root, clauses, ..)
Of course - library may estimate not all the clauses - it is a reason,
why I added input/output parameter 'estimatedclauses' by analogy with
statext_clauselist_selectivity.

OK.

Do you think the hook proposal is closely connected with the current
topic? IIUC it's seems not. So a dedicated thread to explain the problem
to slove and the proposal and the follwing discussion should be helpful
for both topics. I'm just worried about mixing the two in one thread
would make things complexer unnecessarily.

--
Best Regards
Andy Fan

#29

Andrei Lepikhov

a.lepikhov@postgrespro.ru

over 1 year ago

In reply to: Andy Fan (#28)

Re: using extended statistics to improve join estimates

On 5/23/24 09:04, Andy Fan wrote:

Andrei Lepikhov <a.lepikhov@postgrespro.ru> writes:

* c) No extended stats with MCV. If there are multiple join clauses,
* we can try using ndistinct coefficients and do what eqjoinsel does.

OK, I didn't pay enough attention to this comment before. and yes, I get
the same conclusion as you - there is no implementation of this.

and if so, I think we should remove the comments and do the
implementation in the next patch.

I have an opposite opinion about it:
1. distinct estimation is more universal thing - you can use it
precisely on any subset of columns.
2. distinct estimation is faster - it just a number, you don't need to
detoast huge array of values and compare them one-by-one.

So, IMO, it is essential part of join estimation and it should be
implemented like in eqjoinsel.

Do you think the hook proposal is closely connected with the current
topic? IIUC it's seems not. So a dedicated thread to explain the problem
to slove and the proposal and the follwing discussion should be helpful
for both topics. I'm just worried about mixing the two in one thread
would make things complexer unnecessarily.

Sure.

--
regards,
Andrei Lepikhov
Postgres Professional

#30

Tomas Vondra

tomas.vondra@enterprisedb.com

over 1 year ago

In reply to: Andrei Lepikhov (#29)

56 attachment(s)

Re: using extended statistics to improve join estimates

Hi,

I finally got to do a review of the reworked patch series. For the most
part I do like the changes, although I'm not 100% sure about some of
them. I do like that the changes have been kept in separate patches,
which makes it much easier to understand what the goal is etc. But it's
probably time to start merging some of the patches back into the main
patch - it's a bit tedious work with 22 patches.

Note: This needs to be added to the next CF, so that we get cfbot
results and can focus on it in 2024-07. Also, I'd attach the patches
directly, not as .tar.

I did go though the patches one by one, and did a review for each of
them separately. I only had a couple hours for this today, so it's not
super-deep review, more a start for a discussion / asking questions.

For each patch I added a "review" and "pgindent" where review is my
comments, pgindent is the changes pgindent would do (which we now expect
to happen before commit). In hindsight I should have skipped the
pgindent, it made it a more tedious with little benefits. But I realized
that half-way through the series, so it was easier to just continue.

Let me quickly go through the original parts - most of this is already
in the "review" patches, but it's better to quote the main points here
to start a discussion. I'll omit some of the smaller suggestions, so
please look at the 'review' patches.

v20240617-0001-Estimate-joins-using-extended-statistics.patch

- rewords a couple comments, particularly for statext_find_matching_mcv

- a couple XXX comments about possibly stale/inaccurate comments

- suggestion to improve statext_determine_join_restrictions, but we one
of the later patches already does the caching

v20240617-0004-Remove-estimiatedcluases-and-varRelid-argu.patch

- I'm not sure we actually should do this (esp. the removal of
estimatedclauses bitmap). It breaks if we add the new hook.

v20240617-0007-Remove-SpecialJoinInfo-sjinfo-argument.patch
v20240617-0009-Remove-joinType-argument.patch

- I'm skeptical about removing these two. Yes, the current code does not
actually use those fields, but selfuncs.c always passes both jointype
and sjinfo, so maybe we should do that too for consistency. What happens
if we end up wanting to call an existing selfuncs function that needs
these parameters in the future? Say because we want to call the regular
join estimator, and then apply some "correction" to the result?

v20240617-0011-use-the-pre-calculated-RestrictInfo-left-r.patch

- why not to keep the BMS_MULTIPLE check on clause_relids, seems cheap
so maybe we could do it before the more expensive stuff?

v20240617-0014-Fast-path-for-general-clauselist_selectivi.patch

- Does this actually make meaningful difference?

v20240617-0017-a-branch-of-updates-around-JoinPairInfo.patch

- Can we actually assume the clause has a RestrictInfo on top? IIRC
there are cases where we can get here without it (e.g. AND clause?).

v20240617-0020-Cache-the-result-of-statext_determine_join.patch

- This addresses some of my suggestions in 0001, but I think we don't
actually need to recalculate both lists in each loop.

v20240617-0030-optimize-the-order-of-mcv-equal-function-e.patch

- There's no explanation to support this optimization. I guess I know
what it tries to do, but doesn't it have the same issues withu
npredictable behavior like the GROUP BY patch, which ended up reverting
and reworking?

- modifies .sql test but not the expected output

- The McvProc name seems a bit misleading. I think it's really "procs",
for example.

v20240617-0033-Merge-3-palloc-into-1-palloc.patch

- Not sure. It's presented as an optimization to save on palloc calls,
but I doubt that's measurable. Maybe it makes it a little bit more
readable, but now I'm not convinced it's worth it.

v20240617-0036-Remove-2-pull_varnos-calls-with-rinfo-left.patch

- Again, can we rely on this always getting a RestrictInfo? Maybe we do,
but it's not obvious to me, so a comment explaining that would be nice.
And maybe an assert to check this.

v20240617-0040-some-code-refactor-as-before.patch

- Essentially applies earlier refactorings/tweaks to another place.

- Seems OK (depending on whether we agree on those changes), but it
seems mostly independent of this patch series. So I'd at least keep it
in a separate patch.

v20240617-0043-Fix-error-unexpected-system-attribute-when.patch

- seems to only tweak the .sql, not expected output

- One of the comments refers to "above limitation" but I'm unsure what
that's about.

v20240617-0048-Add-fastpath-when-combine-the-2-MCV-like-e.patch
v20240617-0050-When-mcv-ndimensions-list_length-clauses-h.patch

- I'm not sure about one of the opimizations, relying on having a clause
for each dimensions of the MCV.

v20240617-0054-clauselist_selectivity_hook.patch

- I believe this does not work with the earlier patch that removed
estimatedclaused bitmap from the "try" function.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240617-0001-Estimate-joins-using-extended-statistics.patchtext/x-patch; charset=UTF-8; name=v20240617-0001-Estimate-joins-using-extended-statistics.patchDownload

From 5290136dfe7c8041a879ac41541a5b4c60077f9d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH v20240617 01/56] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++++++++++++++++++
 src/backend/statistics/mcv.c                  | 758 +++++++++++++++++
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 167 ++++
 src/test/regress/sql/stats_ext.sql            |  66 ++
 7 files changed, 1890 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e89..bedf76edaec 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -48,6 +48,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -127,12 +130,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
@@ -155,6 +199,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 99fdf208dba..80872cc7daa 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -29,6 +29,7 @@
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "parser/parsetree.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -100,6 +101,8 @@ static StatsBuildData *make_build_data(Relation rel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -2635,3 +2638,805 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_find_matching_mcv
+ *		Search for a MCV covering all the attributes and expressions.
+ *
+ * We pick the statistics to use for join estimation. The statistics object has
+ * to have MCV, and we require it to match all the join conditions, because it
+ * makes the estimation simpler.
+ *
+ * If there are multiple candidate statistics objects (matching all join clauses),
+ * we pick the smallest one, and we also consider additional conditions on
+ * the base relations to restrict the MCV items used for estimation (using
+ * conditional probability).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. We could relax this and and require fewer matches (at least two,
+ * if counting the additional conditions), and we might even apply multiple
+ * statistics etc. But that would require matching statistics on both sides of
+ * the join, while now we simply know the statistics match. We don't really
+ * expect many candidate MCVs, so this simple approach seems sufficient. And
+ * the joins usually use only one or two columns, so there's not much room
+ * for applying multiple statistics anyway.
+ */
+StatisticExtInfo *
+statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+						  Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep this one. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics objects and we need to decide
+		 * which one to keep. We'll use two simple heuristics:
+		 *
+		 * (a) We prefer smaller statistics (fewer columns), on the assumption
+		 * that it represents a larger fraction of the data (due to having fewer
+		 * combinations with higher counts).
+		 *
+		 * (b) If the statistics object covers some additional conditions for the rels,
+		 * that may help with considering additional dependencies between the
+		 * tables.
+		 *
+		 * Of course, those two heuristict are somewhat contradictory - smaller
+		 * stats are less likely to cover as many conditions as a larger one. We
+		 * consider the additional conditions first - if someone created such
+		 * statistics, there probably is a dependency worth considering.
+		 *
+		 * When inspecting the restrictions, we need to be careful - we don't
+		 * know which of them are compatible with extended stats, so we have to
+		 * run them through statext_is_compatible_clause first and then match
+		 * them to the statistics.
+		 *
+		 * XXX Maybe we shouldn't pick statistics that covers just a single join
+		 * clause, without any additional conditions. In such case we could just
+		 * as well pick regular statistics for the column/expression, but it's
+		 * not clear if that actually exists (so we might reject the stats here
+		 * and then fail to find something simpler/better).
+		 */
+		conditions1 = statext_determine_join_restrictions(root, rel, stat);
+		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+		/* if the new statistics object covers more conditions, use it */
+		if (list_length(conditions1) > list_length(conditions2))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/* The statistics seem about equal, so just use the smaller one. */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+		}
+	}
+
+	return mcv;
+}
+
+/*
+ * statext_determine_join_restrictions
+ *		Get restrictions on base relation, covered by the statistics object.
+ *
+ * Returns a list of baserel restrictinfos, compatible with extended statistics
+ * and covered by the extended statistics object.
+ *
+ * When using extended statistics to estimate joins, we can use conditions
+ * from base relations to calculate conditional probability
+ *
+ *    P(join clauses | baserel restrictions)
+ *
+ * which should be a better estimate than just P(join clauses). We want to pick
+ * the statistics object covering the most such conditions.
+ */
+List *
+statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
+									StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this particular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which may be
+ * estimated using extended statistics. Each side must reference just a single
+ * relation for now.
+ *
+ * Similar to treat_as_join_clause, but we place additional restrictions
+ * on the conditions.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	/* has to be an equality condition */
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/*
+	 * Check if there are any unestimated join clauses, collect relids.
+	 *
+	 * XXX Currently this only allows simple OpExpr equality clauses with each
+	 * argument referring to single relation, AND-ed together. Maybe we could
+	 * relax this in the future, e.g. to allow more complex (deeper) expressions
+	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 *
+	 * Handling more complex expressions seems simple - we already do that for
+	 * baserel estimates by building the match bitmap recursively, and we could
+	 * do something similar for combinations of MCV items (a bit like building
+	 * a single bit in the match bitmap). The challenge is what to do about the
+	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 */
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+
+		/* needs to happen before skipping any clauses */
+		listidx++;
+
+		/* Skip clauses that were already estimated. */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+
+		/* Collect relids from all usable clauses. */
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check how
+	 * compatible they are (e.g. that both have MCVs, etc.). We might also
+	 * cross-check the exact joined pairs of rels, but it's supposed to be a
+	 * cheap check, so maybe better leave that for later.
+	 *
+	 * XXX We could also check the number of parameters in each rel to consider
+	 * extended stats. If there's just a single attribute, it's pointless to use
+	 * extended statistics. OTOH we can also consider restriction clauses from
+	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about two joined relations, along with the join clauses between.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc).
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation).
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics object covering the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int			k;
+	int			relid;
+	RelOptInfo *rel;
+	ListCell   *lc;
+	List	   *exprs = NIL;
+
+	Bitmapset  *attnums = NULL;
+
+	Assert((index >= 0) && (index <= 1));
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * get_expression_for_rel
+ *		Extract expression for a given relation from the join clause.
+ *
+ * Given a join clause supported by the extended statistics object (currently
+ * that means just OpExpr clauses with each argument referencing single rel),
+ * return either the left or right argument expression for the rel.
+ *
+ * XXX This should probably return a flag identifying whether it's the
+ * left or right argument.
+ */
+static Node *
+get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
+{
+	OpExpr *opexpr;
+	Node   *expr;
+
+	/*
+	 * Strip the RestrictInfo node, get the actual clause.
+	 *
+	 * XXX Not sure if we need to care about removing other node types
+	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+	 * matches this, but maybe we need to relax it?
+	 */
+	if (IsA(clause, RestrictInfo))
+		clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+	opexpr = (OpExpr *) clause;
+
+	/* Make sure we have the expected node type. */
+	Assert(is_opclause(clause));
+	Assert(list_length(opexpr->args) == 2);
+
+	/* FIXME strip relabel etc. the way examine_opclause_args does */
+	expr = linitial(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	expr = lsecond(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	return NULL;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * a better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX We might also handle cases with a regular MCV on one side and
+	 * an extended MCV on the other side.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		ListCell *lc;
+
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/*
+		 * We can handle three basic cases:
+		 *
+		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
+		 * can simply combine the two MCVs, possibly with additional conditions
+		 * from the relations.
+		 *
+		 * b) Extended stats on one side, regular MCV on the other side (this
+		 * means there's just one join clause / expression). It also means the
+		 * extended stats likely covers at least one extra condition, otherwise
+		 * we could just use regular statistics. We can combine the stats just
+		 * similarly to (a).
+		 *
+		 * c) No extended stats with MCV. If there are multiple join clauses,
+		 * we can try using ndistinct coefficients and do what eqjoinsel does.
+		 *
+		 * If none of these applies, we fallback to the regular selectivity
+		 * estimation in eqjoinsel.
+		 */
+		if (stat1 && stat2)
+		{
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+		}
+		else if (stat1 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel2, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel1, stat1, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else if (stat2 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = (Node *) linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel1, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel2, stat2, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else
+			continue;
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index b0e9aead84e..49299ed9074 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -22,6 +22,8 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
+#include "optimizer/optimizer.h"
 #include "statistics/extended_stats_internal.h"
 #include "statistics/statistics.h"
 #include "utils/array.h"
@@ -2173,3 +2175,759 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	return s;
 }
+
+/*
+ * statext_compare_mcvs
+ *		Calculate join selectivity using extended statistics, similar to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing a
+ * conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column MCV
+ * lists it's obvious that the number of distinct values not covered by the MCV
+ * is (ndistinct - size(MCV)). With multi-column MCVs it's not that simple,
+ * particularly when the conditions are on a subset of the MCV attributes and/or
+ * NULLs are involved. E.g. with MCV (a,b,c) and conditions on (a,b), it's not
+ * clear if the number of (a,b) combinations not covered by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the part of the data not covered by the MCV. So we need
+ * to pick something in between, there's no way to calculate this accurately.
+ */
+Selectivity
+mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	ListCell   *lc;
+
+	MCVList    *mcv1,
+			   *mcv2;
+	int			idx,
+				i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	List   *conditions1 = NIL,
+		   *conditions2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
+
+	double	csel1 = 1.0,
+			csel2 = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two relations */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo   *opprocs;
+	int		   *indexes1,
+			   *indexes2;
+	bool	   *reverse;
+	RangeTblEntry *rte1 = root->simple_rte_array[rel1->relid];
+	RangeTblEntry *rte2 = root->simple_rte_array[rel2->relid];
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert((stat1->kind = STATS_EXT_MCV) && (stat2->kind = STATS_EXT_MCV));
+
+	mcv1 = statext_mcv_load(stat1->statOid, rte1->inh);
+	mcv2 = statext_mcv_load(stat2->statOid, rte2->inh);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
+	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions1)
+	{
+		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+										 stat1->keys, stat1->exprs,
+										 mcv1, false);
+		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+	}
+
+	if (conditions2)
+	{
+		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+										 stat2->keys, stat2->exprs,
+										 mcv2, false);
+		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/*
+	 * Initialize information about clauses and how they match to the MCV
+	 * stats we picked. We do this only once before processing the lists,
+	 * so that we don't have to do that for each MCV item or so.
+	 */
+	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
+	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
+	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
+	idx = 0;
+	foreach (lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if ((bms_singleton_member(relids1) == rel1->relid) &&
+			(bms_singleton_member(relids2) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			reverse[idx] = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if ((bms_singleton_member(relids2) == rel1->relid) &&
+				 (bms_singleton_member(relids1) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			reverse[idx] = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((indexes1[idx] >= 0) &&
+			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+
+		Assert((indexes2[idx] >= 0) &&
+			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		idx++;
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		bool	has_nulls;
+
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		/*
+		 * Check if any value in the first MCV item is NULL, because it'll be
+		 * mismatch anyway.
+		 *
+		 * XXX This might not work for some join clauses, e.g. IS NOT DISTINCT
+		 * FROM, but those are currently not considered compatible (we only
+		 * allow OpExpr at the moment).
+		 */
+		has_nulls = false;
+		for (j = 0; j < list_length(clauses); j++)
+			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+
+		if (has_nulls)
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			int			idx;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (cmatches2 && !cmatches2[j])
+				continue;
+
+			/*
+			 * XXX We can't skip based on existing matches2 value, because there
+			 * may be duplicates in the first MCV.
+			 */
+
+			/*
+			 * Evaluate if all the join clauses match between the two MCV items.
+			 *
+			 * XXX We might optimize the order of evaluation, using the costs of
+			 * operator functions for individual columns. It does depend on the
+			 * number of distinct values, etc.
+			 */
+			idx = 0;
+			foreach (lc, clauses)
+			{
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+
+				/* If either value is null, it's a mismatch */
+				if (mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * Careful about order of parameters. For same-type equality
+					 * that should not matter, but easy enough.
+					 *
+					 * FIXME Use appropriate collation.
+					 */
+					if (reverse_args)
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value2, value1));
+					else
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+
+				idx++;
+			}
+
+			if (items_match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches2 && !cmatches2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel1->rows, NULL, NULL);
+	nd2 = estimate_num_groups(root, exprs2, rel2->rows, NULL, NULL);
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel1;
+	nd2 *= csel2;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+
+/*
+ * statext_compare_simple
+ *		Calculate join selectivity using a combination of extended
+ * statistics MCV on one side, and simple per-column MCV on the other.
+ *
+ * Most of the mcv_combine_extended comment applies here too, but we can make
+ * some simplifications because we know the second (per-column) MCV is simpler,
+ * contains no NULL or duplicate values, etc.
+ */
+Selectivity
+mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
+				   AttStatsSlot *sslot, double stanullfrac, double nd,
+				   bool isdefault, Node *clause)
+{
+	MCVList    *mcv;
+	int			i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *conditions = NIL;
+	bool   *cmatches = NULL;
+
+	double	csel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two sides */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo	opproc;
+	int			index = 0;
+	bool		reverse = false;
+	RangeTblEntry *rte = root->simple_rte_array[rel->relid];
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert(stat->kind = STATS_EXT_MCV);
+
+	mcv = statext_mcv_load(stat->statOid, rte->inh);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions = statext_determine_join_restrictions(root, rel, stat);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions)
+	{
+		cmatches = mcv_get_match_bitmap(root, conditions,
+										 stat->keys, stat->exprs,
+										 mcv, false);
+		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
+
+	/* Matches for the side with just regular single-column MCV. */
+	matches2 = (bool *) palloc0(sizeof(bool) * sslot->nvalues);
+
+	/*
+	 * Initialize information about the clause and how it matches to the
+	 * extended stats we picked. We do this only once before processing
+	 * the lists, so that we don't have to do that for each item or so.
+	 */
+	{
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opproc);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if (bms_singleton_member(relids1) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
+										 &collid);
+			reverse = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if (bms_singleton_member(relids2) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
+										 &collid);
+			reverse = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((index >= 0) &&
+			   (index < bms_num_members(stat->keys) + list_length(stat->exprs)));
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform something like
+	 * merge join. Or we might calculate a hash from the join columns, and then
+	 * compare this (to eliminate the most expensive equality functions).
+	 */
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		/*
+		 * We can check mcv1->items[i].isnull[index1] here, because it'll be a
+		 * mismatch anyway (the simple MCV does not contain NULLs).
+		 */
+		if (mcv->items[i].isnull[index])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < sslot->nvalues; j++)
+		{
+			bool	match;
+			Datum	value1 = mcv->items[i].values[index];
+			Datum	value2 = sslot->values[j];
+
+			/*
+			 * Evaluate the join clause between the two MCV lists. We don't
+			 * need to deal with NULL values here - we've already checked for
+			 * NULL in the extended statistics earlier, and the simple MCV
+			 * does not contain NULL values.
+			 *
+			 * Careful about order of parameters. For same-type equality
+			 * that should not matter, but easy enough.
+			 *
+			 * FIXME Use appropriate collation.
+			 */
+			if (reverse)
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value2, value1));
+			else
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value1, value2));
+
+			if (match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv->items[i].frequency * sslot->numbers[j];
+
+				/*
+				 * We know there can be just a single match in the regular
+				 * MCV list, so we can abort the inner loop.
+				 */
+				break;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		mcvfreq1 += mcv->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv->items[i].frequency;
+		else
+			unmatchfreq1 += mcv->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < sslot->nvalues; i++)
+	{
+		mcvfreq2 += sslot->numbers[i];
+
+		if (matches2[i])
+			matchfreq2 += sslot->numbers[i];
+		else
+			unmatchfreq2 += sslot->numbers[i];
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel->rows, NULL, NULL);
+	nd2 = nd;
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 8eed9b338d4..a85f896d53a 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,6 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
+#include "utils/lsyscache.h"
 #include "utils/sortsupport.h"
 
 typedef struct
@@ -127,4 +128,23 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern Selectivity mcv_combine_simple(PlannerInfo *root,
+									  RelOptInfo *rel,
+									  StatisticExtInfo *stat,
+									  AttStatsSlot *sslot,
+									  double stanullfrac,
+									  double nd, bool isdefault,
+									  Node *clause);
+
+extern Selectivity mcv_combine_extended(PlannerInfo *root,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										StatisticExtInfo *stat1,
+										StatisticExtInfo *stat2,
+										List *clauses);
+
+extern List *statext_determine_join_restrictions(PlannerInfo *root,
+												 RelOptInfo *rel,
+												 StatisticExtInfo *info);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716d..60b222028d8 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -127,4 +127,16 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
+extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
+
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 8c4da955084..b08bf951e4d 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3074,6 +3074,173 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+       250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+        75 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+      1250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+      1000 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+    100000 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+     30000 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 0c08a6cc42e..e372fffebfb 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1547,6 +1547,72 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.45.2

v20240617-0002-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0002-review.patchDownload

From 22e956fd80c076c0e836674c49447afb3cd84bfc Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 14:10:49 +0200
Subject: [PATCH v20240617 02/56] review

---
 src/backend/optimizer/path/clausesel.c        | 29 ++++----
 src/backend/statistics/extended_stats.c       | 72 +++++++++++++------
 .../statistics/extended_stats_internal.h      |  2 +-
 3 files changed, 69 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index bedf76edaec..24e3c9729a3 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -130,22 +130,26 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+
+	/* skip expensive processing when estimating a single clause */
 	bool		single_clause_optimization = true;
 
 	/*
-	 * The optimization of skipping to clause_selectivity_ext for single
-	 * clauses means we can't improve join estimates with a single join
-	 * clause but additional baserel restrictions. So we disable it when
-	 * estimating joins.
+	 * Disable the single-clause optimization when estimating a join clause.
+	 *
+	 * The optimization skips clause_selectivity_ext when estimating a single
+	 * clause, but for join clauses it would mean we can't consider both the
+	 * join clause and the baserel restrictions. So we disable the optimization
+	 * when estimating a join clause.
 	 *
-	 * XXX Not sure if this is the right way to do it, but more elaborate
-	 * checks would mostly negate the whole point of the optimization.
-	 * The (Var op Var) patch has the same issue.
+	 * XXX Not sure if this is the best way to deal with the optimization. We
+	 * could make it more elaborate in various ways, but increasing the cost
+	 * of the checks might negate the whole point of the optimization.
 	 *
-	 * XXX An alternative might be making clause_selectivity_ext smarter
-	 * and make it use the join extended stats there. But that seems kinda
-	 * against the whole point of the optimization (skipping expensive
-	 * stuff) and it's making other parts more complex.
+	 * XXX Alternatively we could make clause_selectivity_ext smarter and
+	 * combine the join clauses and baserel restrictions there. But that seems
+	 * somewhat against the whole point of the optimization (skipping expensive
+	 * stuff) and it'd making other parts more complex.
 	 *
 	 * XXX Maybe this should check if there are at least some restrictions
 	 * on some base relations, which seems important. But then again, that
@@ -164,6 +168,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
 			clause = (Node *) rinfo->clause;
 		}
 
+		/* disable optimization for join clauses */
 		single_clause_optimization
 			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
 	}
@@ -209,7 +214,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 */
 	if (use_extended_stats &&
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
-						 estimatedclauses))
+								   estimatedclauses))
 	{
 		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
 												  jointype, sjinfo,
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 80872cc7daa..d3e1dde73d1 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2643,23 +2643,28 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
  * statext_find_matching_mcv
  *		Search for a MCV covering all the attributes and expressions.
  *
- * We pick the statistics to use for join estimation. The statistics object has
- * to have MCV, and we require it to match all the join conditions, because it
- * makes the estimation simpler.
+ * Picks the extended statistics object to estimate join clause. The statistics
+ * object has to have a MCV, and we require it to match all the join conditions
+ * (be it plain attribute or an expression), as it makes the estimation simpler.
  *
- * If there are multiple candidate statistics objects (matching all join clauses),
- * we pick the smallest one, and we also consider additional conditions on
- * the base relations to restrict the MCV items used for estimation (using
- * conditional probability).
+ * If there are multiple applicable candidate statistics objects (matching all
+ * join clauses), picks the narrowest one. But we also consider additional
+ * restrictions on base relations that we can use to filter the MCV items
+ * (to calculate conditional probability).
+ *
+ * XXX How exactly this balances the "narrowest" and "additional conditions"?
+ * Which criteria we prefer?
  *
  * XXX The requirement that all the attributes need to be covered might be
  * too strong. We could relax this and and require fewer matches (at least two,
  * if counting the additional conditions), and we might even apply multiple
  * statistics etc. But that would require matching statistics on both sides of
- * the join, while now we simply know the statistics match. We don't really
- * expect many candidate MCVs, so this simple approach seems sufficient. And
- * the joins usually use only one or two columns, so there's not much room
- * for applying multiple statistics anyway.
+ * the join (using a statistics on a subset of conditions one one side means
+ * we need a matching statistics on the other side too). While now we simply
+ * know the statistics will match. We don't really expect many candidate MCVs,
+ * so this simple approach seems sufficient. And the joins usually use only one
+ * or two columns, so there's not much room for applying multiple statistics
+ * anyway.
  */
 StatisticExtInfo *
 statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
@@ -2679,12 +2684,7 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		if (stat->kind != STATS_EXT_MCV)
 			continue;
 
-		/*
-		 * Ignore MCVs not covering all the attributes/expressions.
-		 *
-		 * XXX Maybe we shouldn't be so strict and consider only partial
-		 * matches for join clauses too?
-		 */
+		/* Ignore MCVs not covering all the attributes/expressions. */
 		if (!bms_is_subset(attnums, stat->keys) ||
 			!stat_covers_expressions(stat, exprs, NULL))
 			continue;
@@ -2697,8 +2697,8 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/*
-		 * OK, we have two candidate statistics objects and we need to decide
-		 * which one to keep. We'll use two simple heuristics:
+		 * We have two candidate statistics objects and we need to decide which
+		 * one to keep. We'll use two simple heuristics:
 		 *
 		 * (a) We prefer smaller statistics (fewer columns), on the assumption
 		 * that it represents a larger fraction of the data (due to having fewer
@@ -2723,6 +2723,23 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		 * as well pick regular statistics for the column/expression, but it's
 		 * not clear if that actually exists (so we might reject the stats here
 		 * and then fail to find something simpler/better).
+		 *
+		 * XXX I'm not sure about the preceding comment. Why would we find a MCV
+		 * list for a single condition here, but not for the single attribute?
+		 * Would the "partial" extended MCV even be useful?
+		 */
+
+		/*
+		 * Match additional baserel conditions for the two statistics.
+		 *
+		 * XXX Shouldn't we keep this too? If there are more than 2 candidates,
+		 * we'll end up recalculating the conditions for the statistics we kept
+		 * from the preceding loop. Perhaps we could/should even pass the
+		 * conditions to the caller?
+		 *
+		 * XXX Or maybe we should simply "count" the restrictions here, instead
+		 * of constructing a list? Probably not a meaningful difference in CPU
+		 * costs or a memory leak.
 		 */
 		conditions1 = statext_determine_join_restrictions(root, rel, stat);
 		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
@@ -2734,7 +2751,10 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 			continue;
 		}
 
-		/* The statistics seem about equal, so just use the smaller one. */
+		/* The statistics seem about equal, so just use the narrower one.
+		 *
+		 * XXX Maybe we should have a function/macro to count the keys/exprs?
+		 */
 		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
 			bms_num_members(stat->keys) + list_length(stat->exprs))
 		{
@@ -2803,7 +2823,7 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
  * relation for now.
  *
  * Similar to treat_as_join_clause, but we place additional restrictions
- * on the conditions.
+ * on the conditions, to make sure it can be estimated using extended stats.
  */
 static bool
 statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
@@ -2900,6 +2920,8 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
  *
  * This is supposed to be a quick/cheap check to decide whether to expend
  * more effort on applying extended statistics to join clauses.
+ *
+ * XXX Probably should document arguments, see statext_mcv_clauselist_selectivity.
  */
 bool
 statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
@@ -3098,6 +3120,10 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
  *
  * XXX Can we have cases with indexes above 1? Probably for clauses mixing
  * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ *
+ * XXX Name should probably start with statext_ too.
+ *
+ * XXX The 0/1 index seems a bit weird. Is there a better way to do this?
  */
 static RelOptInfo *
 extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
@@ -3196,6 +3222,8 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
  *
  * XXX This should probably return a flag identifying whether it's the
  * left or right argument.
+ *
+ * XXX Name should probably start with statext_ too.
  */
 static Node *
 get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
@@ -3241,6 +3269,8 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  * But currently that does not happen, because clauselist_selectivity_ext
  * treats a single clause as a special case (and we don't apply extended
  * statistics in that case yet).
+ *
+ * XXX Isn't the preceding comment stale? We skip the optimization, no?
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index a85f896d53a..f156fda555e 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,7 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
-#include "utils/lsyscache.h"
+#include "utils/lsyscache.h"		/* XXX is this needed? */
 #include "utils/sortsupport.h"
 
 typedef struct
-- 
2.45.2

v20240617-0003-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0003-pgindent.patchDownload

From 16b7db4484b2d6a75beaf62c35d8237127c1dcb0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 14:11:14 +0200
Subject: [PATCH v20240617 03/56] pgindent

---
 src/backend/optimizer/path/clausesel.c        |  26 +-
 src/backend/statistics/extended_stats.c       | 324 +++++++++---------
 src/backend/statistics/mcv.c                  | 319 ++++++++---------
 .../statistics/extended_stats_internal.h      |   2 +-
 src/include/statistics/statistics.h           |   2 +-
 src/tools/pgindent/typedefs.list              |   1 +
 6 files changed, 348 insertions(+), 326 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 24e3c9729a3..871d73e3b4f 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -139,8 +139,8 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 *
 	 * The optimization skips clause_selectivity_ext when estimating a single
 	 * clause, but for join clauses it would mean we can't consider both the
-	 * join clause and the baserel restrictions. So we disable the optimization
-	 * when estimating a join clause.
+	 * join clause and the baserel restrictions. So we disable the
+	 * optimization when estimating a join clause.
 	 *
 	 * XXX Not sure if this is the best way to deal with the optimization. We
 	 * could make it more elaborate in various ways, but increasing the cost
@@ -148,18 +148,18 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 *
 	 * XXX Alternatively we could make clause_selectivity_ext smarter and
 	 * combine the join clauses and baserel restrictions there. But that seems
-	 * somewhat against the whole point of the optimization (skipping expensive
-	 * stuff) and it'd making other parts more complex.
+	 * somewhat against the whole point of the optimization (skipping
+	 * expensive stuff) and it'd making other parts more complex.
 	 *
-	 * XXX Maybe this should check if there are at least some restrictions
-	 * on some base relations, which seems important. But then again, that
-	 * seems to go against the idea of this check to be cheap. Moreover, it
-	 * won't work for OR clauses, which may have multiple parts but we still
-	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 * XXX Maybe this should check if there are at least some restrictions on
+	 * some base relations, which seems important. But then again, that seems
+	 * to go against the idea of this check to be cheap. Moreover, it won't
+	 * work for OR clauses, which may have multiple parts but we still see
+	 * them as a single BoolExpr clause (it doesn't work later, though).
 	 */
 	if (list_length(clauses) == 1)
 	{
-		Node *clause = linitial(clauses);
+		Node	   *clause = linitial(clauses);
 		RestrictInfo *rinfo = NULL;
 
 		if (IsA(clause, RestrictInfo))
@@ -205,9 +205,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	}
 
 	/*
-	 * Try applying extended statistics to joins. There's not much we can
-	 * do to detect when this makes sense, but we can check that there are
-	 * join clauses, and that at least some of the rels have stats.
+	 * Try applying extended statistics to joins. There's not much we can do
+	 * to detect when this makes sense, but we can check that there are join
+	 * clauses, and that at least some of the rels have stats.
 	 *
 	 * XXX Isn't this mutually exclusive with the preceding block which
 	 * calculates estimates for a single relation?
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index d3e1dde73d1..25b4d486a09 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2672,13 +2672,13 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 {
 	ListCell   *l;
 	StatisticExtInfo *mcv = NULL;
-	List *stats = rel->statlist;
+	List	   *stats = rel->statlist;
 
 	foreach(l, stats)
 	{
 		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
-		List *conditions1 = NIL,
-			 *conditions2 = NIL;
+		List	   *conditions1 = NIL,
+				   *conditions2 = NIL;
 
 		/* We only care about MCV statistics here. */
 		if (stat->kind != STATS_EXT_MCV)
@@ -2697,49 +2697,51 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/*
-		 * We have two candidate statistics objects and we need to decide which
-		 * one to keep. We'll use two simple heuristics:
+		 * We have two candidate statistics objects and we need to decide
+		 * which one to keep. We'll use two simple heuristics:
 		 *
 		 * (a) We prefer smaller statistics (fewer columns), on the assumption
-		 * that it represents a larger fraction of the data (due to having fewer
-		 * combinations with higher counts).
+		 * that it represents a larger fraction of the data (due to having
+		 * fewer combinations with higher counts).
 		 *
-		 * (b) If the statistics object covers some additional conditions for the rels,
-		 * that may help with considering additional dependencies between the
-		 * tables.
+		 * (b) If the statistics object covers some additional conditions for
+		 * the rels, that may help with considering additional dependencies
+		 * between the tables.
 		 *
-		 * Of course, those two heuristict are somewhat contradictory - smaller
-		 * stats are less likely to cover as many conditions as a larger one. We
-		 * consider the additional conditions first - if someone created such
-		 * statistics, there probably is a dependency worth considering.
+		 * Of course, those two heuristict are somewhat contradictory -
+		 * smaller stats are less likely to cover as many conditions as a
+		 * larger one. We consider the additional conditions first - if
+		 * someone created such statistics, there probably is a dependency
+		 * worth considering.
 		 *
 		 * When inspecting the restrictions, we need to be careful - we don't
-		 * know which of them are compatible with extended stats, so we have to
-		 * run them through statext_is_compatible_clause first and then match
-		 * them to the statistics.
+		 * know which of them are compatible with extended stats, so we have
+		 * to run them through statext_is_compatible_clause first and then
+		 * match them to the statistics.
 		 *
-		 * XXX Maybe we shouldn't pick statistics that covers just a single join
-		 * clause, without any additional conditions. In such case we could just
-		 * as well pick regular statistics for the column/expression, but it's
-		 * not clear if that actually exists (so we might reject the stats here
-		 * and then fail to find something simpler/better).
+		 * XXX Maybe we shouldn't pick statistics that covers just a single
+		 * join clause, without any additional conditions. In such case we
+		 * could just as well pick regular statistics for the
+		 * column/expression, but it's not clear if that actually exists (so
+		 * we might reject the stats here and then fail to find something
+		 * simpler/better).
 		 *
-		 * XXX I'm not sure about the preceding comment. Why would we find a MCV
-		 * list for a single condition here, but not for the single attribute?
-		 * Would the "partial" extended MCV even be useful?
+		 * XXX I'm not sure about the preceding comment. Why would we find a
+		 * MCV list for a single condition here, but not for the single
+		 * attribute? Would the "partial" extended MCV even be useful?
 		 */
 
 		/*
 		 * Match additional baserel conditions for the two statistics.
 		 *
-		 * XXX Shouldn't we keep this too? If there are more than 2 candidates,
-		 * we'll end up recalculating the conditions for the statistics we kept
-		 * from the preceding loop. Perhaps we could/should even pass the
-		 * conditions to the caller?
+		 * XXX Shouldn't we keep this too? If there are more than 2
+		 * candidates, we'll end up recalculating the conditions for the
+		 * statistics we kept from the preceding loop. Perhaps we could/should
+		 * even pass the conditions to the caller?
 		 *
-		 * XXX Or maybe we should simply "count" the restrictions here, instead
-		 * of constructing a list? Probably not a meaningful difference in CPU
-		 * costs or a memory leak.
+		 * XXX Or maybe we should simply "count" the restrictions here,
+		 * instead of constructing a list? Probably not a meaningful
+		 * difference in CPU costs or a memory leak.
 		 */
 		conditions1 = statext_determine_join_restrictions(root, rel, stat);
 		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
@@ -2751,7 +2753,8 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 			continue;
 		}
 
-		/* The statistics seem about equal, so just use the narrower one.
+		/*
+		 * The statistics seem about equal, so just use the narrower one.
 		 *
 		 * XXX Maybe we should have a function/macro to count the keys/exprs?
 		 */
@@ -2788,12 +2791,12 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
 	List	   *conditions = NIL;
 
 	/* extract conditions that may be applied to the MCV list */
-	foreach (lc, rel->baserestrictinfo)
+	foreach(lc, rel->baserestrictinfo)
 	{
 		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
-		Bitmapset *indexes = NULL;
-		Bitmapset *attnums = NULL;
-		List *exprs = NIL;
+		Bitmapset  *indexes = NULL;
+		Bitmapset  *attnums = NULL;
+		List	   *exprs = NIL;
 
 		/* clause has to be supported by MCV in general */
 		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
@@ -2801,8 +2804,8 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
 			continue;
 
 		/*
-		 * clause is compatible in general, but is it actually covered
-		 * by this particular statistics object?
+		 * clause is compatible in general, but is it actually covered by this
+		 * particular statistics object?
 		 */
 		if (!bms_is_subset(attnums, info->keys) ||
 			!stat_covers_expressions(info, exprs, &indexes))
@@ -2829,10 +2832,10 @@ static bool
 statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
 								 int varRelid, SpecialJoinInfo *sjinfo)
 {
-	Oid	oprsel;
-	RestrictInfo   *rinfo;
-	OpExpr		   *opclause;
-	ListCell	   *lc;
+	Oid			oprsel;
+	RestrictInfo *rinfo;
+	OpExpr	   *opclause;
+	ListCell   *lc;
 
 	/*
 	 * evaluation as a restriction clause, either at scan node or forced
@@ -2871,43 +2874,43 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
 	 * Make sure we're not mixing vars from multiple relations on the same
 	 * side, like
 	 *
-	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 * (t1.a + t2.a) = (t1.b + t2.b)
 	 *
 	 * which is still technically an opclause, but we can't match it to
 	 * extended statistics in a simple way.
 	 *
 	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
 	 *
-	 * XXX Also check it's not expression on system attributes, which we
-	 * don't allow in extended statistics.
+	 * XXX Also check it's not expression on system attributes, which we don't
+	 * allow in extended statistics.
 	 *
-	 * XXX Although maybe we could allow cases that combine expressions
-	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
-	 * or something like that. We could do "cartesian product" of the MCV
-	 * stats and restrict it using this condition.
+	 * XXX Although maybe we could allow cases that combine expressions from
+	 * both relations on either side? Like (t1.a + t2.b = t1.c - t2.d) or
+	 * something like that. We could do "cartesian product" of the MCV stats
+	 * and restrict it using this condition.
 	 */
-	foreach (lc, opclause->args)
+	foreach(lc, opclause->args)
 	{
-		Bitmapset *varnos = NULL;
-		Node *expr = (Node *) lfirst(lc);
+		Bitmapset  *varnos = NULL;
+		Node	   *expr = (Node *) lfirst(lc);
 
 		varnos = pull_varnos(root, expr);
 
 		/*
 		 * No argument should reference more than just one relation.
 		 *
-		 * This effectively means each side references just two relations.
-		 * If there's no relation on one side, it's a Const, and the other
-		 * side has to be either Const or Expr with a single rel. In which
-		 * case it can't be a join clause.
+		 * This effectively means each side references just two relations. If
+		 * there's no relation on one side, it's a Const, and the other side
+		 * has to be either Const or Expr with a single rel. In which case it
+		 * can't be a join clause.
 		 */
 		if (bms_num_members(varnos) > 1)
 			return false;
 
 		/*
-		 * XXX Maybe check that both relations have extended statistics
-		 * (no point in considering the clause as useful without it). But
-		 * we'll do that check later anyway, so keep this cheap.
+		 * XXX Maybe check that both relations have extended statistics (no
+		 * point in considering the clause as useful without it). But we'll do
+		 * that check later anyway, so keep this cheap.
 		 */
 	}
 
@@ -2945,19 +2948,21 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 	 *
 	 * XXX Currently this only allows simple OpExpr equality clauses with each
 	 * argument referring to single relation, AND-ed together. Maybe we could
-	 * relax this in the future, e.g. to allow more complex (deeper) expressions
-	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 * relax this in the future, e.g. to allow more complex (deeper)
+	 * expressions and to allow OR-ed join clauses too. And maybe supporting
+	 * inequalities.
 	 *
 	 * Handling more complex expressions seems simple - we already do that for
-	 * baserel estimates by building the match bitmap recursively, and we could
-	 * do something similar for combinations of MCV items (a bit like building
-	 * a single bit in the match bitmap). The challenge is what to do about the
-	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 * baserel estimates by building the match bitmap recursively, and we
+	 * could do something similar for combinations of MCV items (a bit like
+	 * building a single bit in the match bitmap). The challenge is what to do
+	 * about the part not represented by MCV, which is now based on ndistinct
+	 * estimates.
 	 */
 	listidx = -1;
-	foreach (lc, clauses)
+	foreach(lc, clauses)
 	{
-		Node *clause = (Node *) lfirst(lc);
+		Node	   *clause = (Node *) lfirst(lc);
 		RestrictInfo *rinfo;
 
 		/* needs to happen before skipping any clauses */
@@ -2968,15 +2973,15 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 			continue;
 
 		/*
-		 * Skip clauses that are not join clauses or that we don't know
-		 * how to handle estimate using extended statistics.
+		 * Skip clauses that are not join clauses or that we don't know how to
+		 * handle estimate using extended statistics.
 		 */
 		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
 			continue;
 
 		/*
-		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
-		 * in statext_is_supported_join_clause.
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks in
+		 * statext_is_supported_join_clause.
 		 */
 		rinfo = (RestrictInfo *) clause;
 
@@ -3003,15 +3008,17 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 	 * cross-check the exact joined pairs of rels, but it's supposed to be a
 	 * cheap check, so maybe better leave that for later.
 	 *
-	 * XXX We could also check the number of parameters in each rel to consider
-	 * extended stats. If there's just a single attribute, it's pointless to use
-	 * extended statistics. OTOH we can also consider restriction clauses from
-	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 * XXX We could also check the number of parameters in each rel to
+	 * consider extended stats. If there's just a single attribute, it's
+	 * pointless to use extended statistics. OTOH we can also consider
+	 * restriction clauses from baserestrictinfo and use them to calculate
+	 * conditional probabilities.
 	 */
 	k = -1;
 	while ((k = bms_next_member(relids, k)) >= 0)
 	{
 		RelOptInfo *rel = find_base_rel(root, k);
+
 		if (rel->statlist)
 			return true;
 	}
@@ -3045,10 +3052,10 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
 						 JoinType jointype, SpecialJoinInfo *sjinfo,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
-	int				cnt;
-	int				listidx;
-	JoinPairInfo   *info;
-	ListCell	   *lc;
+	int			cnt;
+	int			listidx;
+	JoinPairInfo *info;
+	ListCell   *lc;
 
 	/*
 	 * Assume each clause is for a different pair of relations (some of them
@@ -3061,10 +3068,10 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
 	listidx = -1;
 	foreach(lc, clauses)
 	{
-		int				i;
-		bool			found;
-		Node		   *clause = (Node *) lfirst(lc);
-		RestrictInfo   *rinfo;
+		int			i;
+		bool		found;
+		Node	   *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
 
 		listidx++;
 
@@ -3073,9 +3080,9 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
 			continue;
 
 		/*
-		 * Make sure the clause is a join clause of a supported shape (at
-		 * the moment we support just (Expr op Expr) clauses with each
-		 * side referencing just a single relation).
+		 * Make sure the clause is a join clause of a supported shape (at the
+		 * moment we support just (Expr op Expr) clauses with each side
+		 * referencing just a single relation).
 		 */
 		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
 			continue;
@@ -3153,23 +3160,23 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 	rel = find_base_rel(root, relid);
 
 	/*
-	 * Walk the clauses for this join pair, and extract expressions about
-	 * the relation identified by index / relid. For simple Vars we extract
-	 * the attnum. Otherwise we keep the whole expression.
+	 * Walk the clauses for this join pair, and extract expressions about the
+	 * relation identified by index / relid. For simple Vars we extract the
+	 * attnum. Otherwise we keep the whole expression.
 	 */
-	foreach (lc, info->clauses)
+	foreach(lc, info->clauses)
 	{
-		ListCell *lc2;
-		Node *clause = (Node *) lfirst(lc);
-		OpExpr *opclause = (OpExpr *) clause;
+		ListCell   *lc2;
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opclause = (OpExpr *) clause;
 
 		/* only opclauses supported for now */
 		Assert(is_opclause(clause));
 
-		foreach (lc2, opclause->args)
+		foreach(lc2, opclause->args)
 		{
-			Node *arg = (Node *) lfirst(lc2);
-			Bitmapset *varnos = NULL;
+			Node	   *arg = (Node *) lfirst(lc2);
+			Bitmapset  *varnos = NULL;
 
 			/* plain Var references (boolean Vars or recursive checks) */
 			if (IsA(arg, Var))
@@ -3184,7 +3191,10 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 				if (var->varlevelsup > 0)
 					continue;
 
-				/* Also skip system attributes (we don't allow stats on those). */
+				/*
+				 * Also skip system attributes (we don't allow stats on
+				 * those).
+				 */
 				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
 					elog(ERROR, "unexpected system attribute");
 
@@ -3195,10 +3205,9 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 			}
 
 			/*
-			 * OK, it's a more complex expression, so check if it matches
-			 * the relid and maybe keep it as a whole. It should be
-			 * compatible because we already checked it when building the
-			 * join pairs.
+			 * OK, it's a more complex expression, so check if it matches the
+			 * relid and maybe keep it as a whole. It should be compatible
+			 * because we already checked it when building the join pairs.
 			 */
 			varnos = pull_varnos(root, arg);
 
@@ -3228,15 +3237,15 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 static Node *
 get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 {
-	OpExpr *opexpr;
-	Node   *expr;
+	OpExpr	   *opexpr;
+	Node	   *expr;
 
 	/*
 	 * Strip the RestrictInfo node, get the actual clause.
 	 *
-	 * XXX Not sure if we need to care about removing other node types
-	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
-	 * matches this, but maybe we need to relax it?
+	 * XXX Not sure if we need to care about removing other node types too
+	 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches this,
+	 * but maybe we need to relax it?
 	 */
 	if (IsA(clause, RestrictInfo))
 		clause = (Node *) ((RestrictInfo *) clause)->clause;
@@ -3279,10 +3288,10 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 {
 	int			i;
 	int			listidx;
-	Selectivity	s = 1.0;
+	Selectivity s = 1.0;
 
 	JoinPairInfo *info;
-	int				ninfo;
+	int			ninfo;
 
 	if (!clauses)
 		return 1.0;
@@ -3298,27 +3307,27 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 	/*
 	 * Process the join pairs, try to find a matching MCV on each side.
 	 *
-	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
-	 * we try to find a MCV on both sides of the join, and use it to get
-	 * a better join estimate. It's a bit more complicated, because there
-	 * might be multiple MCV lists, we also need ndistinct estimate, and
-	 * there may be interesting baserestrictions too.
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e. we
+	 * try to find a MCV on both sides of the join, and use it to get a better
+	 * join estimate. It's a bit more complicated, because there might be
+	 * multiple MCV lists, we also need ndistinct estimate, and there may be
+	 * interesting baserestrictions too.
 	 *
-	 * XXX At the moment we only handle the case with matching MCVs on
-	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * XXX At the moment we only handle the case with matching MCVs on both
+	 * sides, but it'd be good to also handle case with just ndistinct
 	 * statistics improving ndistinct estimates.
 	 *
-	 * XXX We might also handle cases with a regular MCV on one side and
-	 * an extended MCV on the other side.
+	 * XXX We might also handle cases with a regular MCV on one side and an
+	 * extended MCV on the other side.
 	 *
-	 * XXX Perhaps it'd be good to also handle case with one side only
-	 * having "regular" statistics (e.g. MCV), especially in cases with
-	 * no conditions on that side of the join (where we can't use the
-	 * extended MCV to calculate conditional probability).
+	 * XXX Perhaps it'd be good to also handle case with one side only having
+	 * "regular" statistics (e.g. MCV), especially in cases with no conditions
+	 * on that side of the join (where we can't use the extended MCV to
+	 * calculate conditional probability).
 	 */
 	for (i = 0; i < ninfo; i++)
 	{
-		ListCell *lc;
+		ListCell   *lc;
 
 		RelOptInfo *rel1;
 		RelOptInfo *rel2;
@@ -3336,14 +3345,14 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 		 * We can handle three basic cases:
 		 *
 		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
-		 * can simply combine the two MCVs, possibly with additional conditions
-		 * from the relations.
+		 * can simply combine the two MCVs, possibly with additional
+		 * conditions from the relations.
 		 *
 		 * b) Extended stats on one side, regular MCV on the other side (this
 		 * means there's just one join clause / expression). It also means the
-		 * extended stats likely covers at least one extra condition, otherwise
-		 * we could just use regular statistics. We can combine the stats just
-		 * similarly to (a).
+		 * extended stats likely covers at least one extra condition,
+		 * otherwise we could just use regular statistics. We can combine the
+		 * stats just similarly to (a).
 		 *
 		 * c) No extended stats with MCV. If there are multiple join clauses,
 		 * we can try using ndistinct coefficients and do what eqjoinsel does.
@@ -3358,14 +3367,14 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 		else if (stat1 && (list_length(info[i].clauses) == 1))
 		{
 			/* try finding MCV on the other relation */
-			VariableStatData	vardata;
-			AttStatsSlot		sslot;
-			Form_pg_statistic	stats = NULL;
-			bool				have_mcvs = false;
-			Node			   *clause = linitial(info[i].clauses);
-			Node			   *expr = get_expression_for_rel(root, rel2, clause);
-			double				nd;
-			bool				isdefault;
+			VariableStatData vardata;
+			AttStatsSlot sslot;
+			Form_pg_statistic stats = NULL;
+			bool		have_mcvs = false;
+			Node	   *clause = linitial(info[i].clauses);
+			Node	   *expr = get_expression_for_rel(root, rel2, clause);
+			double		nd;
+			bool		isdefault;
 
 			examine_variable(root, expr, 0, &vardata);
 
@@ -3377,7 +3386,11 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 			{
 				/* note we allow use of nullfrac regardless of security check */
 				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
-				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+
+				/*
+				 * FIXME should this call statistic_proc_security_check like
+				 * eqjoinsel?
+				 */
 				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
 											 STATISTIC_KIND_MCV, InvalidOid,
 											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -3398,14 +3411,14 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 		else if (stat2 && (list_length(info[i].clauses) == 1))
 		{
 			/* try finding MCV on the other relation */
-			VariableStatData	vardata;
-			AttStatsSlot		sslot;
-			Form_pg_statistic	stats = NULL;
-			bool				have_mcvs = false;
-			Node			   *clause = (Node *) linitial(info[i].clauses);
-			Node			   *expr = get_expression_for_rel(root, rel1, clause);
-			double				nd;
-			bool				isdefault;
+			VariableStatData vardata;
+			AttStatsSlot sslot;
+			Form_pg_statistic stats = NULL;
+			bool		have_mcvs = false;
+			Node	   *clause = (Node *) linitial(info[i].clauses);
+			Node	   *expr = get_expression_for_rel(root, rel1, clause);
+			double		nd;
+			bool		isdefault;
 
 			examine_variable(root, expr, 0, &vardata);
 
@@ -3417,7 +3430,11 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 			{
 				/* note we allow use of nullfrac regardless of security check */
 				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
-				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+
+				/*
+				 * FIXME should this call statistic_proc_security_check like
+				 * eqjoinsel?
+				 */
 				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
 											 STATISTIC_KIND_MCV, InvalidOid,
 											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -3441,18 +3458,19 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 		/*
 		 * Now mark all the clauses for this join pair as estimated.
 		 *
-		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
-		 * simply union the two bitmaps, without the extra matching.
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can simply
+		 * union the two bitmaps, without the extra matching.
 		 */
-		foreach (lc, info->clauses)
+		foreach(lc, info->clauses)
 		{
-			Node *clause = (Node *) lfirst(lc);
-			ListCell *lc2;
+			Node	   *clause = (Node *) lfirst(lc);
+			ListCell   *lc2;
 
 			listidx = -1;
-			foreach (lc2, clauses)
+			foreach(lc2, clauses)
 			{
-				Node *clause2 = (Node *) lfirst(lc2);
+				Node	   *clause2 = (Node *) lfirst(lc2);
+
 				listidx++;
 
 				Assert(IsA(clause2, RestrictInfo));
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 49299ed9074..3169022cd6d 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2220,33 +2220,33 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	Selectivity s = 0;
 
 	/* match bitmaps and selectivity for baserel conditions (if any) */
-	List   *exprs1 = NIL,
-		   *exprs2 = NIL;
-	List   *conditions1 = NIL,
-		   *conditions2 = NIL;
-	bool   *cmatches1 = NULL,
-		   *cmatches2 = NULL;
+	List	   *exprs1 = NIL,
+			   *exprs2 = NIL;
+	List	   *conditions1 = NIL,
+			   *conditions2 = NIL;
+	bool	   *cmatches1 = NULL,
+			   *cmatches2 = NULL;
 
-	double	csel1 = 1.0,
-			csel2 = 1.0;
+	double		csel1 = 1.0,
+				csel2 = 1.0;
 
-	bool   *matches1 = NULL,
-		   *matches2 = NULL;
+	bool	   *matches1 = NULL,
+			   *matches2 = NULL;
 
 	/* estimates for the two relations */
-	double	matchfreq1,
-			unmatchfreq1,
-			otherfreq1,
-			mcvfreq1,
-			nd1,
-			totalsel1;
-
-	double 	matchfreq2,
-			unmatchfreq2,
-			otherfreq2,
-			mcvfreq2,
-			nd2,
-			totalsel2;
+	double		matchfreq1,
+				unmatchfreq1,
+				otherfreq1,
+				mcvfreq1,
+				nd1,
+				totalsel1;
+
+	double		matchfreq2,
+				unmatchfreq2,
+				otherfreq2,
+				mcvfreq2,
+				nd2,
+				totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
 	FmgrInfo   *opprocs;
@@ -2290,16 +2290,16 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	}
 
 	/*
-	 * Match bitmaps for matches between MCV elements. By default there
-	 * are no matches, so we set all items to 0.
+	 * Match bitmaps for matches between MCV elements. By default there are no
+	 * matches, so we set all items to 0.
 	 */
 	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
 	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
 
 	/*
 	 * Initialize information about clauses and how they match to the MCV
-	 * stats we picked. We do this only once before processing the lists,
-	 * so that we don't have to do that for each MCV item or so.
+	 * stats we picked. We do this only once before processing the lists, so
+	 * that we don't have to do that for each MCV item or so.
 	 */
 	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
 	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
@@ -2307,7 +2307,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
 
 	idx = 0;
-	foreach (lc, clauses)
+	foreach(lc, clauses)
 	{
 		Node	   *clause = (Node *) lfirst(lc);
 		OpExpr	   *opexpr;
@@ -2319,9 +2319,9 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		/*
 		 * Strip the RestrictInfo node, get the actual clause.
 		 *
-		 * XXX Not sure if we need to care about removing other node types
-		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
-		 * matches this, but maybe we need to relax it?
+		 * XXX Not sure if we need to care about removing other node types too
+		 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches
+		 * this, but maybe we need to relax it?
 		 */
 		if (IsA(clause, RestrictInfo))
 			clause = (Node *) ((RestrictInfo *) clause)->clause;
@@ -2345,7 +2345,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		if ((bms_singleton_member(relids1) == rel1->relid) &&
 			(bms_singleton_member(relids2) == rel2->relid))
 		{
-			Oid		collid;
+			Oid			collid;
 
 			indexes1[idx] = mcv_match_expression(expr1,
 												 stat1->keys, stat1->exprs,
@@ -2361,7 +2361,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		else if ((bms_singleton_member(relids2) == rel1->relid) &&
 				 (bms_singleton_member(relids1) == rel2->relid))
 		{
-			Oid		collid;
+			Oid			collid;
 
 			indexes1[idx] = mcv_match_expression(expr2,
 												 stat2->keys, stat2->exprs,
@@ -2390,21 +2390,22 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	/*
 	 * Match items between the two MCV lists.
 	 *
-	 * We don't know if the join conditions match all attributes in the MCV, the
-	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
-	 * So there may be multiple matches on either side. So we can't optimize by
-	 * aborting the inner loop after the first match, etc.
+	 * We don't know if the join conditions match all attributes in the MCV,
+	 * the overlap may be just on a subset of attributes, e.g. (a,b,c) vs.
+	 * (b,c,d). So there may be multiple matches on either side. So we can't
+	 * optimize by aborting the inner loop after the first match, etc.
 	 *
-	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 * XXX We can skip the items eliminated by the base restrictions, of
+	 * course.
 	 *
 	 * XXX We might optimize this in two ways. We might sort the MCV items on
 	 * both sides using the "join" attributes, and then perform something like
-	 * merge join. Or we might calculate a hash from the join columns, and then
-	 * compare this (to eliminate the most expensive equality functions).
+	 * merge join. Or we might calculate a hash from the join columns, and
+	 * then compare this (to eliminate the most expensive equality functions).
 	 */
 	for (i = 0; i < mcv1->nitems; i++)
 	{
-		bool	has_nulls;
+		bool		has_nulls;
 
 		/* skip items eliminated by restrictions on rel1 */
 		if (cmatches1 && !cmatches1[i])
@@ -2436,26 +2437,27 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				continue;
 
 			/*
-			 * XXX We can't skip based on existing matches2 value, because there
-			 * may be duplicates in the first MCV.
+			 * XXX We can't skip based on existing matches2 value, because
+			 * there may be duplicates in the first MCV.
 			 */
 
 			/*
-			 * Evaluate if all the join clauses match between the two MCV items.
+			 * Evaluate if all the join clauses match between the two MCV
+			 * items.
 			 *
-			 * XXX We might optimize the order of evaluation, using the costs of
-			 * operator functions for individual columns. It does depend on the
-			 * number of distinct values, etc.
+			 * XXX We might optimize the order of evaluation, using the costs
+			 * of operator functions for individual columns. It does depend on
+			 * the number of distinct values, etc.
 			 */
 			idx = 0;
-			foreach (lc, clauses)
+			foreach(lc, clauses)
 			{
-				bool	match;
-				int		index1 = indexes1[idx],
-						index2 = indexes2[idx];
-				Datum	value1,
-						value2;
-				bool	reverse_args = reverse[idx];
+				bool		match;
+				int			index1 = indexes1[idx],
+							index2 = indexes2[idx];
+				Datum		value1,
+							value2;
+				bool		reverse_args = reverse[idx];
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
@@ -2466,8 +2468,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 					value2 = mcv2->items[j].values[index2];
 
 					/*
-					 * Careful about order of parameters. For same-type equality
-					 * that should not matter, but easy enough.
+					 * Careful about order of parameters. For same-type
+					 * equality that should not matter, but easy enough.
 					 *
 					 * FIXME Use appropriate collation.
 					 */
@@ -2537,10 +2539,10 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	/*
 	 * Correction for MCV parts eliminated by the conditions.
 	 *
-	 * We need to be careful about cases where conditions eliminated all
-	 * the MCV items. We must not divide by 0.0, because that would either
-	 * produce bogus value or trigger division by zero. Instead we simply
-	 * set the selectivity to 0.0, because there can't be any matches.
+	 * We need to be careful about cases where conditions eliminated all the
+	 * MCV items. We must not divide by 0.0, because that would either produce
+	 * bogus value or trigger division by zero. Instead we simply set the
+	 * selectivity to 0.0, because there can't be any matches.
 	 */
 	if ((matchfreq1 + unmatchfreq1) > 0)
 		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
@@ -2560,23 +2562,23 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * Consider the part of the data not represented by the MCV lists.
 	 *
 	 * XXX this is a bit bogus, because we don't know what fraction of
-	 * distinct combinations is covered by the MCV list (we're only
-	 * dealing with some of the columns), so we can't use the same
-	 * formular as eqjoinsel_inner exactly. We just use the estimates
-	 * for the whole table - this is likely an overestimate, because
-	 * (a) items may repeat in the MCV list, if it has more columns,
-	 * and (b) some of the combinations may be present in non-MCV data.
+	 * distinct combinations is covered by the MCV list (we're only dealing
+	 * with some of the columns), so we can't use the same formular as
+	 * eqjoinsel_inner exactly. We just use the estimates for the whole table
+	 * - this is likely an overestimate, because (a) items may repeat in the
+	 * MCV list, if it has more columns, and (b) some of the combinations may
+	 * be present in non-MCV data.
 	 *
-	 * Moreover, we need to look at the conditions. For now we simply
-	 * assume the conditions affect the distinct groups, and use that.
+	 * Moreover, we need to look at the conditions. For now we simply assume
+	 * the conditions affect the distinct groups, and use that.
 	 *
-	 * XXX We might calculate the number of distinct groups in the MCV,
-	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
-	 * which are the possible extreme values, assuming the estimates
-	 * are accurate. Maybe mean or geometric mean would work?
+	 * XXX We might calculate the number of distinct groups in the MCV, and
+	 * then use something between (nd1 - distinct(MCV)) and (nd1), which are
+	 * the possible extreme values, assuming the estimates are accurate. Maybe
+	 * mean or geometric mean would work?
 	 *
-	 * XXX Not sure multiplying ndistinct with probabilities is good.
-	 * Maybe we should do something more like estimate_num_groups?
+	 * XXX Not sure multiplying ndistinct with probabilities is good. Maybe we
+	 * should do something more like estimate_num_groups?
 	 */
 	nd1 *= csel1;
 	nd2 *= csel2;
@@ -2585,21 +2587,21 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
 	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
 
-//	if (nd2 > mcvb->nitems)
-//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
-//	if (nd2 > nmatches)
-//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
-//			(nd2 - nmatches);
+/* 	if (nd2 > mcvb->nitems) */
+/* 		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems); */
+/* 	if (nd2 > nmatches) */
+/* 		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / */
+/* 			(nd2 - nmatches); */
 
 	totalsel2 = s;
 	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
 	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
 
-//	if (nd1 > mcva->nitems)
-//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
-//	if (nd1 > nmatches)
-//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
-//			(nd1 - nmatches);
+/* 	if (nd1 > mcva->nitems) */
+/* 		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems); */
+/* 	if (nd1 > nmatches) */
+/* 		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / */
+/* 			(nd1 - nmatches); */
 
 	s = Min(totalsel1, totalsel2);
 
@@ -2627,31 +2629,31 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 	Selectivity s = 0;
 
 	/* match bitmaps and selectivity for baserel conditions (if any) */
-	List   *conditions = NIL;
-	bool   *cmatches = NULL;
+	List	   *conditions = NIL;
+	bool	   *cmatches = NULL;
 
-	double	csel = 1.0;
+	double		csel = 1.0;
 
-	bool   *matches1 = NULL,
-		   *matches2 = NULL;
+	bool	   *matches1 = NULL,
+			   *matches2 = NULL;
 
 	/* estimates for the two sides */
-	double	matchfreq1,
-			unmatchfreq1,
-			otherfreq1,
-			mcvfreq1,
-			nd1,
-			totalsel1;
-
-	double 	matchfreq2,
-			unmatchfreq2,
-			otherfreq2,
-			mcvfreq2,
-			nd2,
-			totalsel2;
-
-	List   *exprs1 = NIL,
-		   *exprs2 = NIL;
+	double		matchfreq1,
+				unmatchfreq1,
+				otherfreq1,
+				mcvfreq1,
+				nd1,
+				totalsel1;
+
+	double		matchfreq2,
+				unmatchfreq2,
+				otherfreq2,
+				mcvfreq2,
+				nd2,
+				totalsel2;
+
+	List	   *exprs1 = NIL,
+			   *exprs2 = NIL;
 
 	/* info about clauses and how they match to MCV stats */
 	FmgrInfo	opproc;
@@ -2677,14 +2679,14 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 	if (conditions)
 	{
 		cmatches = mcv_get_match_bitmap(root, conditions,
-										 stat->keys, stat->exprs,
-										 mcv, false);
+										stat->keys, stat->exprs,
+										mcv, false);
 		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
 	}
 
 	/*
-	 * Match bitmaps for matches between MCV elements. By default there
-	 * are no matches, so we set all items to 0.
+	 * Match bitmaps for matches between MCV elements. By default there are no
+	 * matches, so we set all items to 0.
 	 */
 	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
 
@@ -2693,8 +2695,8 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 
 	/*
 	 * Initialize information about the clause and how it matches to the
-	 * extended stats we picked. We do this only once before processing
-	 * the lists, so that we don't have to do that for each item or so.
+	 * extended stats we picked. We do this only once before processing the
+	 * lists, so that we don't have to do that for each item or so.
 	 */
 	{
 		OpExpr	   *opexpr;
@@ -2706,9 +2708,9 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		/*
 		 * Strip the RestrictInfo node, get the actual clause.
 		 *
-		 * XXX Not sure if we need to care about removing other node types
-		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
-		 * matches this, but maybe we need to relax it?
+		 * XXX Not sure if we need to care about removing other node types too
+		 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches
+		 * this, but maybe we need to relax it?
 		 */
 		if (IsA(clause, RestrictInfo))
 			clause = (Node *) ((RestrictInfo *) clause)->clause;
@@ -2731,7 +2733,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 
 		if (bms_singleton_member(relids1) == rel->relid)
 		{
-			Oid		collid;
+			Oid			collid;
 
 			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
 										 &collid);
@@ -2742,7 +2744,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		}
 		else if (bms_singleton_member(relids2) == rel->relid)
 		{
-			Oid		collid;
+			Oid			collid;
 
 			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
 										 &collid);
@@ -2762,17 +2764,18 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 	/*
 	 * Match items between the two MCV lists.
 	 *
-	 * We don't know if the join conditions match all attributes in the MCV, the
-	 * overlap may be just on a subset of attributes, e.g. (a,b,c) vs. (b,c,d).
-	 * So there may be multiple matches on either side. So we can't optimize by
-	 * aborting the inner loop after the first match, etc.
+	 * We don't know if the join conditions match all attributes in the MCV,
+	 * the overlap may be just on a subset of attributes, e.g. (a,b,c) vs.
+	 * (b,c,d). So there may be multiple matches on either side. So we can't
+	 * optimize by aborting the inner loop after the first match, etc.
 	 *
-	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 * XXX We can skip the items eliminated by the base restrictions, of
+	 * course.
 	 *
 	 * XXX We might optimize this in two ways. We might sort the MCV items on
 	 * both sides using the "join" attributes, and then perform something like
-	 * merge join. Or we might calculate a hash from the join columns, and then
-	 * compare this (to eliminate the most expensive equality functions).
+	 * merge join. Or we might calculate a hash from the join columns, and
+	 * then compare this (to eliminate the most expensive equality functions).
 	 */
 	for (i = 0; i < mcv->nitems; i++)
 	{
@@ -2790,9 +2793,9 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		/* find matches in the second MCV list */
 		for (j = 0; j < sslot->nvalues; j++)
 		{
-			bool	match;
-			Datum	value1 = mcv->items[i].values[index];
-			Datum	value2 = sslot->values[j];
+			bool		match;
+			Datum		value1 = mcv->items[i].values[index];
+			Datum		value2 = sslot->values[j];
 
 			/*
 			 * Evaluate the join clause between the two MCV lists. We don't
@@ -2800,8 +2803,8 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 			 * NULL in the extended statistics earlier, and the simple MCV
 			 * does not contain NULL values.
 			 *
-			 * Careful about order of parameters. For same-type equality
-			 * that should not matter, but easy enough.
+			 * Careful about order of parameters. For same-type equality that
+			 * should not matter, but easy enough.
 			 *
 			 * FIXME Use appropriate collation.
 			 */
@@ -2821,8 +2824,8 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 				s += mcv->items[i].frequency * sslot->numbers[j];
 
 				/*
-				 * We know there can be just a single match in the regular
-				 * MCV list, so we can abort the inner loop.
+				 * We know there can be just a single match in the regular MCV
+				 * list, so we can abort the inner loop.
 				 */
 				break;
 			}
@@ -2864,10 +2867,10 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 	/*
 	 * Correction for MCV parts eliminated by the conditions.
 	 *
-	 * We need to be careful about cases where conditions eliminated all
-	 * the MCV items. We must not divide by 0.0, because that would either
-	 * produce bogus value or trigger division by zero. Instead we simply
-	 * set the selectivity to 0.0, because there can't be any matches.
+	 * We need to be careful about cases where conditions eliminated all the
+	 * MCV items. We must not divide by 0.0, because that would either produce
+	 * bogus value or trigger division by zero. Instead we simply set the
+	 * selectivity to 0.0, because there can't be any matches.
 	 */
 	if ((matchfreq1 + unmatchfreq1) > 0)
 		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
@@ -2887,23 +2890,23 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 	 * Consider the part of the data not represented by the MCV lists.
 	 *
 	 * XXX this is a bit bogus, because we don't know what fraction of
-	 * distinct combinations is covered by the MCV list (we're only
-	 * dealing with some of the columns), so we can't use the same
-	 * formular as eqjoinsel_inner exactly. We just use the estimates
-	 * for the whole table - this is likely an overestimate, because
-	 * (a) items may repeat in the MCV list, if it has more columns,
-	 * and (b) some of the combinations may be present in non-MCV data.
+	 * distinct combinations is covered by the MCV list (we're only dealing
+	 * with some of the columns), so we can't use the same formular as
+	 * eqjoinsel_inner exactly. We just use the estimates for the whole table
+	 * - this is likely an overestimate, because (a) items may repeat in the
+	 * MCV list, if it has more columns, and (b) some of the combinations may
+	 * be present in non-MCV data.
 	 *
-	 * Moreover, we need to look at the conditions. For now we simply
-	 * assume the conditions affect the distinct groups, and use that.
+	 * Moreover, we need to look at the conditions. For now we simply assume
+	 * the conditions affect the distinct groups, and use that.
 	 *
-	 * XXX We might calculate the number of distinct groups in the MCV,
-	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
-	 * which are the possible extreme values, assuming the estimates
-	 * are accurate. Maybe mean or geometric mean would work?
+	 * XXX We might calculate the number of distinct groups in the MCV, and
+	 * then use something between (nd1 - distinct(MCV)) and (nd1), which are
+	 * the possible extreme values, assuming the estimates are accurate. Maybe
+	 * mean or geometric mean would work?
 	 *
-	 * XXX Not sure multiplying ndistinct with probabilities is good.
-	 * Maybe we should do something more like estimate_num_groups?
+	 * XXX Not sure multiplying ndistinct with probabilities is good. Maybe we
+	 * should do something more like estimate_num_groups?
 	 */
 	nd1 *= csel;
 
@@ -2911,21 +2914,21 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
 	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
 
-//	if (nd2 > mcvb->nitems)
-//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
-//	if (nd2 > nmatches)
-//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
-//			(nd2 - nmatches);
+/* 	if (nd2 > mcvb->nitems) */
+/* 		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems); */
+/* 	if (nd2 > nmatches) */
+/* 		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / */
+/* 			(nd2 - nmatches); */
 
 	totalsel2 = s;
 	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
 	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
 
-//	if (nd1 > mcva->nitems)
-//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
-//	if (nd1 > nmatches)
-//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
-//			(nd1 - nmatches);
+/* 	if (nd1 > mcva->nitems) */
+/* 		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems); */
+/* 	if (nd1 > nmatches) */
+/* 		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / */
+/* 			(nd1 - nmatches); */
 
 	s = Min(totalsel1, totalsel2);
 
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index f156fda555e..b1f30dfe2ee 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,7 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
-#include "utils/lsyscache.h"		/* XXX is this needed? */
+#include "utils/lsyscache.h"	/* XXX is this needed? */
 #include "utils/sortsupport.h"
 
 typedef struct
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 60b222028d8..48ab718304e 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -128,7 +128,7 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
 extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
-										   Bitmapset *attnums, List *exprs);
+												   Bitmapset *attnums, List *exprs);
 
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 									   JoinType jointype, SpecialJoinInfo *sjinfo,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61ad417cde6..f7ba7901a27 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1289,6 +1289,7 @@ JoinCostWorkspace
 JoinDomain
 JoinExpr
 JoinHashEntry
+JoinPairInfo
 JoinPath
 JoinPathExtraData
 JoinState
-- 
2.45.2

v20240617-0004-Remove-estimiatedcluases-and-varRelid-argu.patchtext/x-patch; charset=UTF-8; name=v20240617-0004-Remove-estimiatedcluases-and-varRelid-argu.patchDownload

From fdb8c9afda45a45b6a5922e14c711f6c1dd201d0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 14:11:50 +0200
Subject: [PATCH v20240617 04/56] Remove estimiatedcluases and varRelid
 arguments

comments and Assert around the changes provides more information.
---
 src/backend/optimizer/path/clausesel.c  | 16 ++++++++++------
 src/backend/statistics/extended_stats.c | 24 ++++++++++--------------
 src/include/statistics/statistics.h     |  4 +---
 3 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 871d73e3b4f..a0ab95553bc 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -209,14 +209,18 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * to detect when this makes sense, but we can check that there are join
 	 * clauses, and that at least some of the rels have stats.
 	 *
-	 * XXX Isn't this mutually exclusive with the preceding block which
-	 * calculates estimates for a single relation?
+	 * rel != NULL can't grantee the clause is not a join clause, for example
+	 * t1 left join t2 ON t1.a = 3, but it can grantee we can't use extended
+	 * statistics for estimation since it has only 1 relid.
+	 *
+	 * XXX: so we can grantee estimatedclauses == NULL now, so estimatedclauses
+	 * in statext_try_join_estimates is removed.
 	 */
-	if (use_extended_stats &&
-		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
-								   estimatedclauses))
+	if (use_extended_stats && rel == NULL &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
 	{
-		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+		Assert(varRelid == 0);
+		s1 *= statext_clauselist_join_selectivity(root, clauses,
 												  jointype, sjinfo,
 												  &estimatedclauses);
 	}
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 25b4d486a09..71e47748d23 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2829,8 +2829,7 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
  * on the conditions, to make sure it can be estimated using extended stats.
  */
 static bool
-statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
-								 int varRelid, SpecialJoinInfo *sjinfo)
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInfo *sjinfo)
 {
 	Oid			oprsel;
 	RestrictInfo *rinfo;
@@ -2842,7 +2841,9 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
 	 *
 	 * XXX See treat_as_join_clause.
 	 */
-	if ((varRelid != 0) || (sjinfo == NULL))
+
+	/* duplicated with statext_try_join_estimates */
+	if (sjinfo == NULL)
 		return false;
 
 	/* XXX Can we rely on always getting RestrictInfo here? */
@@ -2928,8 +2929,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
  */
 bool
 statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
-						   JoinType jointype, SpecialJoinInfo *sjinfo,
-						   Bitmapset *estimatedclauses)
+						   JoinType jointype, SpecialJoinInfo *sjinfo)
 {
 	int			listidx;
 	int			k;
@@ -2968,15 +2968,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 		/* needs to happen before skipping any clauses */
 		listidx++;
 
-		/* Skip clauses that were already estimated. */
-		if (bms_is_member(listidx, estimatedclauses))
-			continue;
-
 		/*
 		 * Skip clauses that are not join clauses or that we don't know how to
 		 * handle estimate using extended statistics.
 		 */
-		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause, sjinfo))
 			continue;
 
 		/*
@@ -3048,7 +3044,7 @@ typedef struct JoinPairInfo
  * with F_EQJOINSEL selectivity function at the moment).
  */
 static JoinPairInfo *
-statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+statext_build_join_pairs(PlannerInfo *root, List *clauses,
 						 JoinType jointype, SpecialJoinInfo *sjinfo,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
@@ -3084,7 +3080,7 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
 		 * moment we support just (Expr op Expr) clauses with each side
 		 * referencing just a single relation).
 		 */
-		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause, sjinfo))
 			continue;
 
 		/* statext_is_supported_join_clause guarantees RestrictInfo */
@@ -3282,7 +3278,7 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  * XXX Isn't the preceding comment stale? We skip the optimization, no?
  */
 Selectivity
-statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 									JoinType jointype, SpecialJoinInfo *sjinfo,
 									Bitmapset **estimatedclauses)
 {
@@ -3297,7 +3293,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRel
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+	info = statext_build_join_pairs(root, clauses, jointype, sjinfo,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 48ab718304e..4bd3104a2b7 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -131,11 +131,9 @@ extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo
 												   Bitmapset *attnums, List *exprs);
 
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
-									   JoinType jointype, SpecialJoinInfo *sjinfo,
-									   Bitmapset *estimatedclauses);
+									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
 extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   int varRelid,
 													   JoinType jointype, SpecialJoinInfo *sjinfo,
 													   Bitmapset **estimatedclauses);
 
-- 
2.45.2

v20240617-0005-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0005-review.patchDownload

From 4369a6afa3437dec03a9237cc836c4205d5c56a1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 14:29:11 +0200
Subject: [PATCH v20240617 05/56] review

---
 src/backend/optimizer/path/clausesel.c  | 5 +++++
 src/backend/statistics/extended_stats.c | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index a0ab95553bc..00d74e21bdd 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -213,8 +213,13 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * t1 left join t2 ON t1.a = 3, but it can grantee we can't use extended
 	 * statistics for estimation since it has only 1 relid.
 	 *
+	 * XXX Is that actually behaving like that? Won't the (t1.a=3) be turned
+	 * into a regular clause? I haven't tried, though.
+	 *
 	 * XXX: so we can grantee estimatedclauses == NULL now, so estimatedclauses
 	 * in statext_try_join_estimates is removed.
+	 *
+	 * XXX Maybe remove the comment and add an assert estimatedclauses==NULL.
 	 */
 	if (use_extended_stats && rel == NULL &&
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 71e47748d23..69a638a18b7 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2837,6 +2837,8 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInf
 	ListCell   *lc;
 
 	/*
+	 * XXX isn't this comment stale after removal of varRelid?
+	 *
 	 * evaluation as a restriction clause, either at scan node or forced
 	 *
 	 * XXX See treat_as_join_clause.
-- 
2.45.2

v20240617-0006-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0006-pgindent.patchDownload

From a3fd6b97fb0bd31d873e7a843d99f593327c0b84 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 14:30:05 +0200
Subject: [PATCH v20240617 06/56] pgindent

---
 src/backend/optimizer/path/clausesel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 00d74e21bdd..ec7121be3d1 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -216,8 +216,8 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * XXX Is that actually behaving like that? Won't the (t1.a=3) be turned
 	 * into a regular clause? I haven't tried, though.
 	 *
-	 * XXX: so we can grantee estimatedclauses == NULL now, so estimatedclauses
-	 * in statext_try_join_estimates is removed.
+	 * XXX: so we can grantee estimatedclauses == NULL now, so
+	 * estimatedclauses in statext_try_join_estimates is removed.
 	 *
 	 * XXX Maybe remove the comment and add an assert estimatedclauses==NULL.
 	 */
-- 
2.45.2

v20240617-0007-Remove-SpecialJoinInfo-sjinfo-argument.patchtext/x-patch; charset=UTF-8; name=v20240617-0007-Remove-SpecialJoinInfo-sjinfo-argument.patchDownload

From 2a277a9fe8b2031f66fdb4fd8c2e37a2ed525d32 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 09:58:18 +0800
Subject: [PATCH v20240617 07/56] Remove SpecialJoinInfo *sjinfo argument

It was passed down to statext_is_supported_join_clause where it is
used for checking if it is NULL.  However it has been checked before
in statext_try_join_estimates.
---
 src/backend/optimizer/path/clausesel.c  |  3 ++-
 src/backend/statistics/extended_stats.c | 16 ++++++----------
 src/include/statistics/statistics.h     |  3 +--
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index ec7121be3d1..500a8858162 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -225,8 +225,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
 	{
 		Assert(varRelid == 0);
+		Assert(sjinfo != NULL);
 		s1 *= statext_clauselist_join_selectivity(root, clauses,
-												  jointype, sjinfo,
+												  jointype,
 												  &estimatedclauses);
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 69a638a18b7..c38ad6d17c5 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2829,7 +2829,7 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
  * on the conditions, to make sure it can be estimated using extended stats.
  */
 static bool
-statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInfo *sjinfo)
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 {
 	Oid			oprsel;
 	RestrictInfo *rinfo;
@@ -2844,10 +2844,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause, SpecialJoinInf
 	 * XXX See treat_as_join_clause.
 	 */
 
-	/* duplicated with statext_try_join_estimates */
-	if (sjinfo == NULL)
-		return false;
-
 	/* XXX Can we rely on always getting RestrictInfo here? */
 	if (!IsA(clause, RestrictInfo))
 		return false;
@@ -2974,7 +2970,7 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 		 * Skip clauses that are not join clauses or that we don't know how to
 		 * handle estimate using extended statistics.
 		 */
-		if (!statext_is_supported_join_clause(root, clause, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause))
 			continue;
 
 		/*
@@ -3047,7 +3043,7 @@ typedef struct JoinPairInfo
  */
 static JoinPairInfo *
 statext_build_join_pairs(PlannerInfo *root, List *clauses,
-						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 JoinType jointype,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
 	int			cnt;
@@ -3082,7 +3078,7 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 		 * moment we support just (Expr op Expr) clauses with each side
 		 * referencing just a single relation).
 		 */
-		if (!statext_is_supported_join_clause(root, clause, sjinfo))
+		if (!statext_is_supported_join_clause(root, clause))
 			continue;
 
 		/* statext_is_supported_join_clause guarantees RestrictInfo */
@@ -3281,7 +3277,7 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-									JoinType jointype, SpecialJoinInfo *sjinfo,
+									JoinType jointype,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
@@ -3295,7 +3291,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, jointype, sjinfo,
+	info = statext_build_join_pairs(root, clauses, jointype,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 4bd3104a2b7..c682a6fb0e8 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -134,7 +134,6 @@ extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int var
 									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
 extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   JoinType jointype, SpecialJoinInfo *sjinfo,
-													   Bitmapset **estimatedclauses);
+													   JoinType jointype, Bitmapset **estimatedclauses);
 
 #endif							/* STATISTICS_H */
-- 
2.45.2

v20240617-0008-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0008-review.patchDownload

From 499082afcf5a2054a13ed331adf8d572787d0eaa Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:07:09 +0200
Subject: [PATCH v20240617 08/56] review

---
 src/backend/optimizer/path/clausesel.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 500a8858162..824a042c54a 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -220,6 +220,13 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * estimatedclauses in statext_try_join_estimates is removed.
 	 *
 	 * XXX Maybe remove the comment and add an assert estimatedclauses==NULL.
+	 *
+	 * XXX I'm not sure removing the sjinfo is a good idea. Yes, the current
+	 * code does not actually use it (AFAICS), but selfuncs.c always passes
+	 * both jointype+sjinfo, so maybe we should do that too ... What happens
+	 * if we end up wanting to call an existing selfuncs function that needs
+	 * sjinfo in the future? Say because we want to call the regular join
+	 * estimator, and then apply some "correction" to the result?
 	 */
 	if (use_extended_stats && rel == NULL &&
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
-- 
2.45.2

v20240617-0009-Remove-joinType-argument.patchtext/x-patch; charset=UTF-8; name=v20240617-0009-Remove-joinType-argument.patchDownload

From b26bc2c6418faad9017558eadbd23f3f879cea06 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 10:07:00 +0800
Subject: [PATCH v20240617 09/56] Remove joinType argument.

---
 src/backend/optimizer/path/clausesel.c  | 1 -
 src/backend/statistics/extended_stats.c | 4 +---
 src/include/statistics/statistics.h     | 3 +--
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 824a042c54a..b130d88c5e8 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -234,7 +234,6 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		Assert(varRelid == 0);
 		Assert(sjinfo != NULL);
 		s1 *= statext_clauselist_join_selectivity(root, clauses,
-												  jointype,
 												  &estimatedclauses);
 	}
 
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index c38ad6d17c5..f6f416ac213 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3043,7 +3043,6 @@ typedef struct JoinPairInfo
  */
 static JoinPairInfo *
 statext_build_join_pairs(PlannerInfo *root, List *clauses,
-						 JoinType jointype,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
 	int			cnt;
@@ -3277,7 +3276,6 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-									JoinType jointype,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
@@ -3291,7 +3289,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		return 1.0;
 
 	/* extract pairs of joined relations from the list of clauses */
-	info = statext_build_join_pairs(root, clauses, jointype,
+	info = statext_build_join_pairs(root, clauses,
 									*estimatedclauses, &ninfo);
 
 	/* no useful join pairs */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index c682a6fb0e8..531feef85a4 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -133,7 +133,6 @@ extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 									   JoinType jointype, SpecialJoinInfo *sjinfo);
 
-extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
-													   JoinType jointype, Bitmapset **estimatedclauses);
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, Bitmapset **estimatedclauses);
 
 #endif							/* STATISTICS_H */
-- 
2.45.2

v20240617-0010-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0010-review.patchDownload

From b34667cc0cc344f8af762da45785d6c834d26179 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:09:42 +0200
Subject: [PATCH v20240617 10/56] review

---
 src/backend/optimizer/path/clausesel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index b130d88c5e8..c2a56341095 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -227,6 +227,8 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * if we end up wanting to call an existing selfuncs function that needs
 	 * sjinfo in the future? Say because we want to call the regular join
 	 * estimator, and then apply some "correction" to the result?
+	 *
+	 * XXX Same thing for the joinType removal, I guess.
 	 */
 	if (use_extended_stats && rel == NULL &&
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
-- 
2.45.2

v20240617-0011-use-the-pre-calculated-RestrictInfo-left-r.patchtext/x-patch; charset=UTF-8; name=v20240617-0011-use-the-pre-calculated-RestrictInfo-left-r.patchDownload

From a296c9876e510ed27056a1acee24afb1ed90e146 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:13:21 +0200
Subject: [PATCH v20240617 11/56] use the pre-calculated
 RestrictInfo->left|right_relids

It should has better performance than pull_varnos and easier to
understand.
---
 src/backend/statistics/extended_stats.c | 41 ++++++-------------------
 1 file changed, 9 insertions(+), 32 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index f6f416ac213..98d579578c0 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2831,10 +2831,10 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
 static bool
 statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 {
-	Oid			oprsel;
-	RestrictInfo *rinfo;
-	OpExpr	   *opclause;
-	ListCell   *lc;
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	int				left_relid, right_relid;
 
 	/*
 	 * XXX isn't this comment stale after removal of varRelid?
@@ -2852,10 +2852,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	rinfo = (RestrictInfo *) clause;
 	clause = (Node *) rinfo->clause;
 
-	/* is it referencing multiple relations? */
-	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
-		return false;
-
 	/* we only support simple operator clauses for now */
 	if (!is_opclause(clause))
 		return false;
@@ -2878,8 +2874,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	 * which is still technically an opclause, but we can't match it to
 	 * extended statistics in a simple way.
 	 *
-	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
-	 *
 	 * XXX Also check it's not expression on system attributes, which we don't
 	 * allow in extended statistics.
 	 *
@@ -2888,30 +2882,13 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	 * something like that. We could do "cartesian product" of the MCV stats
 	 * and restrict it using this condition.
 	 */
-	foreach(lc, opclause->args)
-	{
-		Bitmapset  *varnos = NULL;
-		Node	   *expr = (Node *) lfirst(lc);
 
-		varnos = pull_varnos(root, expr);
-
-		/*
-		 * No argument should reference more than just one relation.
-		 *
-		 * This effectively means each side references just two relations. If
-		 * there's no relation on one side, it's a Const, and the other side
-		 * has to be either Const or Expr with a single rel. In which case it
-		 * can't be a join clause.
-		 */
-		if (bms_num_members(varnos) > 1)
-			return false;
+	if (!bms_get_singleton_member(rinfo->left_relids, &left_relid) ||
+		!bms_get_singleton_member(rinfo->right_relids, &right_relid))
+		return false;
 
-		/*
-		 * XXX Maybe check that both relations have extended statistics (no
-		 * point in considering the clause as useful without it). But we'll do
-		 * that check later anyway, so keep this cheap.
-		 */
-	}
+	if (left_relid == right_relid)
+		return false;
 
 	return true;
 }
-- 
2.45.2

v20240617-0012-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0012-review.patchDownload

From cb513c8b07938c966c8460bacae689497081df21 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:15:30 +0200
Subject: [PATCH v20240617 12/56] review

---
 src/backend/statistics/extended_stats.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 98d579578c0..df33d25ebfd 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2852,6 +2852,11 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	rinfo = (RestrictInfo *) clause;
 	clause = (Node *) rinfo->clause;
 
+	/*
+	 * XXX why not to retain the BMS_MULTIPLE check on clause_relids, seems
+	 * cheap so maybe we could do it before the more expensive stuff?
+	 */
+
 	/* we only support simple operator clauses for now */
 	if (!is_opclause(clause))
 		return false;
-- 
2.45.2

v20240617-0013-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0013-pgindent.patchDownload

From a50284f586342907db3dc6963eb1a5bb694c289a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:16:17 +0200
Subject: [PATCH v20240617 13/56] pgindent

---
 src/backend/statistics/extended_stats.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index df33d25ebfd..85839e1104c 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2831,10 +2831,11 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
 static bool
 statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 {
-	Oid	oprsel;
-	RestrictInfo   *rinfo;
-	OpExpr		   *opclause;
-	int				left_relid, right_relid;
+	Oid			oprsel;
+	RestrictInfo *rinfo;
+	OpExpr	   *opclause;
+	int			left_relid,
+				right_relid;
 
 	/*
 	 * XXX isn't this comment stale after removal of varRelid?
-- 
2.45.2

v20240617-0038-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0038-pgindent.patchDownload

From 140f7e99050521c37beb15fdbf183730f6f94346 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:40:20 +0200
Subject: [PATCH v20240617 38/56] pgindent

---
 src/backend/statistics/extended_stats.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 0b0d0ce33b9..e7caa112ee7 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3216,8 +3216,8 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 static Node *
 get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 {
-	OpExpr *opexpr;
-	Node   *expr;
+	OpExpr	   *opexpr;
+	Node	   *expr;
 	RestrictInfo *rinfo = (RestrictInfo *) clause;
 
 	Assert(IsA(clause, RestrictInfo));
-- 
2.45.2

v20240617-0014-Fast-path-for-general-clauselist_selectivi.patchtext/x-patch; charset=UTF-8; name=v20240617-0014-Fast-path-for-general-clauselist_selectivi.patchDownload

From 009352da6fa23e78d2120b1bf50cf1f887592667 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 14:47:54 +0800
Subject: [PATCH v20240617 14/56] Fast path for general clauselist_selectivity

It should be common in the most queries like

SELECT * FROM t1, t2 WHERE t1.a = t2.a AND t1.a > 3;

clauses == NULL at the scan level of t2.
---
 src/backend/optimizer/path/clausesel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index c2a56341095..390668ca4c4 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -134,6 +134,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	/* skip expensive processing when estimating a single clause */
 	bool		single_clause_optimization = true;
 
+	if (clauses == NULL)
+		return 1.0;
+
 	/*
 	 * Disable the single-clause optimization when estimating a join clause.
 	 *
-- 
2.45.2

v20240617-0015-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0015-review.patchDownload

From b47b3c135f5e2c3972013aeb019507eece655713 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:35:08 +0200
Subject: [PATCH v20240617 15/56] review

---
 src/backend/optimizer/path/clausesel.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 390668ca4c4..206fe627e58 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -134,6 +134,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	/* skip expensive processing when estimating a single clause */
 	bool		single_clause_optimization = true;
 
+	/* XXX Does this actually make meaningful difference? */
 	if (clauses == NULL)
 		return 1.0;
 
-- 
2.45.2

v20240617-0016-bms_is_empty-is-more-effective-than-bms_nu.patchtext/x-patch; charset=UTF-8; name=v20240617-0016-bms_is_empty-is-more-effective-than-bms_nu.patchDownload

From 1205bf4bf8cc843aabaccb9ca4923b2d60772fa6 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 2 Apr 2024 14:53:30 +0800
Subject: [PATCH v20240617 16/56] bms_is_empty is more effective than
 bms_num_members(b) == 0.

---
 src/backend/statistics/extended_stats.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 85839e1104c..3e7a133c047 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2967,7 +2967,7 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 	}
 
 	/* no join clauses found, don't try applying extended stats */
-	if (bms_num_members(relids) == 0)
+	if (bms_is_empty(relids))
 		return false;
 
 	/*
-- 
2.45.2

v20240617-0017-a-branch-of-updates-around-JoinPairInfo.patchtext/x-patch; charset=UTF-8; name=v20240617-0017-a-branch-of-updates-around-JoinPairInfo.patchDownload

From 26da69fca825d041485dd8ae85ccbd5e55ad2c33 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:41:28 +0200
Subject: [PATCH v20240617 17/56] a branch of updates around JoinPairInfo

1. rename rels to relids while the "rels" may reference to list of
RelOptInfo or Relids. but the later one reference to Relids all the
time.

2. Store RestrictInfo to JoinPairInfo.clauses so that we can reuse
the left_relids, right_relids which will save us from calling
pull_varnos.

3. create bms_nth_member function in bitmapset.c and use it
extract_relation_info, the function name is self-documented.

4. pfree the JoinPairInfo array when we are done with that.
---
 src/backend/nodes/bitmapset.c           | 18 +++++++++++
 src/backend/statistics/extended_stats.c | 43 ++++++++++++-------------
 src/backend/statistics/mcv.c            | 35 ++++++--------------
 src/include/nodes/bitmapset.h           |  1 +
 4 files changed, 48 insertions(+), 49 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index cd05c642b04..7c1291ae641 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -772,6 +772,24 @@ bms_num_members(const Bitmapset *a)
 	return result;
 }
 
+/*
+ * bms_nth_member - return the nth member, index starts with 0.
+ */
+int
+bms_nth_member(const Bitmapset *a, int i)
+{
+	int idx, res = -1;
+
+	for (idx = 0; idx <= i; idx++)
+	{
+		res = bms_next_member(a, res);
+
+		if (res < 0)
+			elog(ERROR, "no enough members for %d", i);
+	}
+	return res;
+}
+
 /*
  * bms_membership - does a set have zero, one, or multiple members?
  *
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 3e7a133c047..13042dd63c0 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3004,11 +3004,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 }
 
 /*
- * Information about two joined relations, along with the join clauses between.
+ * Information about two joined relations, group by clauses by relids.
  */
 typedef struct JoinPairInfo
 {
-	Bitmapset  *rels;
+	Bitmapset  *relids;
 	List	   *clauses;
 } JoinPairInfo;
 
@@ -3071,9 +3071,9 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 		found = false;
 		for (i = 0; i < cnt; i++)
 		{
-			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			if (bms_is_subset(rinfo->clause_relids, info[i].relids))
 			{
-				info[i].clauses = lappend(info[i].clauses, clause);
+				info[i].clauses = lappend(info[i].clauses, rinfo);
 				found = true;
 				break;
 			}
@@ -3081,14 +3081,17 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 
 		if (!found)
 		{
-			info[cnt].rels = rinfo->clause_relids;
-			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			info[cnt].relids = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, rinfo);
 			cnt++;
 		}
 	}
 
 	if (cnt == 0)
+	{
+		pfree(info);
 		return NULL;
+	}
 
 	*npairs = cnt;
 	return info;
@@ -3112,7 +3115,6 @@ static RelOptInfo *
 extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 					  StatisticExtInfo **stat)
 {
-	int			k;
 	int			relid;
 	RelOptInfo *rel;
 	ListCell   *lc;
@@ -3122,16 +3124,7 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 
 	Assert((index >= 0) && (index <= 1));
 
-	k = -1;
-	while (index >= 0)
-	{
-		k = bms_next_member(info->rels, k);
-		if (k < 0)
-			elog(ERROR, "failed to extract relid");
-
-		relid = k;
-		index--;
-	}
+	relid = bms_nth_member(info->relids, index);
 
 	rel = find_base_rel(root, relid);
 
@@ -3142,9 +3135,10 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 	 */
 	foreach(lc, info->clauses)
 	{
-		ListCell   *lc2;
-		Node	   *clause = (Node *) lfirst(lc);
-		OpExpr	   *opclause = (OpExpr *) clause;
+		ListCell *lc2;
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Node *clause = (Node *) rinfo->clause;
+		OpExpr *opclause = (OpExpr *) clause;
 
 		/* only opclauses supported for now */
 		Assert(is_opclause(clause));
@@ -3185,7 +3179,8 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 			 * relid and maybe keep it as a whole. It should be compatible
 			 * because we already checked it when building the join pairs.
 			 */
-			varnos = pull_varnos(root, arg);
+			varnos = list_cell_number(opclause->args, lc2) == 0 ?
+				rinfo->left_relids : rinfo->right_relids;
 
 			if (relid == bms_singleton_member(varnos))
 				exprs = lappend(exprs, arg);
@@ -3438,8 +3433,9 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		 */
 		foreach(lc, info->clauses)
 		{
-			Node	   *clause = (Node *) lfirst(lc);
-			ListCell   *lc2;
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+			Node *clause = (Node *) rinfo->clause;
+			ListCell *lc2;
 
 			listidx = -1;
 			foreach(lc2, clauses)
@@ -3461,5 +3457,6 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		}
 	}
 
+	pfree(info);
 	return s;
 }
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3169022cd6d..8b09bf16662 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2214,8 +2214,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	MCVList    *mcv1,
 			   *mcv2;
-	int			idx,
-				i,
+	int			i,
 				j;
 	Selectivity s = 0;
 
@@ -2306,25 +2305,15 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
 
-	idx = 0;
-	foreach(lc, clauses)
+	foreach (lc, clauses)
 	{
-		Node	   *clause = (Node *) lfirst(lc);
+		RestrictInfo	*rinfo = (RestrictInfo *) lfirst(lc);
+		Node	   *clause = (Node *) rinfo->clause;
 		OpExpr	   *opexpr;
 		Node	   *expr1,
 				   *expr2;
-		Bitmapset  *relids1,
-				   *relids2;
 
-		/*
-		 * Strip the RestrictInfo node, get the actual clause.
-		 *
-		 * XXX Not sure if we need to care about removing other node types too
-		 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches
-		 * this, but maybe we need to relax it?
-		 */
-		if (IsA(clause, RestrictInfo))
-			clause = (Node *) ((RestrictInfo *) clause)->clause;
+		int		idx = list_cell_number(clauses, lc);
 
 		opexpr = (OpExpr *) clause;
 
@@ -2338,12 +2327,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
 
-		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
-		relids1 = pull_varnos(root, expr1);
-		relids2 = pull_varnos(root, expr2);
-
-		if ((bms_singleton_member(relids1) == rel1->relid) &&
-			(bms_singleton_member(relids2) == rel2->relid))
+		if ((bms_singleton_member(rinfo->left_relids) == rel1->relid) &&
+			(bms_singleton_member(rinfo->right_relids) == rel2->relid))
 		{
 			Oid			collid;
 
@@ -2358,8 +2343,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
 		}
-		else if ((bms_singleton_member(relids2) == rel1->relid) &&
-				 (bms_singleton_member(relids1) == rel2->relid))
+		else if ((bms_singleton_member(rinfo->right_relids) == rel1->relid) &&
+				 (bms_singleton_member(rinfo->left_relids) == rel2->relid))
 		{
 			Oid			collid;
 
@@ -2383,8 +2368,6 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		Assert((indexes2[idx] >= 0) &&
 			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
-
-		idx++;
 	}
 
 	/*
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 283bea5ea96..8d32e7a2447 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -110,6 +110,7 @@ extern bool bms_nonempty_difference(const Bitmapset *a, const Bitmapset *b);
 extern int	bms_singleton_member(const Bitmapset *a);
 extern bool bms_get_singleton_member(const Bitmapset *a, int *member);
 extern int	bms_num_members(const Bitmapset *a);
+extern int  bms_nth_member(const Bitmapset *a, int i);
 
 /* optimized tests when we don't need to know exact membership count: */
 extern BMS_Membership bms_membership(const Bitmapset *a);
-- 
2.45.2

v20240617-0018-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0018-review.patchDownload

From b6d8a8ac9216c5e8a23b45920ecb95473dec13b1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:43:17 +0200
Subject: [PATCH v20240617 18/56] review

---
 src/backend/statistics/mcv.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 8b09bf16662..f3b607d583b 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2307,6 +2307,15 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	foreach (lc, clauses)
 	{
+		/*
+		 * XXX Can we just assume the clause has a RestrictInfo on top? IIRC
+		 * there are cases where we can get here without it (e.g. AND
+		 * clause?).
+		 *
+		 * XXX Not sure if we need to care about removing other node types too
+		 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches
+		 * this, but maybe we need to relax it?
+		 */
 		RestrictInfo	*rinfo = (RestrictInfo *) lfirst(lc);
 		Node	   *clause = (Node *) rinfo->clause;
 		OpExpr	   *opexpr;
-- 
2.45.2

v20240617-0019-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0019-pgindent.patchDownload

From e346b4b0d0fc082186e113b6f110e16ffc021cee Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:44:25 +0200
Subject: [PATCH v20240617 19/56] pgindent

---
 src/backend/nodes/bitmapset.c           |  3 ++-
 src/backend/statistics/extended_stats.c | 10 +++++-----
 src/backend/statistics/mcv.c            |  6 +++---
 src/include/nodes/bitmapset.h           |  2 +-
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 7c1291ae641..110c363b859 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -778,7 +778,8 @@ bms_num_members(const Bitmapset *a)
 int
 bms_nth_member(const Bitmapset *a, int i)
 {
-	int idx, res = -1;
+	int			idx,
+				res = -1;
 
 	for (idx = 0; idx <= i; idx++)
 	{
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 13042dd63c0..0e7dd7c9308 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3135,10 +3135,10 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 	 */
 	foreach(lc, info->clauses)
 	{
-		ListCell *lc2;
+		ListCell   *lc2;
 		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
-		Node *clause = (Node *) rinfo->clause;
-		OpExpr *opclause = (OpExpr *) clause;
+		Node	   *clause = (Node *) rinfo->clause;
+		OpExpr	   *opclause = (OpExpr *) clause;
 
 		/* only opclauses supported for now */
 		Assert(is_opclause(clause));
@@ -3434,8 +3434,8 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		foreach(lc, info->clauses)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
-			Node *clause = (Node *) rinfo->clause;
-			ListCell *lc2;
+			Node	   *clause = (Node *) rinfo->clause;
+			ListCell   *lc2;
 
 			listidx = -1;
 			foreach(lc2, clauses)
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index f3b607d583b..0cbe7821fcc 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2305,7 +2305,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
 
-	foreach (lc, clauses)
+	foreach(lc, clauses)
 	{
 		/*
 		 * XXX Can we just assume the clause has a RestrictInfo on top? IIRC
@@ -2316,13 +2316,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches
 		 * this, but maybe we need to relax it?
 		 */
-		RestrictInfo	*rinfo = (RestrictInfo *) lfirst(lc);
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
 		Node	   *clause = (Node *) rinfo->clause;
 		OpExpr	   *opexpr;
 		Node	   *expr1,
 				   *expr2;
 
-		int		idx = list_cell_number(clauses, lc);
+		int			idx = list_cell_number(clauses, lc);
 
 		opexpr = (OpExpr *) clause;
 
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 8d32e7a2447..101e3740a4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -110,7 +110,7 @@ extern bool bms_nonempty_difference(const Bitmapset *a, const Bitmapset *b);
 extern int	bms_singleton_member(const Bitmapset *a);
 extern bool bms_get_singleton_member(const Bitmapset *a, int *member);
 extern int	bms_num_members(const Bitmapset *a);
-extern int  bms_nth_member(const Bitmapset *a, int i);
+extern int	bms_nth_member(const Bitmapset *a, int i);
 
 /* optimized tests when we don't need to know exact membership count: */
 extern BMS_Membership bms_membership(const Bitmapset *a);
-- 
2.45.2

v20240617-0020-Cache-the-result-of-statext_determine_join.patchtext/x-patch; charset=UTF-8; name=v20240617-0020-Cache-the-result-of-statext_determine_join.patchDownload

From fbf1111d61564a69719f8f9034802c222abcd87a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:49:24 +0200
Subject: [PATCH v20240617 20/56] Cache the result of
 statext_determine_join_restrictions.

It is firstly needed when choosing statext_find_matching_mcv and then it
is needed when mcv_combine_extended, so caching the result to save some
cycles.
---
 src/backend/statistics/extended_stats.c       | 34 ++++++++++++++-----
 src/backend/statistics/mcv.c                  | 27 ++++++---------
 .../statistics/extended_stats_internal.h      |  2 ++
 src/include/statistics/statistics.h           |  3 +-
 4 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 0e7dd7c9308..241c7d4ec35 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2641,7 +2641,8 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 /*
  * statext_find_matching_mcv
- *		Search for a MCV covering all the attributes and expressions.
+ *		Search for a MCV covering all the attributes and expressions and set
+ * the conditions to calculate conditional probability.
  *
  * Picks the extended statistics object to estimate join clause. The statistics
  * object has to have a MCV, and we require it to match all the join conditions
@@ -2668,7 +2669,8 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
  */
 StatisticExtInfo *
 statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
-						  Bitmapset *attnums, List *exprs)
+						  Bitmapset *attnums, List *exprs,
+						  List **base_conditions)
 {
 	ListCell   *l;
 	StatisticExtInfo *mcv = NULL;
@@ -2693,6 +2695,7 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		if (!mcv)
 		{
 			mcv = stat;
+			*base_conditions = statext_determine_join_restrictions(root, rel, mcv);
 			continue;
 		}
 
@@ -2750,8 +2753,13 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		if (list_length(conditions1) > list_length(conditions2))
 		{
 			mcv = stat;
+			*base_conditions = conditions1;
 			continue;
 		}
+		else
+		{
+			*base_conditions = conditions2;
+		}
 
 		/*
 		 * The statistics seem about equal, so just use the narrower one.
@@ -2762,6 +2770,11 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 			bms_num_members(stat->keys) + list_length(stat->exprs))
 		{
 			mcv = stat;
+			*base_conditions = conditions1;
+		}
+		else
+		{
+			*base_conditions = conditions2;
 		}
 	}
 
@@ -2776,7 +2789,7 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
  * and covered by the extended statistics object.
  *
  * When using extended statistics to estimate joins, we can use conditions
- * from base relations to calculate conditional probability
+ * from base relations to calculate conditional probability.
  *
  *    P(join clauses | baserel restrictions)
  *
@@ -3113,7 +3126,7 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
  */
 static RelOptInfo *
 extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
-					  StatisticExtInfo **stat)
+					  StatisticExtInfo **stat, List **base_conditions)
 {
 	int			relid;
 	RelOptInfo *rel;
@@ -3187,7 +3200,7 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 		}
 	}
 
-	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs, base_conditions);
 
 	return rel;
 }
@@ -3305,11 +3318,14 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		StatisticExtInfo *stat1;
 		StatisticExtInfo *stat2;
 
+		List	*base_condition1 = NULL,
+				*base_condition2 = NULL;
+
 		/* extract info about the first relation */
-		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1, &base_condition1);
 
 		/* extract info about the second relation */
-		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2, &base_condition2);
 
 		/*
 		 * We can handle three basic cases:
@@ -3332,7 +3348,9 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		 */
 		if (stat1 && stat2)
 		{
-			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2,
+									  base_condition1, base_condition2,
+									  info[i].clauses);
 		}
 		else if (stat1 && (list_length(info[i].clauses) == 1))
 		{
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 0cbe7821fcc..68a2cff1611 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2208,6 +2208,7 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 Selectivity
 mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *base_cond1, List *base_cond2,
 					 List *clauses)
 {
 	ListCell   *lc;
@@ -2219,12 +2220,10 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	Selectivity s = 0;
 
 	/* match bitmaps and selectivity for baserel conditions (if any) */
-	List	   *exprs1 = NIL,
-			   *exprs2 = NIL;
-	List	   *conditions1 = NIL,
-			   *conditions2 = NIL;
-	bool	   *cmatches1 = NULL,
-			   *cmatches2 = NULL;
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
 
 	double		csel1 = 1.0,
 				csel2 = 1.0;
@@ -2264,28 +2263,24 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	/* should only get here with MCV on both sides */
 	Assert(mcv1 && mcv2);
 
-	/* Determine which baserel clauses to use for conditional probability. */
-	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
-	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
-
 	/*
 	 * Calculate match bitmaps for restrictions on either side of the join
 	 * (there may be none, in which case this will be NULL).
 	 */
-	if (conditions1)
+	if (base_cond1)
 	{
-		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+		cmatches1 = mcv_get_match_bitmap(root, base_cond1,
 										 stat1->keys, stat1->exprs,
 										 mcv1, false);
-		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+		csel1 = clauselist_selectivity(root, base_cond1, rel1->relid, 0, NULL);
 	}
 
-	if (conditions2)
+	if (base_cond2)
 	{
-		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+		cmatches2 = mcv_get_match_bitmap(root, base_cond2,
 										 stat2->keys, stat2->exprs,
 										 mcv2, false);
-		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+		csel2 = clauselist_selectivity(root, base_cond2, rel2->relid, 0, NULL);
 	}
 
 	/*
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index b1f30dfe2ee..1d16366f041 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -141,6 +141,8 @@ extern Selectivity mcv_combine_extended(PlannerInfo *root,
 										RelOptInfo *rel2,
 										StatisticExtInfo *stat1,
 										StatisticExtInfo *stat2,
+										List	*base_cond1,
+										List	*base_cond2,
 										List *clauses);
 
 extern List *statext_determine_join_restrictions(PlannerInfo *root,
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 531feef85a4..d1368a05833 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -128,7 +128,8 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
 extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
-												   Bitmapset *attnums, List *exprs);
+												   Bitmapset *attnums, List *exprs,
+												   List **base_conditions);
 
 extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 									   JoinType jointype, SpecialJoinInfo *sjinfo);
-- 
2.45.2

v20240617-0021-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0021-review.patchDownload

From 8c8273ed818a533cdb866a5a5aa49dfdcc73794d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:53:09 +0200
Subject: [PATCH v20240617 21/56] review

---
 src/backend/statistics/extended_stats.c | 6 ++++++
 src/backend/statistics/mcv.c            | 3 +--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 241c7d4ec35..82c65e38fba 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2745,6 +2745,8 @@ statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
 		 * XXX Or maybe we should simply "count" the restrictions here,
 		 * instead of constructing a list? Probably not a meaningful
 		 * difference in CPU costs or a memory leak.
+		 *
+		 * XXX Why are we recalculating conditions1 here?
 		 */
 		conditions1 = statext_determine_join_restrictions(root, rel, stat);
 		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
@@ -3123,6 +3125,10 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
  * XXX Name should probably start with statext_ too.
  *
  * XXX The 0/1 index seems a bit weird. Is there a better way to do this?
+ *
+ * XXX I somehow dislike the finctions returning a lot of stuff using output
+ * arguments / pointers. Maybe it's time to invent a new struct returned by
+ * this function?
  */
 static RelOptInfo *
 extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 68a2cff1611..88d2f7ee233 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2208,8 +2208,7 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 Selectivity
 mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
-					 List *base_cond1, List *base_cond2,
-					 List *clauses)
+					 List *base_cond1, List *base_cond2, List *clauses)
 {
 	ListCell   *lc;
 
-- 
2.45.2

v20240617-0022-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0022-pgindent.patchDownload

From 7dc14e96632a5bdf4a56355fb865b429b6c4ade1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:54:00 +0200
Subject: [PATCH v20240617 22/56] pgindent

---
 src/backend/statistics/extended_stats.c          | 4 ++--
 src/backend/statistics/mcv.c                     | 8 ++++----
 src/include/statistics/extended_stats_internal.h | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 82c65e38fba..070a362b44e 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3324,8 +3324,8 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 		StatisticExtInfo *stat1;
 		StatisticExtInfo *stat2;
 
-		List	*base_condition1 = NULL,
-				*base_condition2 = NULL;
+		List	   *base_condition1 = NULL,
+				   *base_condition2 = NULL;
 
 		/* extract info about the first relation */
 		rel1 = extract_relation_info(root, &info[i], 0, &stat1, &base_condition1);
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 88d2f7ee233..c91d07d7f10 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2219,10 +2219,10 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	Selectivity s = 0;
 
 	/* match bitmaps and selectivity for baserel conditions (if any) */
-	List   *exprs1 = NIL,
-		   *exprs2 = NIL;
-	bool   *cmatches1 = NULL,
-		   *cmatches2 = NULL;
+	List	   *exprs1 = NIL,
+			   *exprs2 = NIL;
+	bool	   *cmatches1 = NULL,
+			   *cmatches2 = NULL;
 
 	double		csel1 = 1.0,
 				csel2 = 1.0;
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 1d16366f041..5b2be87f886 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -141,8 +141,8 @@ extern Selectivity mcv_combine_extended(PlannerInfo *root,
 										RelOptInfo *rel2,
 										StatisticExtInfo *stat1,
 										StatisticExtInfo *stat2,
-										List	*base_cond1,
-										List	*base_cond2,
+										List *base_cond1,
+										List *base_cond2,
 										List *clauses);
 
 extern List *statext_determine_join_restrictions(PlannerInfo *root,
-- 
2.45.2

v20240617-0023-Simplify-code-by-using-list_cell_number.patchtext/x-patch; charset=UTF-8; name=v20240617-0023-Simplify-code-by-using-list_cell_number.patchDownload

From de3c83283487b6d219d960ffe5d2f49e068051d5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:56:25 +0200
Subject: [PATCH v20240617 23/56] Simplify code by using list_cell_number

instead of maintaining it manually.

and remove the below lines from statext_clauselist_join_selectivity.

	if (!clauses)
		return 1.0;

since it has been handled in clauselist_selectivity_ext.
---
 src/backend/statistics/extended_stats.c | 55 +++++++++----------------
 1 file changed, 19 insertions(+), 36 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 070a362b44e..ab5b8d9a0d4 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2852,15 +2852,6 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	int			left_relid,
 				right_relid;
 
-	/*
-	 * XXX isn't this comment stale after removal of varRelid?
-	 *
-	 * evaluation as a restriction clause, either at scan node or forced
-	 *
-	 * XXX See treat_as_join_clause.
-	 */
-
-	/* XXX Can we rely on always getting RestrictInfo here? */
 	if (!IsA(clause, RestrictInfo))
 		return false;
 
@@ -2908,6 +2899,12 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 		!bms_get_singleton_member(rinfo->right_relids, &right_relid))
 		return false;
 
+	/*
+	 * XXX:
+	 * Join two columns in the same relation is uncommon and
+	 * extract_relation_info requires 2 different relids, so no bother to
+	 * handle it.
+	 */
 	if (left_relid == right_relid)
 		return false;
 
@@ -2927,7 +2924,6 @@ bool
 statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo)
 {
-	int			listidx;
 	int			k;
 	ListCell   *lc;
 	Bitmapset  *relids = NULL;
@@ -2955,15 +2951,11 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 	 * about the part not represented by MCV, which is now based on ndistinct
 	 * estimates.
 	 */
-	listidx = -1;
-	foreach(lc, clauses)
+	foreach (lc, clauses)
 	{
 		Node	   *clause = (Node *) lfirst(lc);
 		RestrictInfo *rinfo;
 
-		/* needs to happen before skipping any clauses */
-		listidx++;
-
 		/*
 		 * Skip clauses that are not join clauses or that we don't know how to
 		 * handle estimate using extended statistics.
@@ -3043,10 +3035,9 @@ static JoinPairInfo *
 statext_build_join_pairs(PlannerInfo *root, List *clauses,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
-	int			cnt;
-	int			listidx;
-	JoinPairInfo *info;
-	ListCell   *lc;
+	int				cnt;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
 
 	/*
 	 * Assume each clause is for a different pair of relations (some of them
@@ -3056,15 +3047,13 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
 	cnt = 0;
 
-	listidx = -1;
 	foreach(lc, clauses)
 	{
-		int			i;
-		bool		found;
-		Node	   *clause = (Node *) lfirst(lc);
-		RestrictInfo *rinfo;
-
-		listidx++;
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+		int				listidx = list_cell_number(clauses, lc);
 
 		/* skip already estimated clauses */
 		if (bms_is_member(listidx, estimatedclauses))
@@ -3276,15 +3265,11 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
-	int			listidx;
-	Selectivity s = 1.0;
+	Selectivity	s = 1.0;
 
 	JoinPairInfo *info;
 	int			ninfo;
 
-	if (!clauses)
-		return 1.0;
-
 	/* extract pairs of joined relations from the list of clauses */
 	info = statext_build_join_pairs(root, clauses,
 									*estimatedclauses, &ninfo);
@@ -3461,12 +3446,10 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 			Node	   *clause = (Node *) rinfo->clause;
 			ListCell   *lc2;
 
-			listidx = -1;
-			foreach(lc2, clauses)
+			foreach (lc2, clauses)
 			{
-				Node	   *clause2 = (Node *) lfirst(lc2);
-
-				listidx++;
+				Node *clause2 = (Node *) lfirst(lc2);
+				int listidx = list_cell_number(clauses, lc2);
 
 				Assert(IsA(clause2, RestrictInfo));
 
-- 
2.45.2

v20240617-0024-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0024-review.patchDownload

From 8c0f0fe03317dd510572ef7f8baed0bcf6411449 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 15:59:32 +0200
Subject: [PATCH v20240617 24/56] review

---
 src/backend/statistics/extended_stats.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index ab5b8d9a0d4..c1a60edcaab 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2852,6 +2852,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	int			left_relid,
 				right_relid;
 
+	/* XXX Can we rely on always getting RestrictInfo here? */
 	if (!IsA(clause, RestrictInfo))
 		return false;
 
@@ -3270,6 +3271,8 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 	JoinPairInfo *info;
 	int			ninfo;
 
+	/* XXX Shouldn't we have at least an assert that (clauses != NULL)? */
+
 	/* extract pairs of joined relations from the list of clauses */
 	info = statext_build_join_pairs(root, clauses,
 									*estimatedclauses, &ninfo);
@@ -3457,6 +3460,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 
 				if (equal(clause, clause2))
 				{
+					/* XXX why not to just call list_cell_number here? */
 					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
 					break;
 				}
-- 
2.45.2

v20240617-0025-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0025-pgindent.patchDownload

From 28faa1b0f78049d546cd829b7941954452a1258a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:00:07 +0200
Subject: [PATCH v20240617 25/56] pgindent

---
 src/backend/statistics/extended_stats.c | 29 ++++++++++++-------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index c1a60edcaab..868fc12e77e 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2901,8 +2901,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 		return false;
 
 	/*
-	 * XXX:
-	 * Join two columns in the same relation is uncommon and
+	 * XXX: Join two columns in the same relation is uncommon and
 	 * extract_relation_info requires 2 different relids, so no bother to
 	 * handle it.
 	 */
@@ -2952,7 +2951,7 @@ statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
 	 * about the part not represented by MCV, which is now based on ndistinct
 	 * estimates.
 	 */
-	foreach (lc, clauses)
+	foreach(lc, clauses)
 	{
 		Node	   *clause = (Node *) lfirst(lc);
 		RestrictInfo *rinfo;
@@ -3036,9 +3035,9 @@ static JoinPairInfo *
 statext_build_join_pairs(PlannerInfo *root, List *clauses,
 						 Bitmapset *estimatedclauses, int *npairs)
 {
-	int				cnt;
-	JoinPairInfo   *info;
-	ListCell	   *lc;
+	int			cnt;
+	JoinPairInfo *info;
+	ListCell   *lc;
 
 	/*
 	 * Assume each clause is for a different pair of relations (some of them
@@ -3050,11 +3049,11 @@ statext_build_join_pairs(PlannerInfo *root, List *clauses,
 
 	foreach(lc, clauses)
 	{
-		int				i;
-		bool			found;
-		Node		   *clause = (Node *) lfirst(lc);
-		RestrictInfo   *rinfo;
-		int				listidx = list_cell_number(clauses, lc);
+		int			i;
+		bool		found;
+		Node	   *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+		int			listidx = list_cell_number(clauses, lc);
 
 		/* skip already estimated clauses */
 		if (bms_is_member(listidx, estimatedclauses))
@@ -3266,7 +3265,7 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 									Bitmapset **estimatedclauses)
 {
 	int			i;
-	Selectivity	s = 1.0;
+	Selectivity s = 1.0;
 
 	JoinPairInfo *info;
 	int			ninfo;
@@ -3449,10 +3448,10 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 			Node	   *clause = (Node *) rinfo->clause;
 			ListCell   *lc2;
 
-			foreach (lc2, clauses)
+			foreach(lc2, clauses)
 			{
-				Node *clause2 = (Node *) lfirst(lc2);
-				int listidx = list_cell_number(clauses, lc2);
+				Node	   *clause2 = (Node *) lfirst(lc2);
+				int			listidx = list_cell_number(clauses, lc2);
 
 				Assert(IsA(clause2, RestrictInfo));
 
-- 
2.45.2

v20240617-0039-add-the-statistic_proc_security_check-chec.patchtext/x-patch; charset=UTF-8; name=v20240617-0039-add-the-statistic_proc_security_check-chec.patchDownload

From fd37d6b50678d6638de4ad8a0a8cad232cadbac7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:41:33 +0200
Subject: [PATCH v20240617 39/56] add the statistic_proc_security_check check.

---
 src/backend/statistics/extended_stats.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index e7caa112ee7..08998f219e1 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3371,14 +3371,10 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 			{
 				/* note we allow use of nullfrac regardless of security check */
 				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
-
-				/*
-				 * FIXME should this call statistic_proc_security_check like
-				 * eqjoinsel?
-				 */
-				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
-											 STATISTIC_KIND_MCV, InvalidOid,
-											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+				if (statistic_proc_security_check(&vardata, F_EQJOINSEL))
+					have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+												 STATISTIC_KIND_MCV, InvalidOid,
+												 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
 			}
 
 			if (have_mcvs)
@@ -3415,14 +3411,10 @@ statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
 			{
 				/* note we allow use of nullfrac regardless of security check */
 				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
-
-				/*
-				 * FIXME should this call statistic_proc_security_check like
-				 * eqjoinsel?
-				 */
-				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
-											 STATISTIC_KIND_MCV, InvalidOid,
-											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+				if (statistic_proc_security_check(&vardata, F_EQJOINSEL))
+					have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+												 STATISTIC_KIND_MCV, InvalidOid,
+												 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
 			}
 
 			if (have_mcvs)
-- 
2.45.2

v20240617-0040-some-code-refactor-as-before.patchtext/x-patch; charset=UTF-8; name=v20240617-0040-some-code-refactor-as-before.patchDownload

From a7b71acb87eaefc558baee084c70f25859650070 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Mon, 8 Apr 2024 17:37:52 +0800
Subject: [PATCH v20240617 40/56] some code refactor as before.

1. use rinfo->left|rigth_relids instead of pull_varnos.
2. use FunctionCallInvoke instead of FunctionCall2Coll.
3. strip RelableType.
---
 src/backend/statistics/mcv.c | 43 +++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 3af717affbc..92cb74df33f 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2725,6 +2725,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 
 	/* info about clauses and how they match to MCV stats */
 	FmgrInfo	opproc;
+	LOCAL_FCINFO(fcinfo, 2);
 	int			index = 0;
 	bool		reverse = false;
 	RangeTblEntry *rte = root->simple_rte_array[rel->relid];
@@ -2770,8 +2771,9 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		OpExpr	   *opexpr;
 		Node	   *expr1,
 				   *expr2;
-		Bitmapset  *relids1,
-				   *relids2;
+		RestrictInfo *rinfo = (RestrictInfo *) clause;
+
+		Assert(IsA(clause, RestrictInfo));
 
 		/*
 		 * Strip the RestrictInfo node, get the actual clause.
@@ -2780,9 +2782,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches
 		 * this, but maybe we need to relax it?
 		 */
-		if (IsA(clause, RestrictInfo))
-			clause = (Node *) ((RestrictInfo *) clause)->clause;
-
+		clause = (Node *) rinfo->clause;
 		opexpr = (OpExpr *) clause;
 
 		/* Make sure we have the expected node type. */
@@ -2790,16 +2790,21 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 		Assert(list_length(opexpr->args) == 2);
 
 		fmgr_info(get_opcode(opexpr->opno), &opproc);
+		InitFunctionCallInfoData(*fcinfo, &opproc, 2, opexpr->inputcollid, NULL, NULL);
+		fcinfo->args[0].isnull = false;
+		fcinfo->args[1].isnull = false;
 
-		/* FIXME strip relabel etc. the way examine_opclause_args does */
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
 
-		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
-		relids1 = pull_varnos(root, expr1);
-		relids2 = pull_varnos(root, expr2);
+		/* strip RelabelType from either side of the expression */
+		if (IsA(expr1, RelabelType))
+			expr1 = (Node *) ((RelabelType *) expr1)->arg;
 
-		if (bms_singleton_member(relids1) == rel->relid)
+		if (IsA(expr2, RelabelType))
+			expr2 = (Node *) ((RelabelType *) expr2)->arg;
+
+		if (bms_singleton_member(rinfo->left_relids) == rel->relid)
 		{
 			Oid			collid;
 
@@ -2810,7 +2815,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
 		}
-		else if (bms_singleton_member(relids2) == rel->relid)
+		else if (bms_singleton_member(rinfo->right_relids) == rel->relid)
 		{
 			Oid			collid;
 
@@ -2877,14 +2882,16 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 			 * FIXME Use appropriate collation.
 			 */
 			if (reverse)
-				match = DatumGetBool(FunctionCall2Coll(&opproc,
-													   InvalidOid,
-													   value2, value1));
+			{
+				fcinfo->args[0].value = value2;
+				fcinfo->args[1].value = value1;
+			}
 			else
-				match = DatumGetBool(FunctionCall2Coll(&opproc,
-													   InvalidOid,
-													   value1, value2));
-
+			{
+				fcinfo->args[0].value = value1;
+				fcinfo->args[1].value = value2;
+			}
+			match = DatumGetBool(FunctionCallInvoke(fcinfo));
 			if (match)
 			{
 				/* XXX Do we need to do something about base frequency? */
-- 
2.45.2

v20240617-0026-Handle-the-RelableType.patchtext/x-patch; charset=UTF-8; name=v20240617-0026-Handle-the-RelableType.patchDownload

From 215be2ec9fba573d0950a1d3bead3f62cf7a070e Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sun, 7 Apr 2024 13:24:59 +0800
Subject: [PATCH v20240617 26/56] Handle the RelableType.

---
 src/backend/statistics/mcv.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index c91d07d7f10..a7ab964ae10 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2326,10 +2326,16 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
 
-		/* FIXME strip relabel etc. the way examine_opclause_args does */
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
 
+		/* strip RelabelType from either side of the expression */
+		if (IsA(expr1, RelabelType))
+			expr1 = (Node *) ((RelabelType *) expr1)->arg;
+
+		if (IsA(expr2, RelabelType))
+			expr2 = (Node *) ((RelabelType *) expr2)->arg;
+
 		if ((bms_singleton_member(rinfo->left_relids) == rel1->relid) &&
 			(bms_singleton_member(rinfo->right_relids) == rel2->relid))
 		{
-- 
2.45.2

v20240617-0027-Use-FunctionCallInvoke-instead-of-Function.patchtext/x-patch; charset=UTF-8; name=v20240617-0027-Use-FunctionCallInvoke-instead-of-Function.patchDownload

From 54c60ce245a908ee4348d6538aa81246fde207c8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:10:55 +0200
Subject: [PATCH v20240617 27/56] Use FunctionCallInvoke instead of
 FunctionCall2Coll

Some stack variables allocation and setup are saved.

A lesson learnt:

FunctionCallInfo  opprocs;

opprocs = (FunctionCallInfo) palloc(SizeForFunctionCallInfo(2) * list_length(clauses));

opprocs[1] points to a opprocs[0].args, which is caused by flexible
array in FunctionCallInfoBaseData. So the above line is pretty error
prone.
---
 src/backend/statistics/mcv.c | 50 ++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index a7ab964ae10..c1a5a7f6a4c 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2246,7 +2246,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
-	FmgrInfo   *opprocs;
+	FmgrInfo   *finfo;
+	FunctionCallInfo  *opprocs;
 	int		   *indexes1,
 			   *indexes2;
 	bool	   *reverse;
@@ -2294,7 +2295,9 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * stats we picked. We do this only once before processing the lists, so
 	 * that we don't have to do that for each MCV item or so.
 	 */
-	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	finfo = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	// opprocs = (FunctionCallInfo *) palloc(SizeForFunctionCallInfo(2) * list_length(clauses));
+	opprocs = (FunctionCallInfo *) palloc(sizeof(FunctionCallInfo *) * list_length(clauses));
 	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
@@ -2315,8 +2318,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		OpExpr	   *opexpr;
 		Node	   *expr1,
 				   *expr2;
-
-		int			idx = list_cell_number(clauses, lc);
+		int		idx = list_cell_number(clauses, lc);
+		FunctionCallInfo fcinfo = palloc(SizeForFunctionCallInfo(2));
 
 		opexpr = (OpExpr *) clause;
 
@@ -2324,7 +2327,14 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		Assert(is_opclause(clause));
 		Assert(list_length(opexpr->args) == 2);
 
-		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+		fmgr_info(get_opcode(opexpr->opno), &finfo[idx]);
+
+		InitFunctionCallInfoData(*fcinfo, &finfo[idx],
+								 2, opexpr->inputcollid,
+								 NULL, NULL);
+		fcinfo->args[0].isnull = false;
+		fcinfo->args[1].isnull = false;
+		opprocs[idx] = fcinfo;
 
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
@@ -2444,12 +2454,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			idx = 0;
 			foreach(lc, clauses)
 			{
-				bool		match;
-				int			index1 = indexes1[idx],
-							index2 = indexes2[idx];
-				Datum		value1,
-							value2;
-				bool		reverse_args = reverse[idx];
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+				FunctionCallInfo	fcinfo = opprocs[idx];
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
@@ -2462,17 +2473,18 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 					/*
 					 * Careful about order of parameters. For same-type
 					 * equality that should not matter, but easy enough.
-					 *
-					 * FIXME Use appropriate collation.
 					 */
 					if (reverse_args)
-						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
-															   InvalidOid,
-															   value2, value1));
+					{
+						fcinfo->args[0].value = value2;
+						fcinfo->args[1].value = value1;
+					}
 					else
-						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
-															   InvalidOid,
-															   value1, value2));
+					{
+						fcinfo->args[0].value = value1;
+						fcinfo->args[1].value = value2;
+					}
+					match = DatumGetBool(FunctionCallInvoke(fcinfo));
 				}
 
 				items_match &= match;
-- 
2.45.2

v20240617-0028-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0028-review.patchDownload

From 57547022e0b17085d972bd9106fe68893f9206af Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:14:22 +0200
Subject: [PATCH v20240617 28/56] review

---
 src/backend/statistics/mcv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index c1a5a7f6a4c..617f33504a8 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2296,6 +2296,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * that we don't have to do that for each MCV item or so.
 	 */
 	finfo = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	// FIXME remove?
 	// opprocs = (FunctionCallInfo *) palloc(SizeForFunctionCallInfo(2) * list_length(clauses));
 	opprocs = (FunctionCallInfo *) palloc(sizeof(FunctionCallInfo *) * list_length(clauses));
 	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
@@ -2334,6 +2335,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 								 NULL, NULL);
 		fcinfo->args[0].isnull = false;
 		fcinfo->args[1].isnull = false;
+
+		/* XXX Do we even need this, if we have finfo? */
 		opprocs[idx] = fcinfo;
 
 		expr1 = linitial(opexpr->args);
-- 
2.45.2

v20240617-0029-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0029-pgindent.patchDownload

From 42dc8242803afdb7f595bdd5661fb88c919cad19 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:14:59 +0200
Subject: [PATCH v20240617 29/56] pgindent

---
 src/backend/statistics/mcv.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 617f33504a8..0db285a8958 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2247,7 +2247,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	/* info about clauses and how they match to MCV stats */
 	FmgrInfo   *finfo;
-	FunctionCallInfo  *opprocs;
+	FunctionCallInfo *opprocs;
 	int		   *indexes1,
 			   *indexes2;
 	bool	   *reverse;
@@ -2296,8 +2296,12 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * that we don't have to do that for each MCV item or so.
 	 */
 	finfo = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
-	// FIXME remove?
-	// opprocs = (FunctionCallInfo *) palloc(SizeForFunctionCallInfo(2) * list_length(clauses));
+	/* FIXME remove? */
+
+	/*
+	 * opprocs = (FunctionCallInfo *) palloc(SizeForFunctionCallInfo(2) *
+	 * list_length(clauses));
+	 */
 	opprocs = (FunctionCallInfo *) palloc(sizeof(FunctionCallInfo *) * list_length(clauses));
 	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
@@ -2319,7 +2323,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		OpExpr	   *opexpr;
 		Node	   *expr1,
 				   *expr2;
-		int		idx = list_cell_number(clauses, lc);
+		int			idx = list_cell_number(clauses, lc);
 		FunctionCallInfo fcinfo = palloc(SizeForFunctionCallInfo(2));
 
 		opexpr = (OpExpr *) clause;
@@ -2457,13 +2461,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			idx = 0;
 			foreach(lc, clauses)
 			{
-				bool	match;
-				int		index1 = indexes1[idx],
-						index2 = indexes2[idx];
-				Datum	value1,
-						value2;
-				bool	reverse_args = reverse[idx];
-				FunctionCallInfo	fcinfo = opprocs[idx];
+				bool		match;
+				int			index1 = indexes1[idx],
+							index2 = indexes2[idx];
+				Datum		value1,
+							value2;
+				bool		reverse_args = reverse[idx];
+				FunctionCallInfo fcinfo = opprocs[idx];
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
-- 
2.45.2

v20240617-0030-optimize-the-order-of-mcv-equal-function-e.patchtext/x-patch; charset=UTF-8; name=v20240617-0030-optimize-the-order-of-mcv-equal-function-e.patchDownload

From 0041de71c24540a1b65b2ddd652bbb905b5937aa Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:19:16 +0200
Subject: [PATCH v20240617 30/56] optimize the order of mcv equal function
 evaluation

using n_dinstinct values.  See the test in ext_sort_mcv_proc.sql
which should not be committed since it is just a manual test.
---
 src/backend/statistics/mcv.c               | 100 ++++++++++++++-------
 src/test/regress/sql/ext_sort_mcv_proc.sql |  30 +++++++
 2 files changed, 97 insertions(+), 33 deletions(-)
 create mode 100644 src/test/regress/sql/ext_sort_mcv_proc.sql

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 0db285a8958..96f4c1b94f6 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -72,6 +72,33 @@
 	 ((ndims) * sizeof(DimensionInfo)) + \
 	 ((nitems) * ITEM_SIZE(ndims)))
 
+// #define  DEBUG_MCV  1 /* should be removed after review. */
+
+typedef struct
+{
+	FmgrInfo	fmgrinfo;
+	FunctionCallInfo fcinfo;
+	double	n_distinct;
+#ifdef DEBUG_MCV
+	int		idx;
+#endif
+} McvProc;
+
+static int
+cmp_mcv_proc(const void *a, const void *b)
+{
+	/* sort the McvProc reversely based on n_distinct value. */
+	McvProc *m1 = (McvProc *) a;
+	McvProc *m2 = (McvProc *) b;
+
+	if (m1->n_distinct > m2->n_distinct)
+		return -1;
+	else if (m1->n_distinct == m2->n_distinct)
+		return 0;
+	else
+		return 1;
+}
+
 static MultiSortSupport build_mss(StatsBuildData *data);
 
 static SortItem *build_distinct_groups(int numrows, SortItem *items,
@@ -2246,8 +2273,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
-	FmgrInfo   *finfo;
-	FunctionCallInfo *opprocs;
+	McvProc		*mcvProc;
 	int		   *indexes1,
 			   *indexes2;
 	bool	   *reverse;
@@ -2295,14 +2321,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * stats we picked. We do this only once before processing the lists, so
 	 * that we don't have to do that for each MCV item or so.
 	 */
-	finfo = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
-	/* FIXME remove? */
-
-	/*
-	 * opprocs = (FunctionCallInfo *) palloc(SizeForFunctionCallInfo(2) *
-	 * list_length(clauses));
-	 */
-	opprocs = (FunctionCallInfo *) palloc(sizeof(FunctionCallInfo *) * list_length(clauses));
+	mcvProc = (McvProc *) palloc(sizeof(McvProc) * list_length(clauses));
 	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
 	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
 	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
@@ -2325,6 +2344,9 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				   *expr2;
 		int			idx = list_cell_number(clauses, lc);
 		FunctionCallInfo fcinfo = palloc(SizeForFunctionCallInfo(2));
+		VariableStatData	vardata;
+		bool	isdefault;
+		Node	*left_expr;
 
 		opexpr = (OpExpr *) clause;
 
@@ -2332,17 +2354,18 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		Assert(is_opclause(clause));
 		Assert(list_length(opexpr->args) == 2);
 
-		fmgr_info(get_opcode(opexpr->opno), &finfo[idx]);
-
-		InitFunctionCallInfoData(*fcinfo, &finfo[idx],
+		fmgr_info(get_opcode(opexpr->opno), &mcvProc[idx].fmgrinfo);
+		mcvProc[idx].fcinfo = fcinfo;
+#ifdef DEBUG_MCV
+		mcvProc[idx].idx = idx;
+#endif
+		InitFunctionCallInfoData(*mcvProc[idx].fcinfo,
+								 &mcvProc[idx].fmgrinfo,
 								 2, opexpr->inputcollid,
 								 NULL, NULL);
 		fcinfo->args[0].isnull = false;
 		fcinfo->args[1].isnull = false;
 
-		/* XXX Do we even need this, if we have finfo? */
-		opprocs[idx] = fcinfo;
-
 		expr1 = linitial(opexpr->args);
 		expr2 = lsecond(opexpr->args);
 
@@ -2368,6 +2391,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
+
+			left_expr = expr1;
 		}
 		else if ((bms_singleton_member(rinfo->right_relids) == rel1->relid) &&
 				 (bms_singleton_member(rinfo->left_relids) == rel2->relid))
@@ -2384,6 +2409,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 			exprs1 = lappend(exprs1, expr2);
 			exprs2 = lappend(exprs2, expr1);
+
+			left_expr = expr2;
 		}
 		else
 			/* should never happen */
@@ -2394,8 +2421,23 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		Assert((indexes2[idx] >= 0) &&
 			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		examine_variable(root, left_expr, rel1->relid, &vardata);
+		mcvProc[idx].n_distinct = get_variable_numdistinct(&vardata, &isdefault);
+		// elog(INFO, "n_distinct = %f", mcvProc[idx].n_distinct);
+		ReleaseVariableStats(vardata);
 	}
 
+	/* order the McvProc */
+	pg_qsort(mcvProc, list_length(clauses), sizeof(McvProc), cmp_mcv_proc);
+
+#ifdef DEBUG_MCV
+	for (i = 0; i < list_length(clauses); i++)
+	{
+		elog(INFO, "%d", mcvProc[i].idx);
+	}
+#endif
+
 	/*
 	 * Match items between the two MCV lists.
 	 *
@@ -2451,23 +2493,17 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			 */
 
 			/*
-			 * Evaluate if all the join clauses match between the two MCV
-			 * items.
-			 *
-			 * XXX We might optimize the order of evaluation, using the costs
-			 * of operator functions for individual columns. It does depend on
-			 * the number of distinct values, etc.
+			 * Evaluate if all the join clauses match between the two MCV items.
 			 */
-			idx = 0;
-			foreach(lc, clauses)
+			for(idx = 0; idx < list_length(clauses); idx++)
 			{
-				bool		match;
-				int			index1 = indexes1[idx],
-							index2 = indexes2[idx];
-				Datum		value1,
-							value2;
-				bool		reverse_args = reverse[idx];
-				FunctionCallInfo fcinfo = opprocs[idx];
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+				FunctionCallInfo	fcinfo = mcvProc[idx].fcinfo;
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
@@ -2498,8 +2534,6 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 				if (!items_match)
 					break;
-
-				idx++;
 			}
 
 			if (items_match)
diff --git a/src/test/regress/sql/ext_sort_mcv_proc.sql b/src/test/regress/sql/ext_sort_mcv_proc.sql
new file mode 100644
index 00000000000..09360d5b23e
--- /dev/null
+++ b/src/test/regress/sql/ext_sort_mcv_proc.sql
@@ -0,0 +1,30 @@
+create table t(level_1 text, level_2 text, level_3 text);
+
+insert into t
+values
+('l11', 'l21', 'l31'),
+('l11', 'l21', 'l32'),
+('l11', 'l21', 'l33'),
+('l11', 'l22', 'l34'),
+('l11', 'l22', 'l35'),
+('l11', 'l22', 'l36');
+
+create statistics on level_1, level_2, level_3 from t;
+
+analyze t;
+
+explain select * from t t1 join t t2 using(level_1, level_2, level_3);
+INFO:  n_distinct = 1.000000
+INFO:  n_distinct = 2.000000
+INFO:  n_distinct = 6.000000
+INFO:  2
+INFO:  1
+INFO:  0
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Hash Join  (cost=1.17..2.32 rows=6 width=12)
+   Hash Cond: ((t1.level_1 = t2.level_1) AND (t1.level_2 = t2.level_2) AND (t1.level_3 = t2.level_3))
+   ->  Seq Scan on t t1  (cost=0.00..1.06 rows=6 width=12)
+   ->  Hash  (cost=1.06..1.06 rows=6 width=12)
+         ->  Seq Scan on t t2  (cost=0.00..1.06 rows=6 width=12)
+(5 rows)
-- 
2.45.2

v20240617-0031-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0031-review.patchDownload

From 90874e0fd9a6e46dc6717b2ab3fe27aa9fe918d2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:25:27 +0200
Subject: [PATCH v20240617 31/56] review

---
 src/backend/statistics/mcv.c               | 12 ++++++++++++
 src/test/regress/sql/ext_sort_mcv_proc.sql |  3 +++
 2 files changed, 15 insertions(+)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 96f4c1b94f6..367efdbb391 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -74,6 +74,13 @@
 
 // #define  DEBUG_MCV  1 /* should be removed after review. */
 
+/*
+ * XXX The patch says "optimize the order of mcv equal function evaluation"
+ * but how is this new struct related to that?
+ *
+ * XXX Anyway, if we want to do this, surely this is not the only opportunity
+ * to replace a couple separate variables with a struct wrapping them logically?
+ */
 typedef struct
 {
 	FmgrInfo	fmgrinfo;
@@ -84,6 +91,11 @@ typedef struct
 #endif
 } McvProc;
 
+/*
+ * XXX What's the reasoning behind reordering the functions like this? Doesn't it
+ * have the same issues with unpredictable behavior like the GROUP BY patch, which
+ * got eventually reverted and reworked?
+ */
 static int
 cmp_mcv_proc(const void *a, const void *b)
 {
diff --git a/src/test/regress/sql/ext_sort_mcv_proc.sql b/src/test/regress/sql/ext_sort_mcv_proc.sql
index 09360d5b23e..14677dd02f7 100644
--- a/src/test/regress/sql/ext_sort_mcv_proc.sql
+++ b/src/test/regress/sql/ext_sort_mcv_proc.sql
@@ -1,3 +1,6 @@
+-- FIXME would be nice to explain what's the point of the test, what it tries to verify
+-- FIXME maybe use "costs off"?
+-- FIXME where's the expected output for the test? also, not added to the schedule
 create table t(level_1 text, level_2 text, level_3 text);
 
 insert into t
-- 
2.45.2

v20240617-0032-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0032-pgindent.patchDownload

From 97161a66899c9ed3ea09ea6473303bd4245b62e4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:33:05 +0200
Subject: [PATCH v20240617 32/56] pgindent

---
 src/backend/statistics/mcv.c | 41 ++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 367efdbb391..db2768eef46 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -72,7 +72,7 @@
 	 ((ndims) * sizeof(DimensionInfo)) + \
 	 ((nitems) * ITEM_SIZE(ndims)))
 
-// #define  DEBUG_MCV  1 /* should be removed after review. */
+ /*  #define  DEBUG_MCV  1 /* should be removed after review. */ * /
 
 /*
  * XXX The patch says "optimize the order of mcv equal function evaluation"
@@ -85,11 +85,11 @@ typedef struct
 {
 	FmgrInfo	fmgrinfo;
 	FunctionCallInfo fcinfo;
-	double	n_distinct;
+	double		n_distinct;
 #ifdef DEBUG_MCV
-	int		idx;
+	int			idx;
 #endif
-} McvProc;
+}			McvProc;
 
 /*
  * XXX What's the reasoning behind reordering the functions like this? Doesn't it
@@ -100,8 +100,8 @@ static int
 cmp_mcv_proc(const void *a, const void *b)
 {
 	/* sort the McvProc reversely based on n_distinct value. */
-	McvProc *m1 = (McvProc *) a;
-	McvProc *m2 = (McvProc *) b;
+	McvProc    *m1 = (McvProc *) a;
+	McvProc    *m2 = (McvProc *) b;
 
 	if (m1->n_distinct > m2->n_distinct)
 		return -1;
@@ -2285,7 +2285,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
-	McvProc		*mcvProc;
+	McvProc    *mcvProc;
 	int		   *indexes1,
 			   *indexes2;
 	bool	   *reverse;
@@ -2356,9 +2356,9 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				   *expr2;
 		int			idx = list_cell_number(clauses, lc);
 		FunctionCallInfo fcinfo = palloc(SizeForFunctionCallInfo(2));
-		VariableStatData	vardata;
-		bool	isdefault;
-		Node	*left_expr;
+		VariableStatData vardata;
+		bool		isdefault;
+		Node	   *left_expr;
 
 		opexpr = (OpExpr *) clause;
 
@@ -2436,7 +2436,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		examine_variable(root, left_expr, rel1->relid, &vardata);
 		mcvProc[idx].n_distinct = get_variable_numdistinct(&vardata, &isdefault);
-		// elog(INFO, "n_distinct = %f", mcvProc[idx].n_distinct);
+		/* elog(INFO, "n_distinct = %f", mcvProc[idx].n_distinct); */
 		ReleaseVariableStats(vardata);
 	}
 
@@ -2505,17 +2505,18 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			 */
 
 			/*
-			 * Evaluate if all the join clauses match between the two MCV items.
+			 * Evaluate if all the join clauses match between the two MCV
+			 * items.
 			 */
-			for(idx = 0; idx < list_length(clauses); idx++)
+			for (idx = 0; idx < list_length(clauses); idx++)
 			{
-				bool	match;
-				int		index1 = indexes1[idx],
-						index2 = indexes2[idx];
-				Datum	value1,
-						value2;
-				bool	reverse_args = reverse[idx];
-				FunctionCallInfo	fcinfo = mcvProc[idx].fcinfo;
+				bool		match;
+				int			index1 = indexes1[idx],
+							index2 = indexes2[idx];
+				Datum		value1,
+							value2;
+				bool		reverse_args = reverse[idx];
+				FunctionCallInfo fcinfo = mcvProc[idx].fcinfo;
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
-- 
2.45.2

v20240617-0033-Merge-3-palloc-into-1-palloc.patchtext/x-patch; charset=UTF-8; name=v20240617-0033-Merge-3-palloc-into-1-palloc.patchDownload

From 4a91dc41b225da8c5df12a92f1ab2e530df55877 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:34:38 +0200
Subject: [PATCH v20240617 33/56] Merge 3 palloc into 1 palloc

1. Merge 3 palloc into 1 palloc to save 2 palloc calls.

2. A question from me, search: "From Andy".
---
 src/backend/statistics/mcv.c | 55 ++++++++++++++++++++----------------
 1 file changed, 30 insertions(+), 25 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index db2768eef46..11f5fb9f5f5 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -91,6 +91,13 @@ typedef struct
 #endif
 }			McvProc;
 
+typedef struct
+{
+	int 	index1;
+	int 	index2;
+	bool 	reverse;
+} McvClauseInfo;
+
 /*
  * XXX What's the reasoning behind reordering the functions like this? Doesn't it
  * have the same issues with unpredictable behavior like the GROUP BY patch, which
@@ -2285,10 +2292,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
-	McvProc    *mcvProc;
-	int		   *indexes1,
-			   *indexes2;
-	bool	   *reverse;
+	McvProc		*mcvProc;
+	McvClauseInfo	*cinfo;
 	RangeTblEntry *rte1 = root->simple_rte_array[rel1->relid];
 	RangeTblEntry *rte2 = root->simple_rte_array[rel2->relid];
 
@@ -2334,9 +2339,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	 * that we don't have to do that for each MCV item or so.
 	 */
 	mcvProc = (McvProc *) palloc(sizeof(McvProc) * list_length(clauses));
-	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
-	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
-	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+	cinfo = (McvClauseInfo *) palloc(sizeof(McvClauseInfo) * list_length(clauses));
 
 	foreach(lc, clauses)
 	{
@@ -2393,13 +2396,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		{
 			Oid			collid;
 
-			indexes1[idx] = mcv_match_expression(expr1,
+			cinfo[idx].index1 = mcv_match_expression(expr1,
 												 stat1->keys, stat1->exprs,
 												 &collid);
-			indexes2[idx] = mcv_match_expression(expr2,
+			cinfo[idx].index2 = mcv_match_expression(expr2,
 												 stat2->keys, stat2->exprs,
 												 &collid);
-			reverse[idx] = false;
+			cinfo[idx].reverse = false;
 
 			exprs1 = lappend(exprs1, expr1);
 			exprs2 = lappend(exprs2, expr2);
@@ -2411,13 +2414,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		{
 			Oid			collid;
 
-			indexes1[idx] = mcv_match_expression(expr2,
+			cinfo[idx].index1 = mcv_match_expression(expr2,
 												 stat2->keys, stat2->exprs,
 												 &collid);
-			indexes2[idx] = mcv_match_expression(expr1,
+			cinfo[idx].index2 = mcv_match_expression(expr1,
 												 stat1->keys, stat1->exprs,
 												 &collid);
-			reverse[idx] = true;
+			cinfo[idx].reverse = true;
 
 			exprs1 = lappend(exprs1, expr2);
 			exprs2 = lappend(exprs2, expr1);
@@ -2428,11 +2431,11 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			/* should never happen */
 			Assert(false);
 
-		Assert((indexes1[idx] >= 0) &&
-			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+		Assert((cinfo[idx].index1 >= 0) &&
+			   (cinfo[idx].index1 < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
 
-		Assert((indexes2[idx] >= 0) &&
-			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+		Assert((cinfo[idx].index2 >= 0) &&
+			   (cinfo[idx].index2 < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
 
 		examine_variable(root, left_expr, rel1->relid, &vardata);
 		mcvProc[idx].n_distinct = get_variable_numdistinct(&vardata, &isdefault);
@@ -2484,7 +2487,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		 */
 		has_nulls = false;
 		for (j = 0; j < list_length(clauses); j++)
-			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+			has_nulls |= mcv1->items[i].isnull[cinfo[j].index1];
 
 		if (has_nulls)
 			continue;
@@ -2502,6 +2505,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			/*
 			 * XXX We can't skip based on existing matches2 value, because
 			 * there may be duplicates in the first MCV.
+			 *
+			 * From Andy: what does this mean?
 			 */
 
 			/*
@@ -2510,13 +2515,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			 */
 			for (idx = 0; idx < list_length(clauses); idx++)
 			{
-				bool		match;
-				int			index1 = indexes1[idx],
-							index2 = indexes2[idx];
-				Datum		value1,
-							value2;
-				bool		reverse_args = reverse[idx];
-				FunctionCallInfo fcinfo = mcvProc[idx].fcinfo;
+				bool	match;
+				int		index1 = cinfo[idx].index1,
+						index2 = cinfo[idx].index2;
+				Datum	value1,
+						value2;
+				bool	reverse_args = cinfo[idx].reverse;
+				FunctionCallInfo	fcinfo = mcvProc[idx].fcinfo;
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
-- 
2.45.2

v20240617-0034-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0034-review.patchDownload

From c9aab87694300ddf6f5152b1b4945a766ed5c0ea Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:36:55 +0200
Subject: [PATCH v20240617 34/56] review

---
 src/backend/statistics/mcv.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 11f5fb9f5f5..503cdc1bc4a 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -91,6 +91,11 @@ typedef struct
 #endif
 }			McvProc;
 
+/*
+ * XXX I kinda doubt this makes performance difference (thanks to caching in
+ * the memory contexts), but it does seem to make the code easier to read,
+ * which is nice.
+ */
 typedef struct
 {
 	int 	index1;
-- 
2.45.2

v20240617-0035-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0035-pgindent.patchDownload

From 33397f0a62af9d56e7c92494a0138d4c97c8d877 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:37:28 +0200
Subject: [PATCH v20240617 35/56] pgindent

---
 src/backend/statistics/mcv.c | 44 ++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 503cdc1bc4a..3af717affbc 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -72,7 +72,7 @@
 	 ((ndims) * sizeof(DimensionInfo)) + \
 	 ((nitems) * ITEM_SIZE(ndims)))
 
- /*  #define  DEBUG_MCV  1 /* should be removed after review. */ * /
+ /* #define  DEBUG_MCV  1 /* should be removed after review. */ * /
 
 /*
  * XXX The patch says "optimize the order of mcv equal function evaluation"
@@ -98,10 +98,10 @@ typedef struct
  */
 typedef struct
 {
-	int 	index1;
-	int 	index2;
-	bool 	reverse;
-} McvClauseInfo;
+	int			index1;
+	int			index2;
+	bool		reverse;
+}			McvClauseInfo;
 
 /*
  * XXX What's the reasoning behind reordering the functions like this? Doesn't it
@@ -2297,8 +2297,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
-	McvProc		*mcvProc;
-	McvClauseInfo	*cinfo;
+	McvProc    *mcvProc;
+	McvClauseInfo *cinfo;
 	RangeTblEntry *rte1 = root->simple_rte_array[rel1->relid];
 	RangeTblEntry *rte2 = root->simple_rte_array[rel2->relid];
 
@@ -2402,11 +2402,11 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			Oid			collid;
 
 			cinfo[idx].index1 = mcv_match_expression(expr1,
-												 stat1->keys, stat1->exprs,
-												 &collid);
+													 stat1->keys, stat1->exprs,
+													 &collid);
 			cinfo[idx].index2 = mcv_match_expression(expr2,
-												 stat2->keys, stat2->exprs,
-												 &collid);
+													 stat2->keys, stat2->exprs,
+													 &collid);
 			cinfo[idx].reverse = false;
 
 			exprs1 = lappend(exprs1, expr1);
@@ -2420,11 +2420,11 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			Oid			collid;
 
 			cinfo[idx].index1 = mcv_match_expression(expr2,
-												 stat2->keys, stat2->exprs,
-												 &collid);
+													 stat2->keys, stat2->exprs,
+													 &collid);
 			cinfo[idx].index2 = mcv_match_expression(expr1,
-												 stat1->keys, stat1->exprs,
-												 &collid);
+													 stat1->keys, stat1->exprs,
+													 &collid);
 			cinfo[idx].reverse = true;
 
 			exprs1 = lappend(exprs1, expr2);
@@ -2520,13 +2520,13 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			 */
 			for (idx = 0; idx < list_length(clauses); idx++)
 			{
-				bool	match;
-				int		index1 = cinfo[idx].index1,
-						index2 = cinfo[idx].index2;
-				Datum	value1,
-						value2;
-				bool	reverse_args = cinfo[idx].reverse;
-				FunctionCallInfo	fcinfo = mcvProc[idx].fcinfo;
+				bool		match;
+				int			index1 = cinfo[idx].index1,
+							index2 = cinfo[idx].index2;
+				Datum		value1,
+							value2;
+				bool		reverse_args = cinfo[idx].reverse;
+				FunctionCallInfo fcinfo = mcvProc[idx].fcinfo;
 
 				/* If either value is null, it's a mismatch */
 				if (mcv2->items[j].isnull[index2])
-- 
2.45.2

v20240617-0036-Remove-2-pull_varnos-calls-with-rinfo-left.patchtext/x-patch; charset=UTF-8; name=v20240617-0036-Remove-2-pull_varnos-calls-with-rinfo-left.patchDownload

From b34e58469a92a1d4c5d4266cf6632c78802c0c16 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:38:00 +0200
Subject: [PATCH v20240617 36/56] Remove 2 pull_varnos calls with
 rinfo->left|right_relids.

---
 src/backend/statistics/extended_stats.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 868fc12e77e..c06829eba31 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3216,8 +3216,11 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 static Node *
 get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 {
-	OpExpr	   *opexpr;
-	Node	   *expr;
+	OpExpr *opexpr;
+	Node   *expr;
+	RestrictInfo *rinfo = (RestrictInfo *) clause;
+
+	Assert(IsA(clause, RestrictInfo));
 
 	/*
 	 * Strip the RestrictInfo node, get the actual clause.
@@ -3226,8 +3229,7 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 	 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches this,
 	 * but maybe we need to relax it?
 	 */
-	if (IsA(clause, RestrictInfo))
-		clause = (Node *) ((RestrictInfo *) clause)->clause;
+	clause = (Node *) rinfo->clause;
 
 	opexpr = (OpExpr *) clause;
 
@@ -3237,11 +3239,11 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 
 	/* FIXME strip relabel etc. the way examine_opclause_args does */
 	expr = linitial(opexpr->args);
-	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+	if (bms_singleton_member(rinfo->left_relids) == rel->relid)
 		return expr;
 
 	expr = lsecond(opexpr->args);
-	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+	if (bms_singleton_member(rinfo->right_relids) == rel->relid)
 		return expr;
 
 	return NULL;
-- 
2.45.2

v20240617-0037-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0037-review.patchDownload

From 2889d4dab0ab4a86d24054edc95982855a19a4a9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:39:01 +0200
Subject: [PATCH v20240617 37/56] review

---
 src/backend/statistics/extended_stats.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index c06829eba31..0b0d0ce33b9 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3228,6 +3228,8 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 	 * XXX Not sure if we need to care about removing other node types too
 	 * (e.g. RelabelType etc.). statext_is_supported_join_clause matches this,
 	 * but maybe we need to relax it?
+	 *
+	 * XXX Can we be sure there always is RestrictInfo?
 	 */
 	clause = (Node *) rinfo->clause;
 
-- 
2.45.2

v20240617-0041-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0041-review.patchDownload

From 5c4c67e58229897688a083cae0320a7aa89de0d4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:47:21 +0200
Subject: [PATCH v20240617 41/56] review

---
 src/backend/statistics/mcv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 92cb74df33f..55fee0ceebd 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2685,6 +2685,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
  * Most of the mcv_combine_extended comment applies here too, but we can make
  * some simplifications because we know the second (per-column) MCV is simpler,
  * contains no NULL or duplicate values, etc.
+ *
+ * XXX May make sense, but seems rather independent of this patch series.
  */
 Selectivity
 mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
-- 
2.45.2

v20240617-0042-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0042-pgindent.patchDownload

From a95e3801ec9cd0642ca92d8179bdd9e332275f25 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:48:01 +0200
Subject: [PATCH v20240617 42/56] pgindent

---
 src/backend/statistics/mcv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 55fee0ceebd..e1c5abf3148 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2727,6 +2727,7 @@ mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
 
 	/* info about clauses and how they match to MCV stats */
 	FmgrInfo	opproc;
+
 	LOCAL_FCINFO(fcinfo, 2);
 	int			index = 0;
 	bool		reverse = false;
-- 
2.45.2

v20240617-0043-Fix-error-unexpected-system-attribute-when.patchtext/x-patch; charset=UTF-8; name=v20240617-0043-Fix-error-unexpected-system-attribute-when.patchDownload

From 7654eb8e59e8fe5e918e517aa64ebeb5767da87c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:48:30 +0200
Subject: [PATCH v20240617 43/56] Fix error "unexpected system attribute" when
 join with system attr

We can't just change 'elog(ERROR, "unexpected system attribute");' to
'continue' in extract_relation_info since after we extract the
StatisticExtInfo, and stat is not NULL, we grantee the expression in
JoinPairInfo.clause has a matched expression with mcv_match_expression,
however this is not true for system attribute. so fix it at the first
place when populate the clause into JoinPairInfo.clauses which is the
statext_is_supported_join_clause function. Expression contains a system
attribute is OK since due to the implementation of mcv_match_expression
so only Var need to be handled there.
---
 src/backend/statistics/extended_stats.c | 27 ++++++++++++++++++++-----
 src/test/regress/sql/stats_ext.sql      |  6 ++++++
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 08998f219e1..d6f1c70ae64 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2846,11 +2846,11 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
 static bool
 statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 {
-	Oid			oprsel;
-	RestrictInfo *rinfo;
-	OpExpr	   *opclause;
-	int			left_relid,
-				right_relid;
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	int				left_relid, right_relid;
+	Var			   *var;
 
 	/* XXX Can we rely on always getting RestrictInfo here? */
 	if (!IsA(clause, RestrictInfo))
@@ -2908,6 +2908,14 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	if (left_relid == right_relid)
 		return false;
 
+	var = (Var *) linitial(opclause->args);
+	if (IsA(var, Var) && var->varattno < 0)
+		return false;
+
+	var = (Var *) lsecond(opclause->args);
+	if (IsA(var, Var) && var->varattno < 0)
+		return false;
+
 	return true;
 }
 
@@ -3195,6 +3203,15 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 		}
 	}
 
+	/*
+	 * Find a stat which covers *all* the attnums and exprs for simplification.
+	 *
+	 * To overcome above limitation, statext_find_matching_mcv has to smart enough to
+	 * decide which expression to discard as the first step. and later the other
+	 * side of join has to use a stats which match or superset of expression here.
+	 * at last mcv_combine_extended should be improved to handle the not-exactly-same
+	 * mcv.
+	 */
 	*stat = statext_find_matching_mcv(root, rel, attnums, exprs, base_conditions);
 
 	return rel;
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index e372fffebfb..798ee78265e 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1595,6 +1595,12 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
 
+-- test join with system column var, but the ext statistics can't be built in system attribute AND extended statistics
+-- must covers all the join columns, so the following 2 statements can use extended statistics for join.
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin = j2.cmin');
+-- Join with system column expression.
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin::text::int4 = j2.cmin::text::int4');
+
 -- try combining with single-column (and single-expression) statistics
 DROP STATISTICS join_stats_2;
 
-- 
2.45.2

v20240617-0044-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0044-review.patchDownload

From 471850741895d009556447c56e1c5c88f0af468f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:53:27 +0200
Subject: [PATCH v20240617 44/56] review

---
 src/backend/statistics/extended_stats.c | 5 +++++
 src/test/regress/sql/stats_ext.sql      | 1 +
 2 files changed, 6 insertions(+)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index d6f1c70ae64..1289b0a2c53 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2908,6 +2908,7 @@ statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 	if (left_relid == right_relid)
 		return false;
 
+	/* FIXME add some comments explainig why we need to do this */
 	var = (Var *) linitial(opclause->args);
 	if (IsA(var, Var) && var->varattno < 0)
 		return false;
@@ -3211,6 +3212,10 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 	 * side of join has to use a stats which match or superset of expression here.
 	 * at last mcv_combine_extended should be improved to handle the not-exactly-same
 	 * mcv.
+	 *
+	 * XXX I don't understand what "above limitation" this refers to. The comment
+	 * would benefit from some clarification, but I'm not sure what it's trying
+	 * to say exactly :-(
 	 */
 	*stat = statext_find_matching_mcv(root, rel, attnums, exprs, base_conditions);
 
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 798ee78265e..e030560a4c5 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1597,6 +1597,7 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
 
 -- test join with system column var, but the ext statistics can't be built in system attribute AND extended statistics
 -- must covers all the join columns, so the following 2 statements can use extended statistics for join.
+-- FIXME needs to be added to the expected output
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin = j2.cmin');
 -- Join with system column expression.
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin::text::int4 = j2.cmin::text::int4');
-- 
2.45.2

v20240617-0045-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0045-pgindent.patchDownload

From ec8ab6f792341a9517091003f3b4285f0f1920de Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:54:33 +0200
Subject: [PATCH v20240617 45/56] pgindent

---
 src/backend/statistics/extended_stats.c | 30 +++++++++++++------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 1289b0a2c53..3ce55ae4a5f 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2846,11 +2846,12 @@ statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
 static bool
 statext_is_supported_join_clause(PlannerInfo *root, Node *clause)
 {
-	Oid	oprsel;
-	RestrictInfo   *rinfo;
-	OpExpr		   *opclause;
-	int				left_relid, right_relid;
-	Var			   *var;
+	Oid			oprsel;
+	RestrictInfo *rinfo;
+	OpExpr	   *opclause;
+	int			left_relid,
+				right_relid;
+	Var		   *var;
 
 	/* XXX Can we rely on always getting RestrictInfo here? */
 	if (!IsA(clause, RestrictInfo))
@@ -3205,17 +3206,18 @@ extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
 	}
 
 	/*
-	 * Find a stat which covers *all* the attnums and exprs for simplification.
+	 * Find a stat which covers *all* the attnums and exprs for
+	 * simplification.
 	 *
-	 * To overcome above limitation, statext_find_matching_mcv has to smart enough to
-	 * decide which expression to discard as the first step. and later the other
-	 * side of join has to use a stats which match or superset of expression here.
-	 * at last mcv_combine_extended should be improved to handle the not-exactly-same
-	 * mcv.
+	 * To overcome above limitation, statext_find_matching_mcv has to smart
+	 * enough to decide which expression to discard as the first step. and
+	 * later the other side of join has to use a stats which match or superset
+	 * of expression here. at last mcv_combine_extended should be improved to
+	 * handle the not-exactly-same mcv.
 	 *
-	 * XXX I don't understand what "above limitation" this refers to. The comment
-	 * would benefit from some clarification, but I'm not sure what it's trying
-	 * to say exactly :-(
+	 * XXX I don't understand what "above limitation" this refers to. The
+	 * comment would benefit from some clarification, but I'm not sure what
+	 * it's trying to say exactly :-(
 	 */
 	*stat = statext_find_matching_mcv(root, rel, attnums, exprs, base_conditions);
 
-- 
2.45.2

v20240617-0046-Fix-the-incorrect-comment-on-extended-stat.patchtext/x-patch; charset=UTF-8; name=v20240617-0046-Fix-the-incorrect-comment-on-extended-stat.patchDownload

From 59fc8eb1e384a268a4de1df5ddd453711835f5b9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:55:30 +0200
Subject: [PATCH v20240617 46/56] Fix the incorrect comment on extended stats.

Comments (either extended_stats.c or stats_ext.sql) says we must needs
multiple join clauses, but it has been handled in
clauselist_selectivity_ext already with the below code.

single_clause_optimization
	= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
---
 src/backend/statistics/extended_stats.c | 12 ++----------
 src/test/regress/expected/stats_ext.out | 19 +++++++++++++++----
 src/test/regress/sql/stats_ext.sql      |  4 ----
 3 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 3ce55ae4a5f..d22d511b09a 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -3277,16 +3277,8 @@ get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
 
 /*
  * statext_clauselist_join_selectivity
- *		Use extended stats to estimate join clauses.
- *
- * XXX In principle, we should not restrict this to cases with multiple
- * join clauses - we should consider dependencies with conditions at the
- * base relations, i.e. calculate P(join clause | base restrictions).
- * But currently that does not happen, because clauselist_selectivity_ext
- * treats a single clause as a special case (and we don't apply extended
- * statistics in that case yet).
- *
- * XXX Isn't the preceding comment stale? We skip the optimization, no?
+ *		Use extended stats to estimate join clauses. the limitation is the
+ * extended statistics must covers all the join clauses.
  */
 Selectivity
 statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index b08bf951e4d..fd8df01f309 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3111,8 +3111,6 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
        100 |      0
 (1 row)
 
--- can't be improved due to the optimization in clauselist_selectivity_ext,
--- which skips cases with a single (join) clause
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
  estimated | actual 
 -----------+--------
@@ -3178,8 +3176,6 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
          1 |      0
 (1 row)
 
--- can't be improved due to the optimization in clauselist_selectivity_ext,
--- which skips cases with a single (join) clause
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
  estimated | actual 
 -----------+--------
@@ -3210,6 +3206,21 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
      50000 |  50000
 (1 row)
 
+-- test join with system column var, but the ext statistics can't be built in system attribute AND extended statistics
+-- must covers all the join columns, so the following 2 statements can use extended statistics for join.
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin = j2.cmin');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+-- Join with system column expression.
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin::text::int4 = j2.cmin::text::int4');
+ estimated | actual 
+-----------+--------
+        50 | 100000
+(1 row)
+
 -- try combining with single-column (and single-expression) statistics
 DROP STATISTICS join_stats_2;
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index e030560a4c5..ca0001605cb 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1564,8 +1564,6 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
 
--- can't be improved due to the optimization in clauselist_selectivity_ext,
--- which skips cases with a single (join) clause
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
@@ -1586,8 +1584,6 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
 
--- can't be improved due to the optimization in clauselist_selectivity_ext,
--- which skips cases with a single (join) clause
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
-- 
2.45.2

v20240617-0047-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0047-review.patchDownload

From b687a719c7b5a308525aa61b1bd1eb16cc1fb11b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:58:05 +0200
Subject: [PATCH v20240617 47/56] review

---
 src/test/regress/expected/stats_ext.out | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index fd8df01f309..682d4ea176a 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3206,6 +3206,7 @@ SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_
      50000 |  50000
 (1 row)
 
+-- FIXME seems should have been in the preceding patch?
 -- test join with system column var, but the ext statistics can't be built in system attribute AND extended statistics
 -- must covers all the join columns, so the following 2 statements can use extended statistics for join.
 SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) and j1.cmin = j2.cmin');
-- 
2.45.2

v20240617-0048-Add-fastpath-when-combine-the-2-MCV-like-e.patchtext/x-patch; charset=UTF-8; name=v20240617-0048-Add-fastpath-when-combine-the-2-MCV-like-e.patchDownload

From 6f0d69e031da100b422ae3367780f02d21427451 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 16:59:04 +0200
Subject: [PATCH v20240617 48/56] Add fastpath when combine the 2 MCV like
 eqjoinsel_inner.

when MCV2 exactly matches clauses.
---
 src/backend/statistics/mcv.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index e1c5abf3148..4eec6592562 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2507,13 +2507,6 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			if (cmatches2 && !cmatches2[j])
 				continue;
 
-			/*
-			 * XXX We can't skip based on existing matches2 value, because
-			 * there may be duplicates in the first MCV.
-			 *
-			 * From Andy: what does this mean?
-			 */
-
 			/*
 			 * Evaluate if all the join clauses match between the two MCV
 			 * items.
@@ -2564,6 +2557,14 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				/* XXX Do we need to do something about base frequency? */
 				matches1[i] = matches2[j] = true;
 				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+				nmatches += 1;
+
+				if (mcv2->ndimensions == list_length(clauses))
+					/*
+					 * no more items in mcv2 could match mcv1[i] in this case,
+					 * so break fast.
+					 */
+					break;
 			}
 		}
 	}
-- 
2.45.2

v20240617-0049-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0049-review.patchDownload

From 0f57cfeecb2e0032f23c5098e5e80267dd0c3c4b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 17:02:32 +0200
Subject: [PATCH v20240617 49/56] review

---
 src/backend/statistics/mcv.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 4eec6592562..2f055fd1085 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2559,7 +2559,15 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
 				nmatches += 1;
 
+				/*
+				 * XXX Comment should be before the condition, and should
+				 * explain why there could be no more matches.
+				 *
+				 * XXX I'm not sure about this optimization, could there be a
+				 * case with two clauses matching the same MCV dimension?
+				 */
 				if (mcv2->ndimensions == list_length(clauses))
+
 					/*
 					 * no more items in mcv2 could match mcv1[i] in this case,
 					 * so break fast.
-- 
2.45.2

v20240617-0050-When-mcv-ndimensions-list_length-clauses-h.patchtext/x-patch; charset=UTF-8; name=v20240617-0050-When-mcv-ndimensions-list_length-clauses-h.patchDownload

From 20fd210d0b0ce690e351ee676f2855ad500305db Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 17:04:52 +0200
Subject: [PATCH v20240617 50/56] When mcv->ndimensions ==
 list_length(clauses), handle it same as

eqjoinsel_inner, but more testing doesn't show me any benefits from
it. just this commit is just FYI.
---
 src/backend/statistics/mcv.c | 55 ++++++++++++++++++++++--------------
 1 file changed, 34 insertions(+), 21 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 2f055fd1085..8910911d233 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2266,7 +2266,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	MCVList    *mcv1,
 			   *mcv2;
 	int			i,
-				j;
+		j,
+		nmatches = 0;
 	Selectivity s = 0;
 
 	/* match bitmaps and selectivity for baserel conditions (if any) */
@@ -2289,12 +2290,12 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				nd1,
 				totalsel1;
 
-	double		matchfreq2,
-				unmatchfreq2,
-				otherfreq2,
-				mcvfreq2,
-				nd2,
-				totalsel2;
+	double	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
 	McvProc    *mcvProc;
@@ -2661,24 +2662,36 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	nd2 *= csel2;
 
 	totalsel1 = s;
-	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
-	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
 
-/* 	if (nd2 > mcvb->nitems) */
-/* 		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems); */
-/* 	if (nd2 > nmatches) */
-/* 		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / */
-/* 			(nd2 - nmatches); */
+	if (mcv2->ndimensions == list_length(clauses))
+	{
+		if (nd2 > mcv2->nitems)
+			totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcv2->nitems);
+		if (nd2 > nmatches)
+			totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+				(nd2 - nmatches);
+	}
+	else
+	{
+		totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+	}
 
 	totalsel2 = s;
-	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
-	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
 
-/* 	if (nd1 > mcva->nitems) */
-/* 		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems); */
-/* 	if (nd1 > nmatches) */
-/* 		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / */
-/* 			(nd1 - nmatches); */
+	if (mcv1->ndimensions == list_length(clauses))
+	{
+		if (nd1 > mcv1->nitems)
+			totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcv1->nitems);
+		if (nd1 > nmatches)
+			totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+				(nd1 - nmatches);
+	}
+	else
+	{
+		totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+	}
 
 	s = Min(totalsel1, totalsel2);
 
-- 
2.45.2

v20240617-0051-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0051-review.patchDownload

From df5ceaf4c693c9b395c614c73300c039f094b3cb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 17:07:34 +0200
Subject: [PATCH v20240617 51/56] review

---
 src/backend/statistics/mcv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 8910911d233..696565eb0b8 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2663,6 +2663,7 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	totalsel1 = s;
 
+	/* FIXME same comments / concerns as for preceding patch */
 	if (mcv2->ndimensions == list_length(clauses))
 	{
 		if (nd2 > mcv2->nitems)
-- 
2.45.2

v20240617-0052-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0052-pgindent.patchDownload

From ad33475ff70429865b85c257d95ac3e9af8e4cc4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 17:11:24 +0200
Subject: [PATCH v20240617 52/56] pgindent

---
 src/backend/statistics/mcv.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 696565eb0b8..4e1a03a8c30 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -2266,8 +2266,8 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	MCVList    *mcv1,
 			   *mcv2;
 	int			i,
-		j,
-		nmatches = 0;
+				j,
+				nmatches = 0;
 	Selectivity s = 0;
 
 	/* match bitmaps and selectivity for baserel conditions (if any) */
@@ -2290,12 +2290,12 @@ mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 				nd1,
 				totalsel1;
 
-	double	matchfreq2,
-			unmatchfreq2,
-			otherfreq2,
-			mcvfreq2,
-			nd2,
-			totalsel2;
+	double		matchfreq2,
+				unmatchfreq2,
+				otherfreq2,
+				mcvfreq2,
+				nd2,
+				totalsel2;
 
 	/* info about clauses and how they match to MCV stats */
 	McvProc    *mcvProc;
-- 
2.45.2

v20240617-0053-Fix-typo-error-s-grantee-guarantee.patchtext/x-patch; charset=UTF-8; name=v20240617-0053-Fix-typo-error-s-grantee-guarantee.patchDownload

From 655d32d01b29ce5458ffb00c0ccd738a0ebbc357 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 17:16:15 +0200
Subject: [PATCH v20240617 53/56] Fix typo error, s/grantee/guarantee/.

---
 src/backend/optimizer/path/clausesel.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 206fe627e58..31dba9f7621 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -213,14 +213,14 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * to detect when this makes sense, but we can check that there are join
 	 * clauses, and that at least some of the rels have stats.
 	 *
-	 * rel != NULL can't grantee the clause is not a join clause, for example
-	 * t1 left join t2 ON t1.a = 3, but it can grantee we can't use extended
+	 * rel != NULL can't guarantee the clause is not a join clause, for example
+	 * t1 left join t2 ON t1.a = 3, but it can guarantee we can't use extended
 	 * statistics for estimation since it has only 1 relid.
 	 *
 	 * XXX Is that actually behaving like that? Won't the (t1.a=3) be turned
 	 * into a regular clause? I haven't tried, though.
 	 *
-	 * XXX: so we can grantee estimatedclauses == NULL now, so
+	 * XXX: so we can guarantee estimatedclauses == NULL now, so
 	 * estimatedclauses in statext_try_join_estimates is removed.
 	 *
 	 * XXX Maybe remove the comment and add an assert estimatedclauses==NULL.
-- 
2.45.2

v20240617-0054-clauselist_selectivity_hook.patchtext/x-patch; charset=UTF-8; name=v20240617-0054-clauselist_selectivity_hook.patchDownload

From 78ba919ff4a9228c9631885c037467e5438693b2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 12:46:19 +0200
Subject: [PATCH v20240617 54/56] clauselist_selectivity_hook

---
 src/backend/optimizer/path/clausesel.c  |  5 +++++
 src/backend/statistics/extended_stats.c |  2 +-
 src/backend/utils/adt/selfuncs.c        |  1 +
 src/include/statistics/statistics.h     |  8 ++++++++
 src/include/utils/selfuncs.h            | 11 +++++++++++
 5 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 31dba9f7621..b6a63a33da7 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -138,6 +138,11 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	if (clauses == NULL)
 		return 1.0;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+										 sjinfo, &estimatedclauses,
+										 use_extended_stats);
+
 	/*
 	 * Disable the single-clause optimization when estimating a join clause.
 	 *
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index d22d511b09a..1f24ce1fa89 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1715,7 +1715,7 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
  */
-static Selectivity
+Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
 								   RelOptInfo *rel, Bitmapset **estimatedclauses,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8e..ff98fda08c8 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index d1368a05833..7b6cadc4daa 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -104,6 +104,14 @@ extern void BuildRelationExtStatistics(Relation onerel, bool inh, double totalro
 extern int	ComputeExtStatisticsRows(Relation onerel,
 									 int natts, VacAttrStats **vacattrstats);
 extern bool statext_is_kind_built(HeapTuple htup, char type);
+extern Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root,
+													  List *clauses,
+													  int varRelid,
+													  JoinType jointype,
+													  SpecialJoinInfo *sjinfo,
+													   RelOptInfo *rel,
+													   Bitmapset **estimatedclauses,
+													   bool is_or);
 extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
 													   List *clauses,
 													   int varRelid,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb3..253f584c659 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,17 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 										   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef Selectivity (*clauselist_selectivity_hook_type) (
+												PlannerInfo *root,
+												List *clauses,
+												int varRelid,
+												JoinType jointype,
+												SpecialJoinInfo *sjinfo,
+												Bitmapset **estimatedclauses,
+												bool use_extended_stats);
+extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;
+
 /* Functions in selfuncs.c */
 
 extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,
-- 
2.45.2

v20240617-0055-review.patchtext/x-patch; charset=UTF-8; name=v20240617-0055-review.patchDownload

From 6cbe375b63f01ccc364d1b0d4ea63ae404ebb70d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 17:21:55 +0200
Subject: [PATCH v20240617 55/56] review

---
 src/backend/optimizer/path/clausesel.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index b6a63a33da7..6765b97afde 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -138,6 +138,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	if (clauses == NULL)
 		return 1.0;
 
+	/*
+	 * FIXME this really deserves some comment, no?
+	 */
 	if (clauselist_selectivity_hook)
 		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
 										 sjinfo, &estimatedclauses,
@@ -165,6 +168,8 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * to go against the idea of this check to be cheap. Moreover, it won't
 	 * work for OR clauses, which may have multiple parts but we still see
 	 * them as a single BoolExpr clause (it doesn't work later, though).
+	 *
+	 * XXX Shouldn't this also consider the estimatedclauses?
 	 */
 	if (list_length(clauses) == 1)
 	{
@@ -238,6 +243,10 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * estimator, and then apply some "correction" to the result?
 	 *
 	 * XXX Same thing for the joinType removal, I guess.
+	 *
+	 * XXX Isn't this broken if the hook estimates some of the clauses? We've
+	 * removed the bitmap from statext_try_join_estimates() on the grounds that
+	 * it's always NULL, but with the hook that's no longer the case.
 	 */
 	if (use_extended_stats && rel == NULL &&
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
-- 
2.45.2

v20240617-0056-pgindent.patchtext/x-patch; charset=UTF-8; name=v20240617-0056-pgindent.patchDownload

From 73fb4b74adee8f861c120f2825f23957d482f997 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 17 Jun 2024 17:23:54 +0200
Subject: [PATCH v20240617 56/56] pgindent

---
 src/backend/optimizer/path/clausesel.c | 10 +++++-----
 src/include/statistics/statistics.h    |  6 +++---
 src/include/utils/selfuncs.h           | 14 +++++++-------
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 6765b97afde..abc4815967d 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -223,9 +223,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * to detect when this makes sense, but we can check that there are join
 	 * clauses, and that at least some of the rels have stats.
 	 *
-	 * rel != NULL can't guarantee the clause is not a join clause, for example
-	 * t1 left join t2 ON t1.a = 3, but it can guarantee we can't use extended
-	 * statistics for estimation since it has only 1 relid.
+	 * rel != NULL can't guarantee the clause is not a join clause, for
+	 * example t1 left join t2 ON t1.a = 3, but it can guarantee we can't use
+	 * extended statistics for estimation since it has only 1 relid.
 	 *
 	 * XXX Is that actually behaving like that? Won't the (t1.a=3) be turned
 	 * into a regular clause? I haven't tried, though.
@@ -245,8 +245,8 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	 * XXX Same thing for the joinType removal, I guess.
 	 *
 	 * XXX Isn't this broken if the hook estimates some of the clauses? We've
-	 * removed the bitmap from statext_try_join_estimates() on the grounds that
-	 * it's always NULL, but with the hook that's no longer the case.
+	 * removed the bitmap from statext_try_join_estimates() on the grounds
+	 * that it's always NULL, but with the hook that's no longer the case.
 	 */
 	if (use_extended_stats && rel == NULL &&
 		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo))
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7b6cadc4daa..a2e49b4080e 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -109,9 +109,9 @@ extern Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root,
 													  int varRelid,
 													  JoinType jointype,
 													  SpecialJoinInfo *sjinfo,
-													   RelOptInfo *rel,
-													   Bitmapset **estimatedclauses,
-													   bool is_or);
+													  RelOptInfo *rel,
+													  Bitmapset **estimatedclauses,
+													  bool is_or);
 extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
 													   List *clauses,
 													   int varRelid,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 253f584c659..f2ad70743b5 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -150,13 +150,13 @@ extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
 /* Hooks for plugins to get control when we ask for selectivity estimation */
 typedef Selectivity (*clauselist_selectivity_hook_type) (
-												PlannerInfo *root,
-												List *clauses,
-												int varRelid,
-												JoinType jointype,
-												SpecialJoinInfo *sjinfo,
-												Bitmapset **estimatedclauses,
-												bool use_extended_stats);
+														 PlannerInfo *root,
+														 List *clauses,
+														 int varRelid,
+														 JoinType jointype,
+														 SpecialJoinInfo *sjinfo,
+														 Bitmapset **estimatedclauses,
+														 bool use_extended_stats);
 extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;
 
 /* Functions in selfuncs.c */
-- 
2.45.2

#31

Andrei Lepikhov

lepihov@gmail.com

over 1 year ago

In reply to: Tomas Vondra (#30)

Re: using extended statistics to improve join estimates

On 17/6/2024 18:10, Tomas Vondra wrote:

Let me quickly go through the original parts - most of this is already
in the "review" patches, but it's better to quote the main points here
to start a discussion. I'll omit some of the smaller suggestions, so
please look at the 'review' patches.

v20240617-0001-Estimate-joins-using-extended-statistics.patch

- rewords a couple comments, particularly for statext_find_matching_mcv

- a couple XXX comments about possibly stale/inaccurate comments
v20240617-0054-clauselist_selectivity_hook.patch

- I believe this does not work with the earlier patch that removed
estimatedclaused bitmap from the "try" function.

This patch set is too big to eat at once - it's just challenging to
invent examples and counterexamples. Can we see these two patches in the
master and analyse further improvements based on that?

Some thoughts:
You remove verRelid. I have thought about replacing this value with
RelOptInfo, which would allow extensions (remember selectivity hook) to
know about the underlying path tree.

The first patch is generally ok, and I vote for having it in the master.
However, the most harmful case I see most reports about is parameterised
JOIN on multiple anded clauses. In that case, we have a scan filter on
something like the below:
x = $1 AND y = $2 AND ...
As I see, current patch doesn't resolve this issue currently.

--
regards, Andrei Lepikhov

#32

Andrei Lepikhov

lepihov@gmail.com

over 1 year ago

In reply to: Andrei Lepikhov (#31)

1 attachment(s)

Re: using extended statistics to improve join estimates

On 3/9/2024 14:58, Andrei Lepikhov wrote:

On 17/6/2024 18:10, Tomas Vondra wrote:
x = $1 AND y = $2 AND ...
As I see, current patch doesn't resolve this issue currently.

Let's explain my previous argument with an example (see in attachment).

The query designed to be executed with parameterised NL join:

EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM test t1 NATURAL JOIN test1 t2 WHERE t2.x1 < 1;

After applying the topmost patch from the patchset we can see two
different estimations (explain tuned a little bit) before and after
extended statistics:

-- before:

Nested Loop (rows=1) (actual rows=10000 loops=1)
-> Seq Scan on test1 t2 (rows=100) (actual rows=100 loops=1)
Filter: (x1 < 1)
-> Memoize (rows=1) (actual rows=100 loops=100)
Cache Key: t2.x1, t2.x2, t2.x3, t2.x4
-> Index Scan using test_x1_x2_x3_x4_idx on test t1
(rows=1 width=404) (actual rows=100 loops=1)
Index Cond: ((x1 = t2.x1) AND (x2 = t2.x2) AND
(x3 = t2.x3) AND (x4 = t2.x4))

-- after:

Nested Loop (rows=10000) (actual rows=10000 loops=1)
-> Seq Scan on test1 t2 (rows=100) (actual rows=100 loops=1)
Filter: (x1 < 1)
-> Memoize (rows=1) (actual rows=100 loops=100)
Cache Key: t2.x1, t2.x2, t2.x3, t2.x4
-> Index Scan using test_x1_x2_x3_x4_idx on test t1 (rows=1)
(actual rows=100 loops=1)
Index Cond: ((x1 = t2.x1) AND (x2 = t2.x2) AND
(x3 = t2.x3) AND (x4 = t2.x4))

You can see, that index condition was treated as join clause and PNL
estimated correctly by an MCV on both sides.
But scan estimation is incorrect.
Moreover, sometimes we don't have MCV at all. And the next step for this
patch should be implementation of bare estimation by the only ndistinct
on each side.

What to do with the scan filter? Not sure so far, but it looks like here
may be used the logic similar to var_eq_non_const().

--
regards, Andrei Lepikhov

#33

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

7 months ago

In reply to: Tomas Vondra (#30)

Re: using extended statistics to improve join estimates

Hi hackers

Еhank you for your work.

Let me start my review from the top — specifically, in clausesel.c, the
function clauselist_selectivity_ext():

1. About check clauses == NULL. In my opinion, this check should be
kept. This issue has already been discussed previously[0]/messages/by-id/016e33b7-2830-4300-bc89-e7ce9e613bad@tantorlabs.com, and I think
it's better to keep the safety check.

2. I noticed that the patch applies extended statistics to OR clauses as
well. There's an example from regression tests illustrating this:

Before applying ext stats:
SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join
join_test_2 j2 on ((j1.a + 1 = j2.a + 1) or (j1.b = j2.b))');
estimated | actual
-----------+--------
104500 | 100000

After applying ext stats:
SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join
join_test_2 j2 on ((j1.a + 1 = j2.a + 1) or (j1.b = j2.b))');
estimated | actual
-----------+--------
190000 | 100000
(1 row)

I agree that, at least for now, we should focus solely on AND clauses.
To do that, we should impose the same restriction in
clauselist_selectivity_or() as we already do in
clauselist_selectivity_ext().

What do you think? Or shall we consider OR-clauses as well?

[0]: /messages/by-id/016e33b7-2830-4300-bc89-e7ce9e613bad@tantorlabs.com
/messages/by-id/016e33b7-2830-4300-bc89-e7ce9e613bad@tantorlabs.com

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.