Improve selectivity estimate for range queries

Started by Yuzuko Hosoya · about 7 years ago · 12 messages
#1 Yuzuko Hosoya
hosoya.yuzuko@lab.ntt.co.jp
1 attachment(s)

Hi,

I found a problem with the selectivity estimate for range queries
on master. Here is my test case:

postgres=# CREATE TABLE test(id int, val text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,29999)i;
INSERT 0 30000
postgres=# VACUUM analyze test;
VACUUM
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 10000;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows=9999 loops=1)
   Filter: ((id > 0) AND (id < 10000))
   Rows Removed by Filter: 20001
 Planning Time: 0.099 ms
 Execution Time: 37.142 ms
(5 rows)

Although the actual number of rows was 9999, the estimated number
of rows was 150.

So I dug in to see what happened, and found that the following part
of clauselist_selectivity caused the problem.

------------
/*
* Now scan the rangequery pair list.
*/
while (rqlist != NULL)
{
RangeQueryClause *rqnext;

if (rqlist->have_lobound && rqlist->have_hibound)
{
/* Successfully matched a pair of range clauses */
Selectivity s2;

/*
* Exact equality to the default value probably means the
* selectivity function punted. This is not airtight but should
* be good enough.
*/
if (rqlist->hibound == DEFAULT_INEQ_SEL ||
rqlist->lobound == DEFAULT_INEQ_SEL)
{
s2 = DEFAULT_RANGE_INEQ_SEL;
}
else
{
s2 = rqlist->hibound + rqlist->lobound - 1.0;
------------

In my environment, the selectivity for id > 0 was 0.99990000000000001,
and the selectivity for id < 10000 was 0.33333333333333331, and
rqlist->hibound and rqlist->lobound were set to these values. On the
other hand, DEFAULT_INEQ_SEL is 0.3333333333333333, which is the very
same double value. As a result, the final selectivity was computed as
DEFAULT_RANGE_INEQ_SEL, since the condition of the second if statement
was satisfied. However, these selectivities were computed from the
statistics, so the final selectivity should have been calculated as
rqlist->hibound + rqlist->lobound - 1.0.

My test case might be uncommon, but I think it could cause some problems.
To handle such cases, I came up with an idea based on a previous commit [1]
which solved a similar problem related to DEFAULT_NUM_DISTINCT. Just like
that modification, I added to clauselist_selectivity a flag which shows
whether the selectivity was calculated from the statistics or not, and
used it as the condition of the if statement instead of
rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
I am afraid the changes are larger than I had expected, but with them we
can estimate the selectivity correctly.

Patch attached. Do you have any thoughts?

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

improve_selectivity_estimate_for_range_queries.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index f4717942c3..1b9ae2e901 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -37,10 +37,13 @@ typedef struct RangeQueryClause
 	bool		have_hibound;	/* found a high-bound clause yet? */
 	Selectivity lobound;		/* Selectivity of a var > something clause */
 	Selectivity hibound;		/* Selectivity of a var < something clause */
+	bool		lobound_isdefault;	/* lobound is a default selectivity? */
+	bool		hibound_isdefault;	/* hibound is a default selectivity? */
 } RangeQueryClause;
 
 static void addRangeClause(RangeQueryClause **rqlist, Node *clause,
-			   bool varonleft, bool isLTsel, Selectivity s2);
+			   bool varonleft, bool isLTsel, Selectivity s2,
+			   bool isdefault);
 static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
 							List *clauses);
 
@@ -108,6 +111,7 @@ clauselist_selectivity(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		isdefault;
 
 	/*
 	 * If there's exactly one clause, just go directly to
@@ -115,7 +119,7 @@ clauselist_selectivity(PlannerInfo *root,
 	 */
 	if (list_length(clauses) == 1)
 		return clause_selectivity(root, (Node *) linitial(clauses),
-								  varRelid, jointype, sjinfo);
+								  varRelid, jointype, sjinfo, &isdefault);
 
 	/*
 	 * Determine if these clauses reference a single relation.  If so, and if
@@ -165,7 +169,8 @@ clauselist_selectivity(PlannerInfo *root,
 			continue;
 
 		/* Always compute the selectivity using clause_selectivity */
-		s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
+		s2 = clause_selectivity(root, clause, varRelid, jointype,
+								sjinfo, &isdefault);
 
 		/*
 		 * Check for being passed a RestrictInfo.
@@ -227,12 +232,12 @@ clauselist_selectivity(PlannerInfo *root,
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
 						addRangeClause(&rqlist, clause,
-									   varonleft, true, s2);
+									   varonleft, true, s2, isdefault);
 						break;
 					case F_SCALARGTSEL:
 					case F_SCALARGESEL:
 						addRangeClause(&rqlist, clause,
-									   varonleft, false, s2);
+									   varonleft, false, s2, isdefault);
 						break;
 					default:
 						/* Just merge the selectivity in generically */
@@ -260,12 +265,11 @@ clauselist_selectivity(PlannerInfo *root,
 			Selectivity s2;
 
 			/*
-			 * Exact equality to the default value probably means the
-			 * selectivity function punted.  This is not airtight but should
-			 * be good enough.
+			 * If either selectivity is default, we use a default estimate.
+			 * This is not airtight but should be good enough.
 			 */
-			if (rqlist->hibound == DEFAULT_INEQ_SEL ||
-				rqlist->lobound == DEFAULT_INEQ_SEL)
+			if (rqlist->hibound_isdefault ||
+				rqlist->lobound_isdefault)
 			{
 				s2 = DEFAULT_RANGE_INEQ_SEL;
 			}
@@ -332,7 +336,8 @@ clauselist_selectivity(PlannerInfo *root,
  */
 static void
 addRangeClause(RangeQueryClause **rqlist, Node *clause,
-			   bool varonleft, bool isLTsel, Selectivity s2)
+			   bool varonleft, bool isLTsel, Selectivity s2,
+			   bool isdefault)
 {
 	RangeQueryClause *rqelem;
 	Node	   *var;
@@ -377,6 +382,8 @@ addRangeClause(RangeQueryClause **rqlist, Node *clause,
 				if (rqelem->lobound > s2)
 					rqelem->lobound = s2;
 			}
+			if (isdefault)
+				rqelem->lobound_isdefault = true;
 		}
 		else
 		{
@@ -397,6 +404,8 @@ addRangeClause(RangeQueryClause **rqlist, Node *clause,
 				if (rqelem->hibound > s2)
 					rqelem->hibound = s2;
 			}
+			if (isdefault)
+				rqelem->hibound_isdefault = true;
 		}
 		return;
 	}
@@ -404,17 +413,24 @@ addRangeClause(RangeQueryClause **rqlist, Node *clause,
 	/* No matching var found, so make a new clause-pair data structure */
 	rqelem = (RangeQueryClause *) palloc(sizeof(RangeQueryClause));
 	rqelem->var = var;
+	rqelem->lobound_isdefault = false;
+	rqelem->hibound_isdefault = false;
+
 	if (is_lobound)
 	{
 		rqelem->have_lobound = true;
 		rqelem->have_hibound = false;
 		rqelem->lobound = s2;
+		if (isdefault)
+			rqelem->lobound_isdefault = true;
 	}
 	else
 	{
 		rqelem->have_lobound = false;
 		rqelem->have_hibound = true;
 		rqelem->hibound = s2;
+		if (isdefault)
+			rqelem->lobound_isdefault = true;
 	}
 	rqelem->next = *rqlist;
 	*rqlist = rqelem;
@@ -575,7 +591,8 @@ clause_selectivity(PlannerInfo *root,
 				   Node *clause,
 				   int varRelid,
 				   JoinType jointype,
-				   SpecialJoinInfo *sjinfo)
+				   SpecialJoinInfo *sjinfo,
+				   bool *isdefault)
 {
 	Selectivity s1 = 0.5;		/* default for any unhandled clause type */
 	RestrictInfo *rinfo = NULL;
@@ -695,7 +712,7 @@ clause_selectivity(PlannerInfo *root,
 									  (Node *) get_notclausearg((Expr *) clause),
 									  varRelid,
 									  jointype,
-									  sjinfo);
+									  sjinfo, isdefault);
 	}
 	else if (and_clause(clause))
 	{
@@ -723,7 +740,7 @@ clause_selectivity(PlannerInfo *root,
 												(Node *) lfirst(arg),
 												varRelid,
 												jointype,
-												sjinfo);
+												sjinfo, isdefault);
 
 			s1 = s1 + s2 - s1 * s2;
 		}
@@ -748,7 +765,7 @@ clause_selectivity(PlannerInfo *root,
 			s1 = restriction_selectivity(root, opno,
 										 opclause->args,
 										 opclause->inputcollid,
-										 varRelid);
+										 varRelid, isdefault);
 		}
 
 		/*
@@ -816,7 +833,7 @@ clause_selectivity(PlannerInfo *root,
 								(Node *) ((RelabelType *) clause)->arg,
 								varRelid,
 								jointype,
-								sjinfo);
+								sjinfo, isdefault);
 	}
 	else if (IsA(clause, CoerceToDomain))
 	{
@@ -825,7 +842,7 @@ clause_selectivity(PlannerInfo *root,
 								(Node *) ((CoerceToDomain *) clause)->arg,
 								varRelid,
 								jointype,
-								sjinfo);
+								sjinfo, isdefault);
 	}
 	else
 	{
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 480fd250e9..c181ef0ea6 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4246,6 +4246,7 @@ approx_tuple_count(PlannerInfo *root, JoinPath *path, List *quals)
 	SpecialJoinInfo sjinfo;
 	Selectivity selec = 1.0;
 	ListCell   *l;
+	bool   isdefault;
 
 	/*
 	 * Make up a SpecialJoinInfo for JOIN_INNER semantics.
@@ -4270,7 +4271,7 @@ approx_tuple_count(PlannerInfo *root, JoinPath *path, List *quals)
 		Node	   *qual = (Node *) lfirst(l);
 
 		/* Note that clause_selectivity will be able to cache its result */
-		selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo);
+		selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo, &isdefault);
 	}
 
 	/* Apply it to the input relation sizes */
diff --git a/src/backend/optimizer/util/orclauses.c b/src/backend/optimizer/util/orclauses.c
index 1e78028abe..c92eeae824 100644
--- a/src/backend/optimizer/util/orclauses.c
+++ b/src/backend/optimizer/util/orclauses.c
@@ -260,7 +260,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
 	RestrictInfo *or_rinfo;
 	Selectivity or_selec,
 				orig_selec;
-
+	bool  isdefault;
 	/*
 	 * Build a RestrictInfo from the new OR clause.  We can assume it's valid
 	 * as a base restriction clause.
@@ -280,7 +280,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
 	 * saving work later.)
 	 */
 	or_selec = clause_selectivity(root, (Node *) or_rinfo,
-								  0, JOIN_INNER, NULL);
+								  0, JOIN_INNER, NULL, &isdefault);
 
 	/*
 	 * The clause is only worth adding to the query if it rejects a useful
@@ -344,7 +344,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Compute inner-join size */
 		orig_selec = clause_selectivity(root, (Node *) join_or_rinfo,
-										0, JOIN_INNER, &sjinfo);
+										0, JOIN_INNER, &sjinfo, &isdefault);
 
 		/* And hack cached selectivity so join size remains the same */
 		join_or_rinfo->norm_selec = orig_selec / or_selec;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index a570ac0aab..cb7595e69b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1740,7 +1740,8 @@ restriction_selectivity(PlannerInfo *root,
 						Oid operatorid,
 						List *args,
 						Oid inputcollid,
-						int varRelid)
+						int varRelid,
+						bool *isdefault)
 {
 	RegProcedure oprrest = get_oprrest(operatorid);
 	float8		result;
@@ -1752,12 +1753,13 @@ restriction_selectivity(PlannerInfo *root,
 	if (!oprrest)
 		return (Selectivity) 0.5;
 
-	result = DatumGetFloat8(OidFunctionCall4Coll(oprrest,
+	result = DatumGetFloat8(OidFunctionCall5Coll(oprrest,
 												 inputcollid,
 												 PointerGetDatum(root),
 												 ObjectIdGetDatum(operatorid),
 												 PointerGetDatum(args),
-												 Int32GetDatum(varRelid)));
+												 Int32GetDatum(varRelid),
+												 PointerGetDatum(isdefault)));
 
 	if (result < 0.0 || result > 1.0)
 		elog(ERROR, "invalid restriction selectivity: %f", result);
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 58d0df20f6..1cee5d213a 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -954,6 +954,7 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
 	MVDependencies *dependencies;
 	AttrNumber *list_attnums;
 	int			listidx;
+	bool  isdefault;
 
 	/* initialize output argument */
 	*estimatedclauses = NULL;
@@ -1068,7 +1069,7 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
 				clause = (Node *) lfirst(l);
 
 				s2 = clause_selectivity(root, clause, varRelid, jointype,
-										sjinfo);
+										sjinfo, &isdefault);
 
 				/* mark this one as done, so we don't touch it again. */
 				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c3db9ea070..378619b8c2 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -573,7 +573,8 @@ neqsel(PG_FUNCTION_ARGS)
  */
 static double
 scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
-			  VariableStatData *vardata, Datum constval, Oid consttype)
+			  VariableStatData *vardata, Datum constval, Oid consttype,
+			  bool *isdefault)
 {
 	Form_pg_statistic stats;
 	FmgrInfo	opproc;
@@ -582,8 +583,11 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
 				sumcommon;
 	double		selec;
 
+	*isdefault = false;
+
 	if (!HeapTupleIsValid(vardata->statsTuple))
 	{
+		*isdefault = true;
 		/* no stats available, so default result */
 		return DEFAULT_INEQ_SEL;
 	}
@@ -1101,6 +1105,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 	Oid			operator = PG_GETARG_OID(1);
 	List	   *args = (List *) PG_GETARG_POINTER(2);
 	int			varRelid = PG_GETARG_INT32(3);
+	bool       *isdefault = (bool *) PG_GETARG_POINTER(4);
 	VariableStatData vardata;
 	Node	   *other;
 	bool		varonleft;
@@ -1114,13 +1119,17 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 	 */
 	if (!get_restriction_variable(root, args, varRelid,
 								  &vardata, &other, &varonleft))
+	{
+		*isdefault = true;
 		PG_RETURN_FLOAT8(DEFAULT_INEQ_SEL);
+	}
 
 	/*
 	 * Can't do anything useful if the something is not a constant, either.
 	 */
 	if (!IsA(other, Const))
 	{
+		*isdefault = true;
 		ReleaseVariableStats(vardata);
 		PG_RETURN_FLOAT8(DEFAULT_INEQ_SEL);
 	}
@@ -1145,6 +1154,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 		operator = get_commutator(operator);
 		if (!operator)
 		{
+			*isdefault = true;
 			/* Use default selectivity (should we raise an error instead?) */
 			ReleaseVariableStats(vardata);
 			PG_RETURN_FLOAT8(DEFAULT_INEQ_SEL);
@@ -1154,7 +1164,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 
 	/* The rest of the work is done by scalarineqsel(). */
 	selec = scalarineqsel(root, operator, isgt, iseq,
-						  &vardata, constval, consttype);
+						  &vardata, constval, consttype, isdefault);
 
 	ReleaseVariableStats(vardata);
 
@@ -1607,6 +1617,7 @@ booltestsel(PlannerInfo *root, BoolTestType booltesttype, Node *arg,
 {
 	VariableStatData vardata;
 	double		selec;
+	bool   isdefault;
 
 	examine_variable(root, arg, varRelid, &vardata);
 
@@ -1731,14 +1742,14 @@ booltestsel(PlannerInfo *root, BoolTestType booltesttype, Node *arg,
 			case IS_TRUE:
 			case IS_NOT_FALSE:
 				selec = (double) clause_selectivity(root, arg,
-													varRelid,
-													jointype, sjinfo);
+													varRelid, jointype,
+													sjinfo, &isdefault);
 				break;
 			case IS_FALSE:
 			case IS_NOT_TRUE:
 				selec = 1.0 - (double) clause_selectivity(root, arg,
-														  varRelid,
-														  jointype, sjinfo);
+														  varRelid, jointype,
+														  sjinfo, &isdefault);
 				break;
 			default:
 				elog(ERROR, "unrecognized booltesttype: %d",
@@ -2235,6 +2246,7 @@ rowcomparesel(PlannerInfo *root,
 	Oid			inputcollid = linitial_oid(clause->inputcollids);
 	List	   *opargs;
 	bool		is_join_clause;
+	bool        isdefault;
 
 	/* Build equivalent arg list for single operator */
 	opargs = list_make2(linitial(clause->largs), linitial(clause->rargs));
@@ -2283,7 +2295,7 @@ rowcomparesel(PlannerInfo *root,
 		s1 = restriction_selectivity(root, opno,
 									 opargs,
 									 inputcollid,
-									 varRelid);
+									 varRelid, &isdefault);
 	}
 
 	return s1;
@@ -3039,6 +3051,7 @@ mergejoinscansel(PlannerInfo *root, Node *clause,
 				rightmin,
 				rightmax;
 	double		selec;
+	bool        isdefault;
 
 	/* Set default results if we can't figure anything out. */
 	/* XXX should default "start" fraction be a bit more than 0? */
@@ -3206,13 +3219,13 @@ mergejoinscansel(PlannerInfo *root, Node *clause,
 	 * non-default estimates, else stick with our 1.0.
 	 */
 	selec = scalarineqsel(root, leop, isgt, true, &leftvar,
-						  rightmax, op_righttype);
+						  rightmax, op_righttype, &isdefault);
 	if (selec != DEFAULT_INEQ_SEL)
 		*leftend = selec;
 
 	/* And similarly for the right variable. */
 	selec = scalarineqsel(root, revleop, isgt, true, &rightvar,
-						  leftmax, op_lefttype);
+						  leftmax, op_lefttype, &isdefault);
 	if (selec != DEFAULT_INEQ_SEL)
 		*rightend = selec;
 
@@ -3236,13 +3249,13 @@ mergejoinscansel(PlannerInfo *root, Node *clause,
 	 * our own default.
 	 */
 	selec = scalarineqsel(root, ltop, isgt, false, &leftvar,
-						  rightmin, op_righttype);
+						  rightmin, op_righttype, &isdefault);
 	if (selec != DEFAULT_INEQ_SEL)
 		*leftstart = selec;
 
 	/* And similarly for the right variable. */
 	selec = scalarineqsel(root, revltop, isgt, false, &rightvar,
-						  leftmin, op_lefttype);
+						  leftmin, op_lefttype, &isdefault);
 	if (selec != DEFAULT_INEQ_SEL)
 		*rightstart = selec;
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..19382b8a28 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -214,7 +214,8 @@ extern Selectivity clause_selectivity(PlannerInfo *root,
 				   Node *clause,
 				   int varRelid,
 				   JoinType jointype,
-				   SpecialJoinInfo *sjinfo);
+				   SpecialJoinInfo *sjinfo,
+				   bool *isdefault);
 extern void cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 				  RelOptInfo *rel, ParamPathInfo *param_info,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index 7d53cbbb87..2cb077e6ff 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -46,7 +46,8 @@ extern Selectivity restriction_selectivity(PlannerInfo *root,
 						Oid operatorid,
 						List *args,
 						Oid inputcollid,
-						int varRelid);
+						int varRelid,
+						bool *isdefault);
 
 extern Selectivity join_selectivity(PlannerInfo *root,
 				 Oid operatorid,
#2 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Yuzuko Hosoya (#1)
Re: Improve selectivity estimate for range queries

Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>

> In my environment, the selectivity for id > 0 was 0.99990000000000001,
> and the selectivity for id < 10000 was 0.33333333333333331, and
> rqlist->hibound and rqlist->lobound were set to these values. On the
> other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result, the
> final selectivity was computed as DEFAULT_RANGE_INEQ_SEL, since the
> condition of the second if statement was satisfied. However, these
> selectivities were computed from the statistics, so the final
> selectivity should have been calculated as
> rqlist->hibound + rqlist->lobound - 1.0.
>
> My test case might be uncommon, but I think it would cause some problems.

Yeah, it's known behavior, as the comment just above says. If in your
example query the table were very large and had an index, it could
faultily run an index scan over a fair amount of tuples. But I'm not
sure how much it matters, and your example doesn't show that.

The behavior discussed here was introduced by this commit:

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane <tgl@sss.pgh.pa.us>
| Date: Tue Nov 9 00:34:46 2004 +0000
|
| Use a hopefully-more-reliable method of detecting default selectivity
| estimates when combining the estimates for a range query. As pointed out
| by Miquel van Smoorenburg, the existing check for an impossible combined
| result would quite possibly fail to detect one default and one non-default
| input. It seems better to use the default range query estimate in such
| cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
| This is a bit ugly because it introduces additional coupling between
| clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
| there wasn't plenty already...

> To handle such cases I've thought up of an idea based on a previous
> commit [1] which solved a similar problem related to
> DEFAULT_NUM_DISTINCT. Just like this modification, I added a flag
> which shows whether the selectivity
The commit [1] added almost no complexity, but this one does. It
affects many functions and introduces tighter coupling between
operator-selectivity functions and clause-selectivity functions.
isdefault travels alone too long just to bind remote functions. We
might need a more principled way to handle that.

> was calculated according to the statistics or not to clauselist_selectivity,
> and used it as a condition of the if statement instead of
> rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> I am afraid that changes were more than I had expected, but we can estimate
> selectivity correctly.
>
> Patch attached. Do you have any thoughts?

I've just run over the patch, but I have some comments.

+ if (isdefault)
+ rqelem->lobound_isdefault = true;

This takes the disjunction of all the lobounds ever added; I suppose it
should be the last lobound's isdefault.

As you see, four other instances of "selec ==/!= DEFAULT_*" exist in
mergejoinscansel. Don't they lead to a similar kind of problem?

     Selectivity lobound;        /* Selectivity of a var > something clause */
     Selectivity hibound;        /* Selectivity of a var < something clause */
+    bool        lobound_isdefault;    /* lobound is a default selectivity? */
+    bool        hibound_isdefault;    /* hibound is a default selectivity? */
 } RangeQueryClause;

Around [1], there was a discussion about introducing the notion of
uncertainty to selectivity. The isdefault flag is a kind of uncertainty
value indicating '0% or 100% certain'. So my feeling is that it would
be better for Selectivity to have a certainty component, and then to
build selectivity arithmetic involving uncertainty, though I don't have
a clear picture.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Yuzuko Hosoya
hosoya.yuzuko@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#2)
1 attachment(s)
RE: Improve selectivity estimate for range queries

Hi,

Thanks for the comments.
I attach the v2 patch.

From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Friday, December 21, 2018 12:25 PM

> Hello.
>
> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
>
> > In my environment, the selectivity for id > 0 was 0.99990000000000001,
> > and the selectivity for id < 10000 was 0.33333333333333331, and
> > rqlist->hibound and rqlist->lobound were set to these values. On the
> > other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result, the
> > final selectivity was computed as DEFAULT_RANGE_INEQ_SEL, since the
> > condition of the second if statement was satisfied. However, these
> > selectivities were computed from the statistics, so the final
> > selectivity should have been calculated as
> > rqlist->hibound + rqlist->lobound - 1.0.
> >
> > My test case might be uncommon, but I think it would cause some problems.
>
> Yeah, it's known behavior, as the comment just above says. If in your
> example query the table were very large and had an index, it could
> faultily run an index scan over a fair amount of tuples. But I'm not
> sure how much it matters, and your example doesn't show that.

> The behavior discussed here was introduced by this commit:
>
> | commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
> | Author: Tom Lane <tgl@sss.pgh.pa.us>
> | Date: Tue Nov 9 00:34:46 2004 +0000
> |
> | Use a hopefully-more-reliable method of detecting default selectivity
> | estimates when combining the estimates for a range query. As pointed out
> | by Miquel van Smoorenburg, the existing check for an impossible combined
> | result would quite possibly fail to detect one default and one non-default
> | input. It seems better to use the default range query estimate in such
> | cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
> | This is a bit ugly because it introduces additional coupling between
> | clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
> | there wasn't plenty already...

> > To handle such cases I've thought up of an idea based on a previous
> > commit [1] which solved a similar problem related to
> > DEFAULT_NUM_DISTINCT. Just like this modification, I added a flag
> > which shows whether the selectivity
>
> The commit [1] added almost no complexity, but this one does. It
> affects many functions and introduces tighter coupling between
> operator-selectivity functions and clause-selectivity functions.
> isdefault travels alone too long just to bind remote functions. We
> might need a more principled way to handle that.

Yes, indeed. But I have not found a better way than this approach yet.

> > was calculated according to the statistics or not to
> > clauselist_selectivity, and used it as a condition of the if statement
> > instead of rqlist->hibound == DEFAULT_INEQ_SEL and
> > rqlist->lobound == DEFAULT_INEQ_SEL. I am afraid that changes were
> > more than I had expected, but we can estimate selectivity correctly.
> >
> > Patch attached. Do you have any thoughts?
>
> I've just run over the patch, but I have some comments.
>
> + if (isdefault)
> + rqelem->lobound_isdefault = true;
>
> This takes the disjunction of all the lobounds ever added; I suppose it
> should be the last lobound's isdefault.

Indeed. I fixed it.

> As you see, four other instances of "selec ==/!= DEFAULT_*" exist in
> mergejoinscansel. Don't they lead to a similar kind of problem?

I didn't encounter similar problems, but as you said, such problems could
occur due to the condition selec != DEFAULT_INEQ_SEL. I changed these
conditions to "!isdefault".

>      Selectivity lobound;        /* Selectivity of a var > something clause */
>      Selectivity hibound;        /* Selectivity of a var < something clause */
> +    bool        lobound_isdefault;    /* lobound is a default selectivity? */
> +    bool        hibound_isdefault;    /* hibound is a default selectivity? */
>  } RangeQueryClause;

> Around [1], there was a discussion about introducing the notion of
> uncertainty to selectivity. The isdefault flag is a kind of uncertainty
> value indicating '0% or 100% certain'. So my feeling is that it would
> be better for Selectivity to have a certainty component, and then to
> build selectivity arithmetic involving uncertainty, though I don't
> have a clear picture.

I see. Indeed, if we changed Selectivity so that it includes
information about default/non-default, such problems as I mentioned
would be solved. But I'm afraid that such a modification would have a
big impact, since there are lots of functions that calculate
selectivity. I also don't have a concrete idea, so I will think about
it. Besides that, I'd like to ask the community whether such a
modification would be acceptable or not.

Best regards,

Yuzuko Hosoya
NTT Open Source Software Center


> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

Attachments:

v2_improve_selectivity_estimate_for_range_queries.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index f4717942c3..c557614b71 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -37,10 +37,13 @@ typedef struct RangeQueryClause
 	bool		have_hibound;	/* found a high-bound clause yet? */
 	Selectivity lobound;		/* Selectivity of a var > something clause */
 	Selectivity hibound;		/* Selectivity of a var < something clause */
+	bool		lobound_isdefault;	/* lobound is a default selectivity? */
+	bool		hibound_isdefault;	/* hibound is a default selectivity? */
 } RangeQueryClause;
 
 static void addRangeClause(RangeQueryClause **rqlist, Node *clause,
-			   bool varonleft, bool isLTsel, Selectivity s2);
+			   bool varonleft, bool isLTsel, Selectivity s2,
+			   bool isdefault);
 static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
 							List *clauses);
 
@@ -108,6 +111,7 @@ clauselist_selectivity(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		isdefault;
 
 	/*
 	 * If there's exactly one clause, just go directly to
@@ -115,7 +119,7 @@ clauselist_selectivity(PlannerInfo *root,
 	 */
 	if (list_length(clauses) == 1)
 		return clause_selectivity(root, (Node *) linitial(clauses),
-								  varRelid, jointype, sjinfo);
+								  varRelid, jointype, sjinfo, &isdefault);
 
 	/*
 	 * Determine if these clauses reference a single relation.  If so, and if
@@ -165,7 +169,8 @@ clauselist_selectivity(PlannerInfo *root,
 			continue;
 
 		/* Always compute the selectivity using clause_selectivity */
-		s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
+		s2 = clause_selectivity(root, clause, varRelid, jointype,
+								sjinfo, &isdefault);
 
 		/*
 		 * Check for being passed a RestrictInfo.
@@ -227,12 +232,12 @@ clauselist_selectivity(PlannerInfo *root,
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
 						addRangeClause(&rqlist, clause,
-									   varonleft, true, s2);
+									   varonleft, true, s2, isdefault);
 						break;
 					case F_SCALARGTSEL:
 					case F_SCALARGESEL:
 						addRangeClause(&rqlist, clause,
-									   varonleft, false, s2);
+									   varonleft, false, s2, isdefault);
 						break;
 					default:
 						/* Just merge the selectivity in generically */
@@ -260,12 +265,11 @@ clauselist_selectivity(PlannerInfo *root,
 			Selectivity s2;
 
 			/*
-			 * Exact equality to the default value probably means the
-			 * selectivity function punted.  This is not airtight but should
-			 * be good enough.
+			 * If either selectivity is default, we use a default estimate.
+			 * This is not airtight but should be good enough.
 			 */
-			if (rqlist->hibound == DEFAULT_INEQ_SEL ||
-				rqlist->lobound == DEFAULT_INEQ_SEL)
+			if (rqlist->hibound_isdefault ||
+				rqlist->lobound_isdefault)
 			{
 				s2 = DEFAULT_RANGE_INEQ_SEL;
 			}
@@ -332,7 +336,8 @@ clauselist_selectivity(PlannerInfo *root,
  */
 static void
 addRangeClause(RangeQueryClause **rqlist, Node *clause,
-			   bool varonleft, bool isLTsel, Selectivity s2)
+			   bool varonleft, bool isLTsel, Selectivity s2,
+			   bool isdefault)
 {
 	RangeQueryClause *rqelem;
 	Node	   *var;
@@ -364,6 +369,8 @@ addRangeClause(RangeQueryClause **rqlist, Node *clause,
 			{
 				rqelem->have_lobound = true;
 				rqelem->lobound = s2;
+				if (isdefault)
+					rqelem->lobound_isdefault = true;
 			}
 			else
 			{
@@ -375,7 +382,13 @@ addRangeClause(RangeQueryClause **rqlist, Node *clause,
 				 *------
 				 */
 				if (rqelem->lobound > s2)
+				{
 					rqelem->lobound = s2;
+					if (isdefault)
+						rqelem->lobound_isdefault = true;
+					else
+						rqelem->lobound_isdefault = false;
+				}
 			}
 		}
 		else
@@ -384,6 +397,8 @@ addRangeClause(RangeQueryClause **rqlist, Node *clause,
 			{
 				rqelem->have_hibound = true;
 				rqelem->hibound = s2;
+				if (isdefault)
+					rqelem->hibound_isdefault = true;
 			}
 			else
 			{
@@ -395,7 +410,13 @@ addRangeClause(RangeQueryClause **rqlist, Node *clause,
 				 *------
 				 */
 				if (rqelem->hibound > s2)
+				{
 					rqelem->hibound = s2;
+					if (isdefault)
+						rqelem->hibound_isdefault = true;
+					else
+						rqelem->hibound_isdefault = false;
+				}
 			}
 		}
 		return;
@@ -404,17 +425,24 @@ addRangeClause(RangeQueryClause **rqlist, Node *clause,
 	/* No matching var found, so make a new clause-pair data structure */
 	rqelem = (RangeQueryClause *) palloc(sizeof(RangeQueryClause));
 	rqelem->var = var;
+	rqelem->lobound_isdefault = false;
+	rqelem->hibound_isdefault = false;
+
 	if (is_lobound)
 	{
 		rqelem->have_lobound = true;
 		rqelem->have_hibound = false;
 		rqelem->lobound = s2;
+		if (isdefault)
+			rqelem->lobound_isdefault = true;
 	}
 	else
 	{
 		rqelem->have_lobound = false;
 		rqelem->have_hibound = true;
 		rqelem->hibound = s2;
+		if (isdefault)
+			rqelem->hibound_isdefault = true;
 	}
 	rqelem->next = *rqlist;
 	*rqlist = rqelem;
@@ -575,7 +603,8 @@ clause_selectivity(PlannerInfo *root,
 				   Node *clause,
 				   int varRelid,
 				   JoinType jointype,
-				   SpecialJoinInfo *sjinfo)
+				   SpecialJoinInfo *sjinfo,
+				   bool *isdefault)
 {
 	Selectivity s1 = 0.5;		/* default for any unhandled clause type */
 	RestrictInfo *rinfo = NULL;
@@ -695,7 +724,7 @@ clause_selectivity(PlannerInfo *root,
 									  (Node *) get_notclausearg((Expr *) clause),
 									  varRelid,
 									  jointype,
-									  sjinfo);
+									  sjinfo, isdefault);
 	}
 	else if (and_clause(clause))
 	{
@@ -723,7 +752,7 @@ clause_selectivity(PlannerInfo *root,
 												(Node *) lfirst(arg),
 												varRelid,
 												jointype,
-												sjinfo);
+												sjinfo, isdefault);
 
 			s1 = s1 + s2 - s1 * s2;
 		}
@@ -748,7 +777,7 @@ clause_selectivity(PlannerInfo *root,
 			s1 = restriction_selectivity(root, opno,
 										 opclause->args,
 										 opclause->inputcollid,
-										 varRelid);
+										 varRelid, isdefault);
 		}
 
 		/*
@@ -816,7 +845,7 @@ clause_selectivity(PlannerInfo *root,
 								(Node *) ((RelabelType *) clause)->arg,
 								varRelid,
 								jointype,
-								sjinfo);
+								sjinfo, isdefault);
 	}
 	else if (IsA(clause, CoerceToDomain))
 	{
@@ -825,7 +854,7 @@ clause_selectivity(PlannerInfo *root,
 								(Node *) ((CoerceToDomain *) clause)->arg,
 								varRelid,
 								jointype,
-								sjinfo);
+								sjinfo, isdefault);
 	}
 	else
 	{
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..fa43722a7b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4241,6 +4241,7 @@ approx_tuple_count(PlannerInfo *root, JoinPath *path, List *quals)
 	SpecialJoinInfo sjinfo;
 	Selectivity selec = 1.0;
 	ListCell   *l;
+	bool   isdefault;
 
 	/*
 	 * Make up a SpecialJoinInfo for JOIN_INNER semantics.
@@ -4265,7 +4266,7 @@ approx_tuple_count(PlannerInfo *root, JoinPath *path, List *quals)
 		Node	   *qual = (Node *) lfirst(l);
 
 		/* Note that clause_selectivity will be able to cache its result */
-		selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo);
+		selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo, &isdefault);
 	}
 
 	/* Apply it to the input relation sizes */
diff --git a/src/backend/optimizer/util/orclauses.c b/src/backend/optimizer/util/orclauses.c
index 1e78028abe..c92eeae824 100644
--- a/src/backend/optimizer/util/orclauses.c
+++ b/src/backend/optimizer/util/orclauses.c
@@ -260,7 +260,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
 	RestrictInfo *or_rinfo;
 	Selectivity or_selec,
 				orig_selec;
-
+	bool  isdefault;
 	/*
 	 * Build a RestrictInfo from the new OR clause.  We can assume it's valid
 	 * as a base restriction clause.
@@ -280,7 +280,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
 	 * saving work later.)
 	 */
 	or_selec = clause_selectivity(root, (Node *) or_rinfo,
-								  0, JOIN_INNER, NULL);
+								  0, JOIN_INNER, NULL, &isdefault);
 
 	/*
 	 * The clause is only worth adding to the query if it rejects a useful
@@ -344,7 +344,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Compute inner-join size */
 		orig_selec = clause_selectivity(root, (Node *) join_or_rinfo,
-										0, JOIN_INNER, &sjinfo);
+										0, JOIN_INNER, &sjinfo, &isdefault);
 
 		/* And hack cached selectivity so join size remains the same */
 		join_or_rinfo->norm_selec = orig_selec / or_selec;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index a570ac0aab..cb7595e69b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1740,7 +1740,8 @@ restriction_selectivity(PlannerInfo *root,
 						Oid operatorid,
 						List *args,
 						Oid inputcollid,
-						int varRelid)
+						int varRelid,
+						bool *isdefault)
 {
 	RegProcedure oprrest = get_oprrest(operatorid);
 	float8		result;
@@ -1752,12 +1753,13 @@ restriction_selectivity(PlannerInfo *root,
 	if (!oprrest)
 		return (Selectivity) 0.5;
 
-	result = DatumGetFloat8(OidFunctionCall4Coll(oprrest,
+	result = DatumGetFloat8(OidFunctionCall5Coll(oprrest,
 												 inputcollid,
 												 PointerGetDatum(root),
 												 ObjectIdGetDatum(operatorid),
 												 PointerGetDatum(args),
-												 Int32GetDatum(varRelid)));
+												 Int32GetDatum(varRelid),
+												 PointerGetDatum(isdefault)));
 
 	if (result < 0.0 || result > 1.0)
 		elog(ERROR, "invalid restriction selectivity: %f", result);
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 140783cfb3..97d672a14f 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -951,6 +951,7 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
 	MVDependencies *dependencies;
 	AttrNumber *list_attnums;
 	int			listidx;
+	bool  isdefault;
 
 	/* initialize output argument */
 	*estimatedclauses = NULL;
@@ -1065,7 +1066,7 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
 				clause = (Node *) lfirst(l);
 
 				s2 = clause_selectivity(root, clause, varRelid, jointype,
-										sjinfo);
+										sjinfo, &isdefault);
 
 				/* mark this one as done, so we don't touch it again. */
 				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ffca0fe5bb..797ce8e42b 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -570,7 +570,8 @@ neqsel(PG_FUNCTION_ARGS)
  */
 static double
 scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
-			  VariableStatData *vardata, Datum constval, Oid consttype)
+			  VariableStatData *vardata, Datum constval, Oid consttype,
+			  bool *isdefault)
 {
 	Form_pg_statistic stats;
 	FmgrInfo	opproc;
@@ -579,8 +580,11 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
 				sumcommon;
 	double		selec;
 
+	*isdefault = false;
+
 	if (!HeapTupleIsValid(vardata->statsTuple))
 	{
+		*isdefault = true;
 		/* no stats available, so default result */
 		return DEFAULT_INEQ_SEL;
 	}
@@ -1097,6 +1101,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 	Oid			operator = PG_GETARG_OID(1);
 	List	   *args = (List *) PG_GETARG_POINTER(2);
 	int			varRelid = PG_GETARG_INT32(3);
+	bool       *isdefault = (bool *) PG_GETARG_POINTER(4);
 	VariableStatData vardata;
 	Node	   *other;
 	bool		varonleft;
@@ -1110,13 +1115,17 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 	 */
 	if (!get_restriction_variable(root, args, varRelid,
 								  &vardata, &other, &varonleft))
+	{
+		*isdefault = true;
 		PG_RETURN_FLOAT8(DEFAULT_INEQ_SEL);
+	}
 
 	/*
 	 * Can't do anything useful if the something is not a constant, either.
 	 */
 	if (!IsA(other, Const))
 	{
+		*isdefault = true;
 		ReleaseVariableStats(vardata);
 		PG_RETURN_FLOAT8(DEFAULT_INEQ_SEL);
 	}
@@ -1141,6 +1150,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 		operator = get_commutator(operator);
 		if (!operator)
 		{
+			*isdefault = true;
 			/* Use default selectivity (should we raise an error instead?) */
 			ReleaseVariableStats(vardata);
 			PG_RETURN_FLOAT8(DEFAULT_INEQ_SEL);
@@ -1150,7 +1160,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 
 	/* The rest of the work is done by scalarineqsel(). */
 	selec = scalarineqsel(root, operator, isgt, iseq,
-						  &vardata, constval, consttype);
+						  &vardata, constval, consttype, isdefault);
 
 	ReleaseVariableStats(vardata);
 
@@ -1603,6 +1613,7 @@ booltestsel(PlannerInfo *root, BoolTestType booltesttype, Node *arg,
 {
 	VariableStatData vardata;
 	double		selec;
+	bool   isdefault;
 
 	examine_variable(root, arg, varRelid, &vardata);
 
@@ -1727,14 +1738,14 @@ booltestsel(PlannerInfo *root, BoolTestType booltesttype, Node *arg,
 			case IS_TRUE:
 			case IS_NOT_FALSE:
 				selec = (double) clause_selectivity(root, arg,
-													varRelid,
-													jointype, sjinfo);
+													varRelid, jointype,
+													sjinfo, &isdefault);
 				break;
 			case IS_FALSE:
 			case IS_NOT_TRUE:
 				selec = 1.0 - (double) clause_selectivity(root, arg,
-														  varRelid,
-														  jointype, sjinfo);
+														  varRelid, jointype,
+														  sjinfo, &isdefault);
 				break;
 			default:
 				elog(ERROR, "unrecognized booltesttype: %d",
@@ -2231,6 +2242,7 @@ rowcomparesel(PlannerInfo *root,
 	Oid			inputcollid = linitial_oid(clause->inputcollids);
 	List	   *opargs;
 	bool		is_join_clause;
+	bool        isdefault;
 
 	/* Build equivalent arg list for single operator */
 	opargs = list_make2(linitial(clause->largs), linitial(clause->rargs));
@@ -2279,7 +2291,7 @@ rowcomparesel(PlannerInfo *root,
 		s1 = restriction_selectivity(root, opno,
 									 opargs,
 									 inputcollid,
-									 varRelid);
+									 varRelid, &isdefault);
 	}
 
 	return s1;
@@ -3035,6 +3047,7 @@ mergejoinscansel(PlannerInfo *root, Node *clause,
 				rightmin,
 				rightmax;
 	double		selec;
+	bool        isdefault;
 
 	/* Set default results if we can't figure anything out. */
 	/* XXX should default "start" fraction be a bit more than 0? */
@@ -3202,14 +3215,14 @@ mergejoinscansel(PlannerInfo *root, Node *clause,
 	 * non-default estimates, else stick with our 1.0.
 	 */
 	selec = scalarineqsel(root, leop, isgt, true, &leftvar,
-						  rightmax, op_righttype);
-	if (selec != DEFAULT_INEQ_SEL)
+						  rightmax, op_righttype, &isdefault);
+	if (!isdefault)
 		*leftend = selec;
 
 	/* And similarly for the right variable. */
 	selec = scalarineqsel(root, revleop, isgt, true, &rightvar,
-						  leftmax, op_lefttype);
-	if (selec != DEFAULT_INEQ_SEL)
+						  leftmax, op_lefttype, &isdefault);
+	if (!isdefault)
 		*rightend = selec;
 
 	/*
@@ -3232,14 +3245,14 @@ mergejoinscansel(PlannerInfo *root, Node *clause,
 	 * our own default.
 	 */
 	selec = scalarineqsel(root, ltop, isgt, false, &leftvar,
-						  rightmin, op_righttype);
-	if (selec != DEFAULT_INEQ_SEL)
+						  rightmin, op_righttype, &isdefault);
+	if (!isdefault)
 		*leftstart = selec;
 
 	/* And similarly for the right variable. */
 	selec = scalarineqsel(root, revltop, isgt, false, &rightvar,
-						  leftmin, op_lefttype);
-	if (selec != DEFAULT_INEQ_SEL)
+						  leftmin, op_lefttype, &isdefault);
+	if (!isdefault)
 		*rightstart = selec;
 
 	/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..19382b8a28 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -214,7 +214,8 @@ extern Selectivity clause_selectivity(PlannerInfo *root,
 				   Node *clause,
 				   int varRelid,
 				   JoinType jointype,
-				   SpecialJoinInfo *sjinfo);
+				   SpecialJoinInfo *sjinfo,
+				   bool *isdefault);
 extern void cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 				  RelOptInfo *rel, ParamPathInfo *param_info,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index 7d53cbbb87..2cb077e6ff 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -46,7 +46,8 @@ extern Selectivity restriction_selectivity(PlannerInfo *root,
 						Oid operatorid,
 						List *args,
 						Oid inputcollid,
-						int varRelid);
+						int varRelid,
+						bool *isdefault);
 
 extern Selectivity join_selectivity(PlannerInfo *root,
 				 Oid operatorid,
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yuzuko Hosoya (#3)
Re: Improve selectivity estimate for range queries

"Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:

From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>

To handle such cases I've thought up an idea based on a previous
commit[1] which solved a similar problem related to
DEFAULT_NUM_DISTINCT. Just like that modification, I added a flag
which shows whether the selectivity

The commit [1] added almost no complexity, but this does. It affects
many functions, introducing tighter coupling between operator-selectivity
functions and clause-selectivity functions. isdefault travels alone
too long just to bind remote functions. We might need a more principled
way to handle that.

Yeah, I don't think that this approach is a reasonable tradeoff to
eliminate a narrow case. In particular, what's been done here to
clause_selectivity is totally unprincipled and poorly thought out:
you can't add an output parameter that's set in only some cases.
Yet, there's no good way to decide how to set it in many cases.
For instance, what would you do for an AND or OR where some of
the branches are default estimates and some not?

Around [1], there was a discussion about introducing the notion of
uncertainty to selectivity. The isdefault flag is a kind of uncertainty
value indicating '0/100% uncertain'. So my feeling is that it would be
better for Selectivity to carry a certainty component, then build a
selectivity arithmetic involving uncertainty, though I don't have a
clear picture.

I see. Indeed, if we changed Selectivity so that it includes
information about default/non-default, the problems I mentioned
would be solved. But I'm afraid that this modification would have
a big impact, since there are lots of functions that calculate
selectivity. I also don't have a concrete idea yet, so I will think
about it. Besides that, I'd like to ask the community whether such
a modification would be acceptable or not.

The planner definitely has a lot of problems that could be addressed
with some sort of uncertainty framework ... but it'd be a huge
undertaking, which is why nobody's tried yet.

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such. It might
seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

regards, tom lane

#5Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Tom Lane (#4)
Re: Improve selectivity estimate for range queries

Hello.

At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>

"Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:

From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>

To handle such cases I've thought up of an idea based on a previous
commit[1] which solved a similar problem related to
DEFAULT_NUM_DISTINCT. Just like this modification, I added a flag
which shows whether the selectivity

The commit [1] added almost no complexity, but this does. Affects many
functions to introduce tighter coupling between operator-selectivity
functions and clause selectivity functions. isdefault travells alone
too long just to bind remote functions. We might need more pricipled
way to handle that.

Yeah, I don't think that this approach is a reasonable tradeoff to
eliminate a narrow case. In particular, what's been done here to
clause_selectivity is totally unprincipled and poorly thought out:
you can't add an output parameter that's set in only some cases.
Yet, there's no good way to decide how to set it in many cases.
For instance, what would you do for an AND or OR where some of
the branches are default estimates and some not?

Around the [1], there was a discussion about introducing the notion of
uncertainty to selectivity. The isdefualt is a kind of uncertainty
value indicating '0/100% uncertain'. So my feeling is saying that
it's better that Selectivity has certinty component then building a
selectivity arithmetics involving uncertainty, though I don't have a
clear picture.

I see. Indeed if we would change Selectivity so that it includes
information about default/non-default, such problems I mentioned
would be solved. But I'm afraid that this modification would lead
to a big impact, since there are lots of functions that calculate
selectivity. I also don't have a concreate idea, so I will think
about it. Besides that, I'd like to ask community whether such
modification would be acceptable or not.

The planner definitely has a lot of problems that could be addressed
with some sort of uncertainty framework ... but it'd be a huge
undertaking, which is why nobody's tried yet.

Sure.

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such. It might

Sounds reasonable.

seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

FWIW, I got the following result on my environment. It seems
different enough if this holds on all supported platforms, though
there is still a case where the result of a sequence of
arithmetic operations makes a false match.

x = 0.33333333333333331
d = 1.000000e+30 match
d = 1.000000e+31: 0.000000 match
x = 0.33333333333334197
d = 0.000000e+00 match
d = 1.000000e+00: 0.000000 match

==== t.c
#include <stdio.h>
#include <math.h>

int test(double x)
{
	double d = 1.0;
	double d0 = 0;

	fprintf(stderr, "x = %19.17f\n", x);
	while (d != d0)
	{
		double z = floor(d * 3);
		double z1 = z + 1.0;
		double y = d / z;
		double y1 = d / z1;

		/* Check that neither side of d * 3 makes a match */
		if (y != x && y1 != x)
		{
			fprintf(stderr, " d = %e match\n", d0);
			fprintf(stderr, " d = %e: %f match\n", d, y);
			return 0;
		}

		d0 = d;
		d = d * 10;
	}
	return 1;
}

int main(void)
{
	test(0.3333333333333333);
	test(0.333333333333342);

	return 0;
}
====

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#5)
Re: Improve selectivity estimate for range queries

Mmm.

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>

FWIW, I got the following result on my environment. It seems
different enough if this holds on all supported platforms, though
there still is a case where the result of a sequence of
arithmetics makes false match.

x = 0.33333333333333331
d = 1.000000e+30 match
d = 1.000000e+31: 0.000000 match

Of course the "match" in the last line above is a mistake of "NOT
match".

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7Kyotaro HORIGUCHI
horiguchi.kyotaro@oss.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#6)
Re: Improve selectivity estimate for range queries

Sigh.. In the first place, the loop should not stop until the maximum
exponent.
Sorry, I don't have time now..

On Tue, Jan 8, 2019 at 16:43 Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Mmm.

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro
HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <
20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>

FWIW, I got the following result on my environment. It seems
different enough if this holds on all supported platforms, though
there still is a case where the result of a sequence of
arithmetics makes false match.

x = 0.33333333333333331
d = 1.000000e+30 match
d = 1.000000e+31: 0.000000 match

Of course the "match" in the last line above is a mistake of "NOT
match".

--


Kyotaro Horiguchi
NTT Open Source Software Center

#8Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#5)
2 attachment(s)
Re: Improve selectivity estimate for range queries

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>

Hello.

At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>

seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

I think we don't need a perfect proof for that. The fact that
exactly 1/3 is quite natural and common but 1/3 + ε is not would
be enough.

FWIW, I got the following result on my environment. It seems
different enough if this holds on all supported platforms, though
there still is a case where the result of a sequence of
arithmetics makes false match.

Simple selectivity of a relation theoretically cannot match the
value with the epsilon. (Of course, on *my* environment.)

(0.333..)
binary format: 3f d5 55 55 55 55 55 55
x = 0.333333333333333315
231 matches, 79 no_matches

(0.3{13}42..)
binary format: 3f d5 55 55 55 55 55 f1
x = 0.333333333333341975
0 matches, 310 no_matches

(0.3{15}42..)
binary format: 3f d5 55 55 55 55 55 57
x = 0.333333333333333426
0 matches, 310 no_matches

It seems that 0.3{13}42 should rather be 0.3{15}42, which differs
from 1/3 by just two LSBs. I believe C is well standardized on
this literal-to-double conversion. The other DEFAULT_*_SELs are not
compared in this way.

The attached small patch fixes the case mentioned in this thread,
but I'm not sure where to put a test. A static assertion is not
usable. An assertion somewhere, perhaps in clauselist_selectivity,
seems somewhat overdone.. I can't find a suitable place in the
regression tests..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

t.c (text/plain; charset=us-ascii)
avoid_false_match_with_default_ineq_sel.patch (text/x-patch; charset=us-ascii)
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 5cc4cf15e2..15a8d2402a 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -33,8 +33,12 @@
 /* default selectivity estimate for equalities such as "A = b" */
 #define DEFAULT_EQ_SEL	0.005
 
-/* default selectivity estimate for inequalities such as "A < b" */
-#define DEFAULT_INEQ_SEL  0.3333333333333333
+/*
+ * default selectivity estimate for inequalities such as "A < b"
+ * The last two digits prevent it from making a false match with 1/3 computed
+ * from histogram and/or MCV.
+ */
+#define DEFAULT_INEQ_SEL  0.33333333333333342
 
 /* default selectivity estimate for range inequalities "A > b AND A < c" */
 #define DEFAULT_RANGE_INEQ_SEL	0.005
#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: Improve selectivity estimate for range queries

On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such. It might
seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

That's not a dumb idea, but it seems pretty unprincipled to me, and to
be honest I'm kind of surprised that you're not proposing something
cleaner. If you admit the need to distinguish between real estimates
and last-ditch ones, why shouldn't we have an explicit representation
of that rather than trying to pick a last-ditch value that is unlikely
to be confused with a real one?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: Improve selectivity estimate for range queries

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such.

That's not a dumb idea, but it seems pretty unprincipled to me, and to
be honest I'm kind of surprised that you're not proposing something
cleaner.

The problem is the invasiveness of such a change (large) vs the benefit
(not so large). The upthread patch attempted to add a separate signaling
path, but it was very incomplete --- and yet both I and Horiguchi-san
thought it was already too messy.

Maybe at some point we'll go over to something reasonably principled,
like adding confidence intervals to all selectivity estimates. That
would be *really* invasive but perhaps would bring enough benefit to
justify itself. But the current patch is just attempting to fix one
extremely narrow pain point, and if that is all it's doing it should
have a commensurately small footprint. So I don't think the submitted
patch looks good from a cost/benefit standpoint.

regards, tom lane

#11Yuzuko Hosoya
hosoya.yuzuko@lab.ntt.co.jp
In reply to: Tom Lane (#10)
RE: Improve selectivity estimate for range queries

Hi,

Thanks for the comments, and I'm sorry for the late reply.

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, January 11, 2019 7:04 AM

Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such.

That's not a dumb idea, but it seems pretty unprincipled to me, and to
be honest I'm kind of surprised that you're not proposing something
cleaner.

The problem is the invasiveness of such a change (large) vs the benefit
(not so large). The upthread patch attempted to add a separate signaling
path, but it was very incomplete --- and yet both I and Horiguchi-san
thought it was already too messy.

Maybe at some point we'll go over to something reasonably principled,
like adding confidence intervals to all selectivity estimates. That
would be *really* invasive but perhaps would bring enough benefit to
justify itself. But the current patch is just attempting to fix one
extremely narrow pain point, and if that is all it's doing it should
have a commensurately small footprint. So I don't think the submitted
patch looks good from a cost/benefit standpoint.

Yes, I agree with you. Indeed, the patch I attached is insufficient in
cost-effectiveness. However, I want to solve the problem of real estimates
happening to equal the default values, as in this case, even though it's
a narrow pain point. So I tried distinguishing explicitly between real
estimates and the rest, as Robert said.

The idea Tom proposed and Horiguchi-san tried seems reasonable, but I'm
concerned whether range queries really cannot match 0.333333333333342
(or some such) by accident in any environment. Is what Horiguchi-san
did enough to prove that?

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#12Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Yuzuko Hosoya (#11)
Re: Improve selectivity estimate for range queries

Hi.

At Fri, 11 Jan 2019 11:36:47 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in <000001d4a956$806a2ab0$813e8010$@lab.ntt.co.jp>


Yes, I agree with you. Indeed, the patch I attached is insufficient in
cost-effectiveness. However, I want to solve the problem of real estimates
happening to equal the default values, as in this case, even though it's
a narrow pain point. So I tried distinguishing explicitly between real
estimates and the rest, as Robert said.

The idea Tom proposed and Horiguchi-san tried seems reasonable, but I'm
concerned whether range queries really cannot match 0.333333333333342
(or some such) by accident in any environment. Is what Horiguchi-san
did enough to prove that?

It cannot be perfect in principle, but I think it is reliable
enough. This is not principled at all, but it is very effective
relative to its complexity, I think.

Actually, i/(i*3+1) for some 10^n's are:

1/ 4:binary format: 3f d0 00 00 00 00 00 00
10/ 31:binary format: 3f d4 a5 29 4a 52 94 a5
100/ 301:binary format: 3f d5 43 30 7a 78 c5 51
1000/ 3001:binary format: 3f d5 53 83 74 70 f1 95
10000/ 30001:binary format: 3f d5 55 26 bb 44 2b a8
100000/ 300001:binary format: 3f d5 55 50 ac 4a 74 6d
1000000/ 3000001:binary format: 3f d5 55 54 de 07 5a 96
10000000/ 30000001:binary format: 3f d5 55 55 49 67 22 6d
100000000/ 300000001:binary format: 3f d5 55 55 54 23 e9 d7
1000000000/ 3000000001:binary format: 3f d5 55 55 55 36 ca 95
10000000000/ 30000000001:binary format: 3f d5 55 55 55 52 47 75
100000000000/ 300000000001:binary format: 3f d5 55 55 55 55 07 25
1000000000000/ 3000000000001:binary format: 3f d5 55 55 55 55 4d 84
10000000000000/ 30000000000001:binary format: 3f d5 55 55 55 55 54 8d
100000000000000/ 300000000000001:binary format: 3f d5 55 55 55 55 55 41
1000000000000000/ 3000000000000001:binary format: 3f d5 55 55 55 55 55 53

So, (as Tom said upthread) intuitively we can expect that we are
safe up to roughly 10^10? for the simple cases.

# Of course involving histogram makes things complex, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center